Laura Hannemose Rieger a, Max Wilson b, Tejs Vegge a and Eibar Flores *c
aDepartment of Energy Conversion and Storage, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
bDepartment of Applied Mathematics and Computer Science, Technical University of Denmark, Kgs. Lyngby, Denmark
cSustainable Energy Technology, SINTEF Industry, Sem Sælands Vei 12, Trondheim, 7034, Norway. E-mail: eibar.flores.cedeno@sintef.no
First published on 10th November 2023
Analysing spectra from experimental characterization of materials is time consuming, susceptible to distortions in the data, requires specific domain knowledge, and may be biased by the general heuristics used in human analysis. Recent work has shown the potential of using neural networks to solve this task, and assist spectral interpretation with automated and unbiased analysis on-the-fly. However, the black-box nature of most neural networks poses challenges when interpreting which patterns from the data are being used to make predictions. Understanding how neural networks learn is critical to assess their accuracy on unseen data, justify critical decision-making based on predictions, and potentially unravel meaningful scientific insights. We present a 1D neural network to classify infrared spectra from small organic molecules according to their functional groups. Our model is within range of state-of-the-art performance while being significantly less complex than previously used networks reported in the literature. A smaller network reduces the risk of overfitting and enables exploring what the model has learned about the patterns in the spectra that relate to molecular structure and composition. With a novel two-step approach for explaining the neural network's classification process, our findings not only demonstrate that the model learns the characteristic group frequencies of functional groups, but also suggest it uses non-intuitive patterns such as tails and overtones when classifying spectra.
Virtually all spectra share the same data structure: an array of intensity counts (e.g., photons, electrons), and its corresponding array of indexes, i.e., the scanning variable such as absorption energy, scattering angle, or wavelength. Spectroscopic data differ from other more traditional data sources by being high-dimensional (each spectrum typically has hundreds to thousands of data points) and exhibiting both local patterns (peaks and troughs) and non-local patterns (multiple peaks belonging to the same compound or chemical environment), as shown in Fig. 1. Both types of patterns are needed to map a spectrum to a sample's properties.
Traditionally, spectral analysis involves comparing peaks and patterns to experimental databases,5 to simulations,6 or by visual inspection following heuristic rules.7 These methods can be time consuming, require specific domain knowledge, and are susceptible to spectral artifacts (Fig. 1) from, e.g., noise, outliers and baseline drift. Multiple correction algorithms have been proposed for spectral analysis.8,9 However, these traditional preprocessing techniques involve manual tuning of parameters and so are also susceptible to human biases. More importantly, even if the spectra can be successfully preprocessed, it is still highly challenging to establish spectrum–property relationships in the absence of accurate simulations and reference compounds. In the hope of addressing some of these issues, machine learning (ML) methods have been applied to the problem of spectra analysis. These algorithms can learn to directly map spectral patterns to material properties from a preexisting dataset.10–12 ML methods establish a statistical model between a large number of spectra and a property of interest, and potentially bypass the need for physics-based simulations and curated databases. Classification algorithms learn to assign spectra to one of several classes (e.g., species of bacteria13), while regression algorithms map spectra to a numerical property (e.g., the concentration of a specific chemical or compound within a mixture14). They can be susceptible, in the same way, to distortions or biases in the dataset. Addressing these artifacts manually becomes intractable given the large number of spectra needed to train an ML algorithm; instead, artifacts are mitigated in a high-throughput fashion with a variety of preprocessing routines, such as noise reduction15 and baseline correction.16–18
Convolutional neural networks (CNN) are a well-established type of neural network for recognizing patterns in images.19,20 CNNs can overcome the need for hand-engineering the data before training, since they learn low-dimensional patterns directly from the high-dimensional input. The low-dimensional patterns are hierarchically combined into increasingly complex patterns. For instance, segments are combined into lines, and subsequently combined into shapes. In images, the shapes could be tails, ears and eyes to distinguish images of cats from dogs, or circles, arcs and triangles to classify images of hand-written digits.21–23 The hierarchical pattern recognition approach that CNNs leverage to classify images shows clear similarities to how experts infer properties from spectra. Spectroscopists typically search for lines that form baselines and peaks, and in turn for peaks that form shoulders, doublets, sextets, etc.; all while noting the position of these patterns along an energy axis. Naturally, such expert-based analysis is very time consuming, and constitutes a bottleneck in the data analysis from new high-throughput spectroscopic analysis techniques. It is no surprise that CNNs have been increasingly applied to the problem of spectra analysis24–30 and shown to perform comparatively better than other ML methods.31–36
CNNs share with other neural network methods a critical disadvantage: they are difficult to interpret. Given the dense connectedness of the network architecture, it becomes increasingly challenging to investigate the relations between inputs and outputs that lead to successful predictions. Understanding how the network learns is just as important as its accuracy, for three main reasons. One is assessing how well the network performs on unseen data. As with any ML model, the predictive power of a CNN might suffer when the input data is significantly different from the data used for training. CNN models might confuse spurious spectral artifacts for meaningful patterns. Therefore, ensuring that the model is learning patterns that carry physical information provides further guarantee on its ability to generalize. Secondly, interpretability enables delivering meaningful scientific outcomes. Using NNs for scientific discovery requires understanding how the learned patterns build upon existing scientific principles.37 The third reason pertains to justifying critical decision-making. When a network predicts an outcome that calls for a crucial decision, e.g., analysis and medical diagnosis of cancer specimens using spectroscopic imaging,38,39 data scientists must provide a satisfactory interpretation to justify a course of action. Such interpretations have previously unraveled unexpected and undesirable behavior of deep learning models. For instance, models can choose shortcuts to make predictions, e.g. using hospital metadata markers in lung X-ray images instead of lung features to detect COVID-19.40,41
In this study, we use a convolutional neural network (CNN) to classify spectra, with a focus on interpreting the types of patterns the network learns for making predictions. We use an openly available dataset of infrared spectra from small molecules and build a small CNN to classify the presence or absence of functional groups from spectra. Our CNN accuracy is within the range of the state-of-the-art while being comparatively less complex as measured by the number of parameters. Being smaller, our model is easier to interpret and more robust against overfitting. It is widely accepted in machine learning research that while simple models are not immune to overfitting, complex models are more prone to overfit the training data.42
This work builds upon previous research by Judge et al.,43 who used sensitivity factors computed from a trained multi-layered perceptron (MLP) to model the relation between a spectrum and the presence of a functional group. These sensitivity factors, interpreted as saliency profiles, highlighted regions of the spectra that were most relevant for predicting the presence of a functional group.43 Fine et al.44 implemented an autoencoder network followed by several fully connected layers, and used guided backpropagation to identify the regions of the spectra used to classify each functional group.45 Enders et al.46 used a 2D CNN to classify images of infrared and mass spectra, without assessing which patterns result in more confident predictions. Instead, in our work we focus not only on predictive performance but also on the robustness of the prediction along with its interpretability. Similar to the concept bottleneck models introduced in Koh et al.,47 our model features a ‘bottleneck’ where human-understandable features are converted into the final output classification. In contrast to their work, our model does not require the relevant features to be known beforehand or all training inputs to be annotated with all features, since concept or prototype features are learnt automatically in our model. In this way the model might autonomously discover important features, as these do not need to be known beforehand. Our approach of designing an inherently interpretable model via a bottleneck through which information must flow circumvents the known problems with the reliability of post-hoc attribution methods.48,49 To our knowledge, this represents a novel method to explain the classifications of convolutional neural networks.
We implemented a simple 1-D CNN50 with convolutional layers followed by a single fully connected layer. Our contribution here is two-fold. Our stripped-down CNN architecture predicts functional groups with state-of-the-art performance. Due to the small architecture, we can further examine what features were relevant for the prediction. We show that patterns learnt by our model are aligned with the patterns experts use to assign spectra by visual inspection. In addition, we demonstrate that our approach uncovers new and non-intuitive patterns learnt by the neural network.
To obtain the target functional groups, we decoded the presence of functional groups from the InChI strings using a structural matching algorithm, fed with the substructural patterns expressed in SMILES arbitrary target specification syntax (SMARTS).52 Each functional group is encoded as a SMARTS pattern (see ESI Table S1†), which is used to search whether an InChI string encodes a particular functional group or not. Hence, for each InChI string we searched for 17 different SMARTS patterns, and encoded the results as a 17-element vector of “1”/“0”, representing the presence/absence of each functional group in the InChI string. The algorithm matches a single molecule with multiple functional groups if they are present in the molecule; see the example of ethanol in Fig. S3.† This curated data pool was then shuffled and split randomly into 5687 (70%), 1218 (15%) and 1220 (15%) samples used for training, validation and testing, respectively. Spectra are normalised relative to their own maxima to lie between 0 and 1.
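A minimal sketch of the shuffling, per-spectrum normalisation and 70/15/15 split described above (NumPy; the seed and array names are illustrative, not those of the original code):

```python
import numpy as np

def split_and_normalise(spectra, labels, seed=0):
    """Shuffle, normalise each spectrum by its own maximum, and split 70/15/15.

    `spectra` is an [nSamples, nPoints] array of intensities; `labels` is an
    [nSamples, 17] multi-hot array of functional-group indicators.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(spectra))
    spectra, labels = spectra[order], labels[order]

    # Normalise each spectrum relative to its own maximum (range 0..1).
    spectra = spectra / spectra.max(axis=1, keepdims=True)

    n = len(spectra)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return (
        (spectra[:n_train], labels[:n_train]),
        (spectra[n_train:n_train + n_val], labels[n_train:n_train + n_val]),
        (spectra[n_train + n_val:], labels[n_train + n_val:]),
    )
```

With 8125 curated spectra this split reproduces the 5687/1218/1220 sample counts quoted above.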
For an extensive explanation of CNNs and common practices we refer to Goodfellow et al.53 Our model can conceptually be described as a feature extractor followed by a classifier, as shown in Fig. 2. The feature extractor consists of several 1-D convolutional layers that capture local and global patterns in a spectrum. Early convolutional layers learn simple patterns that are combined to recognise complex patterns in the subsequent layers. We use an architecture composed of convolutional layers with ReLU non-linearities and batch normalization54 interspersed with MaxPooling operations.21 The feature extractor outputs a latent matrix, i.e. a condensed representation of a spectrum, as learned during the training process. Each row in the matrix, a channel, models a specific spectral pattern (e.g. a peak). Each column in the matrix, a feature vector, attends to a specific spectral region; such a region is called a receptive field. Stacking convolutional filters and MaxPool layers sequentially effectively expands the size of the receptive field that each feature vector receives input from. The latent matrix is then flattened into a 1D vector by a Flatten layer and fed to a classifier. The classifier consists of a single fully connected layer with sigmoid activation, which maps the flattened latent matrix into a format suitable for classifying the presence of functional groups. The architecture of the CNN can be summarized in the following notation‡:
Conv(1, 4, 3, 1), ReLU, BatchNorm, Conv(4, 4, 3, 1), ReLU, BatchNorm, MaxPool(2), Conv(4, 8, 3, 1), ReLU, BatchNorm, Conv(8, 8, 3, 1), ReLU, MaxPool(2), BatchNorm, Flatten, Dense(1672, 17).
For the convolutional layers Conv(i, o, k, s), the values i, o, k, s represent the number of input channels, output channels, kernel size and stride, respectively. For the MaxPool layers, the number indicates both the kernel size and the stride. For the dense layer, the first and second numbers indicate the number of inputs and outputs, respectively. Zero padding is used for all layer types where applicable. The number of outputs of the dense layer is defined by the number of functional groups to be classified. The number of inputs to the dense layer is given by the number of channels, 8, times the length of the convolutional output, 209. Each receptive field is 16 elements wide with a stride of 4 elements, such that two adjacent receptive fields overlap by 12 elements. For a detailed explanation of how to compute the size of the receptive field of convolutional neural networks we refer to Araujo et al.55
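The 16-element, stride-4 receptive field quoted above can be verified with the standard recurrence (cf. Araujo et al.); ReLU and BatchNorm layers are pointwise and do not affect the receptive field:

```python
def receptive_field(layers):
    """Receptive field size and stride of stacked 1-D layers.

    `layers` is a list of (kernel_size, stride) pairs, earliest layer first.
    Follows the recurrence r <- r + (k - 1) * j, j <- j * s, where j is the
    cumulative stride ("jump") of the layers seen so far.
    """
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j
        j *= s
    return r, j

# Architecture above: Conv(k=3,s=1) x2, MaxPool(2), Conv(k=3,s=1) x2, MaxPool(2)
arch = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]
size, stride = receptive_field(arch)
# size = 16, stride = 4: adjacent receptive fields overlap by 12 elements.
```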
Training the CNN involves updating its weights to improve its accuracy at detecting the presence of functional groups in the spectrum. In such a training scheme, gradients are computed for a batch at each update step by back-propagating the sum of the derivatives of the binary cross-entropy loss function as in eqn (1), applied to the prediction ŷi,j of the output node for functional group j of sample i, in a way equivalent to logistic regression. The loss is computed using the binary cross-entropy function:
$$\ell(y_{i,j}, \hat{y}_{i,j}) = -\left[\, y_{i,j}\log \hat{y}_{i,j} + (1 - y_{i,j})\log(1 - \hat{y}_{i,j}) \,\right] \qquad (1)$$
Some functional groups are only rarely present in the dataset, e.g., only 1.1% of the spectra originate from amide-containing molecules. As a result, the loss from the classes should be balanced during training such that the weights receive equal update signals from spectra with and without the functional group, i.e. spectra from the rarer group should be weighted more heavily. The proportion of class j is $p_j = \frac{1}{N}\sum_{i=1}^{N} y_{i,j}$, where N is the size of the training set, and yi,j is the class label for class j of sample i. We rebalance the loss between the present/absent labels of each functional group by introducing a loss reweighting factor that reduces the loss impact of the overrepresented label. Overall, the loss reweighting factor becomes
$$w_{i,j} = \frac{y_{i,j}}{2p_j} + \frac{1 - y_{i,j}}{2(1 - p_j)} \qquad (2)$$
such that the total training loss over the dataset becomes

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{17} w_{i,j}\,\ell(y_{i,j}, \hat{y}_{i,j}) \qquad (3)$$
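A minimal NumPy sketch of this class-rebalanced loss, using the reweighting factor of eqn (2); the exact constants used in training may differ, and the sketch assumes every functional group has at least one positive and one negative sample in the batch:

```python
import numpy as np

def reweighted_bce(y_true, y_pred, eps=1e-12):
    """Class-rebalanced binary cross-entropy, one term per functional group.

    For each group j, present/absent samples are weighted by 1/(2*p_j) and
    1/(2*(1-p_j)) respectively, where p_j is the proportion of positives,
    so both label values contribute equal total weight.
    """
    p = y_true.mean(axis=0)                              # p_j per group
    w = y_true / (2 * p) + (1 - y_true) / (2 * (1 - p))  # w_{i,j}
    bce = -(y_true * np.log(y_pred + eps)
            + (1 - y_true) * np.log(1 - y_pred + eps))
    return (w * bce).mean()
```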
The model training was stopped after the validation loss had not improved for five consecutive epochs. Gradient updates were computed with the Adam optimization algorithm.56
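The stopping criterion can be sketched as a small helper with a patience of five epochs, as above (a sketch; the original implementation may track the best model differently):

```python
class EarlyStopper:
    """Stop training once the validation loss has not improved for
    `patience` consecutive epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```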
The overall accuracy is the proportion of correct predictions over all predictions made, expressed in terms of true/false positives (TP/FP) and true/false negatives (TN/FN):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (4)$$
If the dataset is imbalanced, the overall accuracy may not give a complete description of the model performance. The model might perform well for some functional groups and poorly for others, as seen in Table 1. Computing additional metrics therefore provides further context on how the model performs. Precision quantifies the proportion of positive predictions that were correct:
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (5)$$
Recall quantifies the proportion of positive labels that were correctly predicted,
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (6)$$
The F-1 score is the harmonic mean of the precision and recall,
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (7)$$
The AUC (area under the curve) is calculated as the area under the ROC (receiver operating characteristic) curve. The ROC curve plots the true positive rate against the false positive rate as the classification threshold varies from 0 to 1. A model predicting the wrong result every time obtains an AUC score of 0; a model predicting the right result every time obtains an AUC score of 1. The AUC score is often used to indicate performance on imbalanced datasets, as it does not depend on the balance between the two classes.
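As a minimal sketch, the metrics of eqns (5)–(7) and the AUC can be computed directly from label and score arrays (pure NumPy; the AUC sweep below ignores score ties, which a production implementation should handle):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F-1 from binary label/prediction arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def auc_score(y_true, scores):
    """Area under the ROC curve, sweeping the classification threshold."""
    order = np.argsort(-scores)                 # descending score
    y = y_true[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])
    # Trapezoidal integration of TPR over FPR.
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))
```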
To identify which patterns are typical and indicative for functional groups, we examine representative examples of those patterns from the dataset. For each channel, typical examples are subsets of a spectrum that maximize the corresponding feature variable in that channel. To identify those, we first extract the output of the final convolutional layer for the entire test set, resulting in a matrix with the shape [nSamples, nChannels, nFeatures] with nFeatures being the number of positional outputs of the last convolutional layer, i.e. the size of a channel vector.
In the first step, we identify for each channel the 100 input samples with the highest output feature variables for this channel in the test set.§ The receptive fields causing the highest activation are linked to the feature variable activations by the structure of the neural network. We identify the most important patterns in the receptive fields, i.e. the parts of the spectrum that caused the high activations, and display them in Fig. 3.
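Step one can be sketched as follows; the [nSamples, nChannels, nFeatures] latent shape and the 16-wide, stride-4 receptive fields follow the text, while function and variable names are illustrative. Unlike the per-sample selection described above, this simplified version ranks all (sample, position) activations jointly:

```python
import numpy as np

def top_receptive_fields(latent, spectra, channel, k=100, rf_size=16, rf_stride=4):
    """Return the k input windows that most strongly activate one channel.

    `latent` has shape [nSamples, nChannels, nFeatures] (output of the final
    convolutional layer); `spectra` has shape [nSamples, nPoints].
    """
    acts = latent[:, channel, :]                    # [nSamples, nFeatures]
    flat = np.argsort(-acts, axis=None)[:k]         # top-k activations overall
    samples, positions = np.unravel_index(flat, acts.shape)
    # Map each feature position back to its receptive field in the input.
    windows = [spectra[s, p * rf_stride : p * rf_stride + rf_size]
               for s, p in zip(samples, positions)]
    return samples, positions, windows
```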
The receptive fields span 16 elements in the input vector, which translates to windows of ca. 67 cm−1 width. Hence, as the receptive fields are small enough to contain, e.g., a single peak, common patterns can be identified across a small number of examples, as in Fig. 3. In Fig. 3 we show examples for all eight channels of the neural network. The first row displays the receptive fields that caused the highest output values of the feature extractor for the respective channel. The rows below display examples randomly chosen from these 100 receptive fields. While not identical, it is evident that receptive fields within a single channel exhibit consistent patterns.
As the second step, we conduct an examination of the classifier (the final fully connected layer), which uses these patterns and their location in the spectra to carry out classifications. Each neuron in the classifier can be mapped to a channel and a feature vector in the latent matrix, which in turn can be mapped to a pattern (as in Fig. 3) and a wave number range in the input spectra, respectively. Therefore, a neuron exhibiting a high weight indicates that the pattern and spectral region associated with it have high importance when classifying a functional group. The output for each functional group is connected to the channel output of the CNN via [nChannels × nPositions] neurons. In Fig. 5, we visualize the weights for nitrile and alkene as a comparison. In particular, we see that the middle channel is important for both. From Fig. 3, we can identify those as a broad peak and a flat region.
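Mapping classifier weights back to (pattern, wave number) pairs can be sketched as below; the [17, 8 × 209] weight shape follows the architecture above, while the wave number range is a placeholder assumption to be replaced with the actual axis of the dataset:

```python
import numpy as np

def weight_map(dense_weights, group, n_channels=8, n_positions=209,
               wn_min=400.0, wn_max=3900.0):
    """Reshape one functional group's dense-layer weights to
    [channel, position] and attach an approximate wave number per position.

    `dense_weights` has shape [nGroups, n_channels * n_positions], matching
    a channel-major flatten of the latent matrix.
    """
    w = dense_weights[group].reshape(n_channels, n_positions)
    wavenumbers = np.linspace(wn_min, wn_max, n_positions)
    return w, wavenumbers
```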
By combining the learnt features and their connection to the prediction for each functional group, we can make concise statements about the patterns the CNN learnt such as “a peak at 2200 cm−1 is strongly correlated with nitrile”. These observational statements allow us to study whether the patterns the network uses compare well to empirical practices in spectrum identification, and further investigate instances where the CNN uses unexpected patterns.
Table 2 Molecular metrics for our model compared to Fine et al.44

|  | Molecular perfection | Molecular F-1 |
| --- | --- | --- |
| Rieger, Wilson and Flores (ours) | 0.59(2) | 0.895(5) |
| Fine et al.44 | 0.63 | 0.905 |
Spectra classification is a multi-label problem, where each spectrum can correspond to more than one label (the functional group) and the classes are not mutually exclusive. Hence, to obtain a broader picture of model performance, we also compute two additional metrics, the molecular perfection and molecular F-1 defined in Fine et al.,44 which are designed to capture the ability of a model to correctly predict all functional groups of a molecule. Briefly, Molecular Perfection is the proportion of spectra in the test set for which the prediction of all functional groups is correct. For a definition of the molecular F-1 we refer to Fine et al.44. These metrics are computed for the functional groups listed in Table S1 (see ESI†).
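Molecular perfection can be computed directly from the multi-hot label and prediction matrices; a short sketch:

```python
import numpy as np

def molecular_perfection(y_true, y_pred):
    """Proportion of spectra for which every functional-group label is
    predicted correctly (the 'molecular perfection' of Fine et al.)."""
    return float(np.mean(np.all(y_true == y_pred, axis=1)))
```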
Before analysing this comparison, we note a key difference between the approach used by Fine et al.44 and ours. The authors did not use a test set in their work and instead reported results on the validation set. Here, we randomly split the training/validation/test sets 70/15/15 and report results on the test set, following standard practice in machine learning research. In Table 2, we compare the molecular metrics between our approach and Fine et al.44 directly. We observe that the performance metrics are comparable, although our approach performs slightly worse. In return, our approach allows for human-understandable explanations both for individual decisions as well as patterns the CNN has learnt to classify functional groups. In Section 2.4.1, we explain how we utilize the bottleneck in the feature space to extract insights about the classification from the trained neural network, resulting in slightly worse predictive performance in return for increased interpretability, similar to findings in Koh et al.47.
Finally, our results were computed as the average across 10 seeded models. Table 1 also shows the standard deviation across the seeded models as values in parentheses, expressed in units of the last significant digit of the mean. The standard deviation values are low across most metrics and functional groups, indicating that our model architecture yields consistent predictions across random initializations.
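The parenthesis notation (standard deviation expressed in units of the mean's last significant digit) can be produced with a small helper; a sketch, not necessarily the exact rounding convention used for Table 1:

```python
import numpy as np

def mean_with_std(values):
    """Format mean and std of seeded runs as e.g. 0.59 +/- 0.02 -> '0.59(2)'."""
    mean, std = np.mean(values), np.std(values)
    if std == 0:
        return f"{mean:g}"
    # Decimals chosen so the std's first significant digit is the last digit.
    decimals = max(0, -int(np.floor(np.log10(std))))
    digit = int(round(std * 10 ** decimals))
    return f"{mean:.{decimals}f}({digit})"
```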
Intuitively, if the differently seeded neural networks learnt similar important regions, the vectors with the dimensions 1 × 209 for each functional group will look similar across neural networks. To verify this, we visualize the linear layer weights for two exemplary functional groups in Fig. 6 (the extended version with all functional groups is shown in Fig. S4†). To facilitate better understanding we have mapped the dense layer weights back to the corresponding wave number scale in the input spectra, similar to Fig. 5. Since the inputs of the dense layer are normalized by the final batch normalization to have zero mean and standard deviation one, the dense layer weights indicate how much each region contributes to the final classification.
For each linear layer we plot the mean and standard deviation of the maximum value across the channels. We see that – on average – the CNNs highlight distinctive peaks for methyl and alkene groups, indicating these regions to be important when classifying the functional groups. The low standard deviation (as indicated by the shaded area around the mean) also indicates that all neural networks have learnt that e.g. the region around 3000 cm−1 – characteristic of C–H stretch – is important to classify methyl groups; likewise, the region around 1600 cm−1 – distinctive of C=C stretch – is important to classify alkenes.
We further note that the important regions emphasized by our CNNs compare well to those in Fine et al.,44 e.g. the “alkene bending motion around 900 cm−1”. We therefore conclude that CNNs learn consistent patterns across differently seeded networks to classify functional groups, and these patterns seem related to characteristic group frequencies. In the following sections we examine these patterns in depth using explainability methods.
However, when we revisit the results for nitriles we note that, while the CNN has clearly learned the location of C≡N vibrations, the precision, recall and F1 scores of the nitrile classifier are relatively poor (Table 1). Only 4% of the spectra come from nitrile compounds, so it is possible that the dataset has too few instances of nitriles to learn other relevant features besides the main C≡N band. Class imbalance is known to skew the classifier's bias towards the majority class.57 Hence the model misclassifies a substantial number of instances in the minority class, leading to poor precision and recall. This argument is further supported by the high AUC (0.97), a metric robust against class imbalance. Upon closer examination, however, even less represented classes such as alkynes and aldehydes (proportion 0.02, Table 1) are classified with better precision and recall scores than nitriles, suggesting that class imbalance is not the only reason for the poor classification metrics of nitriles; separability issues might also be at play. We expect functional groups with more and stronger characteristic IR bands to be easier to identify. Nitriles exhibit only a single and relatively weak distinctive vibration: the C≡N stretching band. In contrast, both aldehydes and alkynes exhibit several characteristic bands, with comparatively stronger intensities, that might make them easier to classify. In short, the relatively poor classification performance of nitriles might be related not only to class imbalance, but also to an inherent difficulty in classifying them, as they exhibit only one weak characteristic band. These observations imply that, even if a classifier has learned the group frequencies, it might not produce accurate results given the class imbalance in the dataset and separability challenges between spectra.
Fig. 4 zooms into the first three feature variables of the alkene classifier within 1350–1950 cm−1. The characteristic group frequency of alkene is added for comparison, alongside the patterns that lead to the strongest activation for each channel. We see in the figure that peaks followed by a tail (channels 1 and 3) activate the most above 1620 cm−1. In contrast, patterns of tails followed by a peak (channel 2) activate the most below 1620 cm−1. In all cases, the peak in the patterns is oriented towards the wave number where C=C vibrations are expected, while the tails orient away from this region. We interpret these patterns as the CNN learning to emphasize that a single peak at ca. 1620 cm−1 – without any neighbors – is characteristic of alkenes. For context, the C=C absorption bands at ca. 1650 cm−1 are not the only ones within this region; a variety of vibrations from carbonyl C=O bonds are also expected. Acyl halides, esters, aldehydes, carboxylic acids, ketones and amides all share C=O absorptions at ca. 1650 cm−1. Moreover, these absorptions are highly sensitive to the electronic environment in the vicinity of the bond, often resulting in many overlapping bands with complex shapes. As the region at ca. 1650 cm−1 is crowded with peaks from multiple functional groups, we believe the CNN places extra emphasis on distinguishing peaks from C=C vibrations – characteristic of alkenes – from those of C=O vibrations. The network thus assigns high weights to tail-peak and peak-tail patterns that outline a single peak with no neighbors.
Not all activated regions on the classifier weights coincide with group frequencies. For instance, carboxylic acids have group frequencies within 1500–1700 cm−1, yet their network weights (Fig. S18†) show strong activations >3500 cm−1. This is expected given that carboxylic acids share with alcohols O–H bonds that vibrate at ca. 3570 cm−1, and both groups co-occur frequently in the dataset (Fig. S23†). Their activation weights are consequently similar.
However, the same argument cannot be made for the nitro and ketone compounds. Their classifier weights share relatively strong activations between 2400 and 2700 cm−1, a region usually devoid of characteristic bands. At this point it is unclear what the CNN has learnt from this region since the region is activated across all channels, meaning it uses most patterns – peaks, peak shoulders, tails – to make classifications. We further investigate whether the network uses spurious patterns such as post-processing artifacts or non-apparent spectroscopic signals.
We visualize the spectra within the 2400–2700 cm−1 region, and segregate these into two subgroups: one where a particular functional group is present, the other where said functional group is absent. Fig. 7 illustrates the overlap of these subgroups for nitro and ketone groups; Fig. S24† completes the series for all functional groups. The spectra exhibit symmetric peaks of Lorentzian and pseudo-Voigt lineshape, as expected from infrared absorption bands. Notably, there are no spurious patterns typical of spectral noise (e.g. outliers) nor of pre-processing (e.g. discontinuities or asymmetrical curves). We conclude that the patterns within the 2400–2700 cm−1 region originate from infrared absorption bands, even if these are not typically considered to be characteristic of functional groups. Bands within the 2400–2700 cm−1 region are usually assigned to anharmonic vibrations: overtones that result from higher-energy harmonics of fundamental vibrations, or combination bands that derive from the simultaneous excitation of multiple vibrational modes in the molecule.58,59 These vibrations result in weak absorption bands in the spectrum, and so experts seldom use them for molecule identification.7 Fig. 7 suggests, however, that these bands are detectable and informative enough for the CNN to make its classifications more certain.
The presence of bands within the 2400–2700 cm−1 region does not imply that they effectively discriminate between functional groups. For the case of ketones, the median intensity in Fig. 7 is similar for spectra both with and without the functional group. The role of these bands in the classifier's decision for ketones is unclear: the classifier exhibits high activations in this region, even though the spectra are not very different from each other upon visual inspection. On the other hand, the median lines for the nitro subgroups are clearly different: the spectra of nitro-containing molecules exhibit – on average – fewer bands within 2400–2700 cm−1 compared to molecules without the nitro group. Therefore, these bands are suitable for distinguishing spectra with and without nitro compounds, which explains the high attention the nitro classifier assigns to this region. In summary, our observations suggest that the CNN uses not only fundamental vibrations characteristic of the functional groups, but also overtone and combination bands, to maximize the accuracy of its classifications.
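The subgroup comparison of Fig. 7 reduces to a median over a wave number window; a sketch assuming a shared wave number axis across all spectra:

```python
import numpy as np

def median_profiles(spectra, labels, group, wavenumbers, lo=2400.0, hi=2700.0):
    """Median intensity profiles within a wave number window, split into
    spectra with and without a given functional group (cf. Fig. 7)."""
    window = (wavenumbers >= lo) & (wavenumbers <= hi)
    present = labels[:, group] == 1
    with_group = np.median(spectra[present][:, window], axis=0)
    without_group = np.median(spectra[~present][:, window], axis=0)
    return with_group, without_group
```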
More broadly, the patterns learnt by the CNN can be rationalized in terms of the principles of infrared absorption: chemical moieties absorb infrared light at characteristic group frequencies, such absorptions are peak-shaped, and might give rise to anharmonic siblings such as overtones and combination bands. This clear connection between physico-chemical principles and CNN patterns enables us to qualitatively assess the generalization abilities of the network, and further clears the path to leverage more chemical/physics-based knowledge in future model design. Physics-based intuition can be incorporated into the model, for example using IR tables to (smart-)initialize parameters, initializing filters that optimally recognize peaks, and/or transferring understanding from pretrained models60 to new datasets (potentially even from different spectral regions). Our results also highlight that bigger models are not always better; through a judicious balance between performance and explainability considerations, we deliver a predictive model that maintains transparency without compromising on predictive performance.
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00203a |
‡ The code to further inspect the architecture and reproduce the results is available at https://github.com/laura-rieger/SpectraML-Classification. |
§ The number was chosen ad hoc. Considering the ≈1200 spectra in the test set, each with 209 receptive windows (∼240 000 samples in total), we consider the top 0.04% of examples.
This journal is © The Royal Society of Chemistry 2023 |