Jake Williams* and Eric Jonas
Department of Computer Science, University of Chicago, Chicago, USA. E-mail: williamsjl@uchicago.edu; ericj@uchicago.edu
First published on 15th September 2023
Accurate simulation of solution NMR spectra requires knowledge of all chemical shift and scalar coupling parameters, traditionally obtained by heuristic-based techniques or ab initio computational chemistry methods. Here we present a novel machine learning technique that combines uncertainty-aware deep learning with rapid estimates of conformational geometries to generate Full Spin System Predictions with UnCertainty (FullSSPrUCe). We improve on the previous state of the art in chemical shift accuracy, predicting protons to within 0.209 ppm and carbons to within 1.213 ppm. Further, unlike previous GNN models, we predict all scalar coupling values, achieving 3JHH accuracies between 0.838 Hz and 1.392 Hz on small experimental datasets. Our uncertainty quantification shows a strong, useful correlation with accuracy: the most confident predictions have significantly reduced error, with our top-80% most confident proton shift predictions having an average error of only 0.140 ppm. We also properly handle stereoisomerism and intelligently augment experimental data with ab initio data through disagreement regularization to account for deficiencies in training data.
Computational methods for generating, predicting and understanding NMR parameters are studied and implemented at multiple levels of theory through a variety of physics-based and machine learning techniques. Towards the higher levels of theory are first principles (ab initio) techniques, which use our knowledge of physics and chemistry to compute molecular properties. A frequently used ab initio technique is density functional theory (DFT), which models electron densities to calculate molecular properties. Advances in DFT have allowed it to produce high quality shift and coupling values,3,4 which can then be used in a variety of practical applications and as a common point of comparison.1,5,6 Towards the lower levels of theory are machine learning (ML) techniques, which primarily use data to determine their predictions. We explore the advancements in ML models in Section 2.
One of the major distinctions between techniques and individual models is their use of a 2D or 3D representation of a molecule. Molecules are commonly represented as graphs for the purposes of instruction and modeling, and this extends to some computational techniques. The mathematical representation of a graph, a set of vertices and the edges which connect them, is a two-dimensional structure. Molecules cannot be described solely by their connectivity, however, as 3D properties such as varying bond angles and lengths can greatly impact the shape of a molecule and its resulting spectrum. This occurs because of the relatively long acquisition period of NMR (usually between 50 ms and a few seconds),7 during which the instrument observes multiple conformations of the same molecule, so the spectrum is composed of their average.8 The spectrum is also sensitive to isomeric differences,2 which are not captured by 2D structures.
DFT (along with most ab initio techniques) uses exact atom coordinates rather than a graph. Machine learning techniques, on the other hand, vary greatly in their usage of 2D or 3D structures. Those that do use 3D structures typically do so by generating a conformer or set of conformers using DFT.9,10 DFT is slow, often prohibitively so, so other methods for generating conformers may be considered, including force-field minimization, parallel tempering and distance geometry techniques, each with its own tradeoff between approximation quality and speed. ETKDG11 is one such distance geometry technique, which incorporates experimental and physical knowledge to improve its conformer generation.
Using machine learning and the 3D structure of a molecule, we seek to predict all chemical shift and coupling parameters, and provide a measure of uncertainty in our predictions. In this paper, we will briefly explore other machine learning based techniques for NMR prediction to highlight our improvements. We will then demonstrate our success across multiple prediction tasks and performance metrics, before giving a detailed explanation of our methodology.
With the advancement of deep learning, other recent models have adopted graph neural networks (GNNs) for this prediction task. Graph neural networks operate on a graph (a set of vertices and the edges that connect them) to predict per-vertex, per-edge, and per-graph properties. This connects naturally to the graphical model of a molecule, and GNNs enforce relational biases14 that should be highly advantageous in NMR prediction tasks. Some networks, such as the model produced by Jonas and Kuhn,15 use a 2D molecule representation to strictly fit the GNN paradigm. Others, such as the CASCADE model,10 use the full 3D structure from DFT to better inform predictions. The 2D versus 3D representation question may have limited some of these applications, as GNN models thus far have been limited solely to predicting chemical shifts, with no predictions of scalar couplings.
The further these models move from ab initio based techniques, the more important it is to understand their modes of failure. Even DFT can have major breakdowns that make it unreliable in some domains.16 Understanding whether a prediction can be trusted and to what degree is an important direction of work that is rarely addressed in machine learning tasks. Previous works have used ensembling9 or ensembling-like15 techniques to measure the uncertainty in their predictions, and we follow in a similar vein.
Our model generates Full Spin System Predictions with UnCertainty (FullSSPrUCe), meaning it predicts both chemical shift and scalar coupling values using a graph neural network with a 3D structure. Further, whereas previous ML methods that use 3D structures have taken them from DFT, we use distance geometries from ETKDG to make faster predictions. Lastly, we continue the previous work done to model uncertainty by providing a quantified estimate with each prediction. Thus, our model improves on the state of the art by making more accurate predictions of full spin systems using conformers generated with faster methods, while providing a quantified estimate of uncertainty.
We will then consider the impact of our uncertainty quantification as a tool for sorting and selecting predictions. We perform uncertainty quantification by showing different portions of our model different subsets of the training data. We can then use all sections of the model at test time and compare their differences to estimate uncertainty, as shown in Fig. 9. As opposed to traditional ensemble methods, such as IMPRESSION,9 the result is a single model, trained only once, but with many of the benefits of full ensemble techniques. We refer to this method as bootstrapping, alluding to a similar method in statistics. This method is explained in detail in Section 4.4.
We then look at the impact of using ETKDG for conformer generation. We explore both the differences in performance and time for our model. We will also consider the downstream task of stereoisomer identification, which is enabled by the 3D structures and inspired by DP4.2
Lastly, we consider a prediction task in which both experimental and ab initio training data are available. This has traditionally been attempted using transfer learning;10 here we introduce a new method we call disagreement regularization. Disagreement regularization learns to predict experimental and ab initio data simultaneously, moderating the loss function along two channels according to how similar the experimental and ab initio ground truth values are. See Section 4.5 for details on this loss function.
Throughout this section, each of our models is trained and evaluated on data from the NMRShiftDB.17 This is a user-contributed database of experimental NMR shift values for small molecules, which has been used previously to train and evaluate machine learning models, including those to which we are comparing our model.10,15 For our purposes, we further narrowed the dataset to molecules with at most 128 atoms (including protons) and containing only the atoms H, C, O, N, F, S, P and Cl, resulting in approximately 33k molecules (see ESI† for more details). For each of these molecules, which have existing experimental parameters, we used DFT to compute an additional set of parameters. Thus, for the NMRShiftDB molecules, we have two ways that we can train and evaluate models – on experimental or ab initio data. Note that in both cases, the input remains the same – a 2D representation (SMILES string, etc.) of a molecule, from which we will use ETKDG to generate conformers. This means that when performing an ab initio prediction task, we are not predicting the results from the same final conformer that the DFT method would have used. Here, models trained on experimental data are evaluated on experimental data and those trained on ab initio data are evaluated on ab initio data; we will explore mixing these types in later sections.
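For illustration, a filter of this kind can be implemented directly with RDKit; the 128-atom limit and element list below are the criteria stated above, while the function name and scaffolding are our own.

```python
from rdkit import Chem

# dataset criteria described above (illustrative helper, not the paper's code)
ALLOWED_ELEMENTS = {"H", "C", "O", "N", "F", "S", "P", "Cl"}
MAX_ATOMS = 128  # including explicit hydrogens

def passes_filter(smiles: str) -> bool:
    """Return True if a molecule satisfies the atom-count and element criteria."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    mol = Chem.AddHs(mol)  # count protons explicitly
    if mol.GetNumAtoms() > MAX_ATOMS:
        return False
    return all(atom.GetSymbol() in ALLOWED_ELEMENTS for atom in mol.GetAtoms())
```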
In Table 1, we compare our mean average error on 1H and 13C shift prediction tasks. We train four models using the NMRShiftDB data described above, split according to shift type (protons vs. carbons) and experimental vs. ab initio. Jonas and Kuhn use this same dataset, but only the experimental data. IMPRESSION performs only ab initio prediction, having performed its own DFT calculations on molecules from the Cambridge Structural Database,18 using adaptive sampling to obtain a training set of 882 molecules. CASCADE does both experimental and ab initio predictions, performing its own DFT calculations on an 8k-molecule subset of NMRShiftDB, and using a further 5k subset of 13C experimental data that its authors believe is reliable. FullSSPrUCe provides a clear and significant improvement over the comparison models on all four prediction tasks, including 5.0% and 16.7% improvements over CASCADE on ab initio 1H and 13C, respectively. Scatter plots of our predictions versus the ground truth for these four tasks are provided in Fig. 2.
In Table 2, we compare our mean average error on a subset of scalar couplings. Here we trained and tested only on ab initio data, because we did not have any sufficiently large experimental datasets. We used this ab initio trained model to examine our performance on three small experimental datasets in Fig. 3. Note that CASCADE and Jonas and Kuhn's GNN do not make any coupling predictions, and IMPRESSION predicted only 1JCH coupling, for which our predictions are on average 22.0% more accurate.
Table 2 Mean average error (Hz) for scalar coupling predictions on ab initio data

Model | 1JCH | 2JHH | 3JHH | 4JHH
---|---|---|---|---
IMPRESSION (Gerrard et al.)9 | 0.870 | — | — | —
FullSSPrUCe | 0.679 ± 0.014 | 0.194 ± 0.005 | 0.504 ± 0.011 | 0.121 ± 0.002
Fig. 3 Coupling prediction accuracy on previously chosen subset of coupling types, measured as the rolling mean average error and rolling 95th percentile error in Hz as quantified uncertainty increases. We observe a strong correlation, even on the three small experimental datasets shown here,4,19,20 again demonstrating the usefulness of the quantified uncertainty in selecting for the predictions with the best accuracy.
We repeat this procedure for the subset of scalar coupling values we addressed earlier in Fig. 3. Recall that we only performed training on ab initio data due to the lack of experimental data. We did collect a small amount of experimental coupling data from three sources,4,19,20 though these were used only as additional test sets for our 3JHH accuracy. We again see a strong correlation between increased uncertainty and increased error. On ab initio coupling predictions, taking only the top-80% most confident predictions improves MAE by between 14.4% and 47.7%. A measure of uncertainty can be crucial in understanding and utilizing predictions. Our uncertainty measure is simple to train and effective at test time in identifying predictions' expected accuracy.
This sorted, rolling error measurement provides a useful way to visualize the relationship between error and the uncertainty quantification. However, we may also want a numerical estimate of this relationship, so we directly measure the correlation between the error and the uncertainty values for each model, as reported in Table 3. We see that the correlation is meaningfully positive for all models, including when evaluating the coupling models on the selected small external datasets.
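The sorted, rolling error curves can be reproduced with a few lines of NumPy. The sketch below assumes per-prediction arrays of absolute errors and quantified uncertainties; the window fraction and function name are our own choices.

```python
import numpy as np

def rolling_mae_by_confidence(abs_err, uncertainty, window_frac=0.2):
    """Sort predictions by increasing uncertainty and compute a rolling MAE."""
    abs_err = np.asarray(abs_err)
    order = np.argsort(uncertainty)  # most confident predictions first
    sorted_err = abs_err[order]
    window = max(1, int(window_frac * len(sorted_err)))
    kernel = np.ones(window) / window
    return np.convolve(sorted_err, kernel, mode="valid")

# MAE of only the top-80% most confident predictions:
# order = np.argsort(uncertainty)
# top80_mae = np.mean(np.asarray(abs_err)[order][: int(0.8 * len(order))])
```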
Given the similarities in performance, the main motivation for using ETKDG is that distance geometry techniques are faster. In Table 4, we examine the difference in time taken to generate an ensemble of conformers across multiple methods. We also note that each of these methods is approximately linear in the number of conformers generated. Returning to Fig. 5, the tradeoff between accuracy and number of conformers can therefore be interpreted as a tradeoff between accuracy and time taken. However, the tradeoff is not strictly linear, so we can pick a point along the curve which balances it to a desired level for a given application. In this paper, we use 50 conformers to demonstrate the top-end accuracy of our models, but for other tasks (such as predicting datasets on the order of tens of millions of molecules), using an ensemble with only 10 conformers may be appropriate.
Table 4 Time taken per molecule by each method

Method | Time taken per molecule
---|---
DFT | 2 days
PT | 1 h
ETKDG | 0.176 s
FullSSPrUCe | 0.064 s
While ETKDG is faster than either PT or DFT, and the conformers it generates are good enough to allow our models to achieve similar accuracy, it is still the bottleneck operation in the prediction pipeline, as shown in Fig. 6. This will become even more of an issue as we investigate larger, more diverse molecules, since the time taken to generate conformers scales quadratically with molecule size. However, the features that are passed to the model are summary statistics of the set of conformers generated, which, as we have shown, makes the model relatively insensitive to using fewer or lower-quality conformers. There may still be room to speed up molecular geometry featurization further, whether through other methods for generating conformers or by skipping that step entirely, without losing significant model accuracy.
For this task, we use a new dataset, specifically a subset of GDB-17.21 GDB-17 contains multiple different stereoisomers of many of its molecules. Using DFT, we can obtain separate NMR parameters for each stereoisomer. We selected molecules using the same criteria as in Section 3.1, resulting in a dataset with approximately 66k total molecules (see ESI† for more details). We retrain our model for predicting 1H and 13C shifts on this dataset, in the hope that a model trained only on data containing many stereoisomers of the same molecules will be well suited to stereoisomer identification. We base our work on stereoisomer identification on the work in creating the DP4 probability.2 In the original problem setup, the experimental 1H and 13C shift parameters are known for a molecule, but the particular stereoisomer is unknown. The goal is to identify which of a set of possible stereoisomers the experimental parameters were obtained from. In the original work, Smith and Goodman generated ab initio calculations of the parameters for each stereoisomer, and compared them to the experimentally derived parameters to generate a distribution over the set of stereoisomers. Here, we take the ab initio data in GDB-17 as the known parameters, and use our model to generate comparison parameters for the set of possible stereoisomers to identify the most likely candidate.
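As an illustration of the ranking step, the sketch below performs a simplified DP4-style calculation: each candidate's prediction errors are scored under a Student's t-distribution and the scores are normalized over the candidate set. The σ and ν values are placeholders rather than the fitted DP4 parameters (see ref. 2), and the function name is ours.

```python
import numpy as np
from scipy import stats

def dp4_style_probabilities(known_shifts, candidate_predictions, sigma=2.0, nu=10.0):
    """Rank candidate stereoisomers against a set of known shift values.

    known_shifts: reference shifts (here, the GDB-17 ab initio values).
    candidate_predictions: one array of predicted shifts per candidate stereoisomer.
    sigma, nu: t-distribution scale and degrees of freedom (placeholder values).
    """
    log_scores = []
    for pred in candidate_predictions:
        errors = np.abs(np.asarray(pred) - np.asarray(known_shifts))
        # probability of seeing an error at least this large, multiplied over atoms
        log_scores.append(np.log(stats.t.sf(errors / sigma, df=nu)).sum())
    log_scores = np.array(log_scores)
    weights = np.exp(log_scores - log_scores.max())
    return weights / weights.sum()  # probability assigned to each candidate
```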
For each of 980 test molecules, we generated up to 7 other stereoisomers (if there were enough permutations available), for a total of 8 candidate structures. This resulted in 665 of the molecules having 8 candidate structures, and the majority of the remaining molecules having 4 candidate structures. We then use FullSSPrUCe to generate predicted 1H and 13C shift values and DP4 to assign probabilities to each candidate structure. We are able to correctly identify the true structure approximately 49% of the time, including 45% of the time when a molecule has 8 candidate structures, and achieve top-2 accuracy on 73% of molecules (65% for 8 candidate molecules). Further breakdowns of the results are provided in the ESI.†
Our model clearly outperforms random guessing by a significant margin, but it is not as accurate as we might have expected. When training FullSSPrUCe on GDB-17, we cannot use all structures in the database due to its enormous size (∼166 billion molecules), and so we may not get a representative sample when choosing our subset to train on. In fact, we observe a degradation in testing performance, with errors nearly doubling (to 0.162 ppm for protons and 2.117 ppm for carbons). This work shows that our model has the capacity to distinguish between stereoisomers using proton and carbon shifts, but needs to be trained and tested more thoroughly before it can be consistently relied upon for these decisions in the same manner that we rely on DFT.
These difficulties have been acknowledged in previous works. IMPRESSION and Jonas and Kuhn each chose to work exclusively with either ab initio or experimental data due in part to these differences. CASCADE, on the other hand, used both to create three models (the DFTNN, ExpNN-dft and ExpNN-ff models previously discussed). They began by training DFTNN on 8k molecules with ab initio assignments, then used transfer learning to train to experimental assignments. However, it is important to note that to train their experimental models, they used only a 5k-molecule subset of their total training dataset, removing the molecules where experimental data and ab initio data disagreed beyond a certain threshold. Transfer learning provides a way to incorporate ab initio data into the training process, while still prioritizing experimental data, but it does not address the issues with experimental data discussed above.
We attempt to address the strengths and weaknesses of both types of data through our new technique of disagreement regularization. We train jointly on both types of data, using a model with two output channels (one for each type of data) that share the majority of their parameters. The loss function then prioritizes accuracy on the experimental channel, but compares the ground truth on each channel to determine how much to do so (see full details in Section 4.5). We evaluated this technique by dividing NMRShiftDB17 into two disjoint subsets.
We split the data into two subsets based on the presence of ‘small rings’ in the molecule, namely those rings with exactly 3 or 4 atoms. These rings create unique geometric and chemical properties, and so we expect that making predictions about NMR properties on these molecules without training data regarding them will be difficult. However, we can easily obtain ab initio data for these molecules to see how disagreement regularization helps us recover our performance. For this experiment, we focused only on proton shift prediction.
Our results are summarized in Table 5. We begin with two models trained strictly on experimental data: the baseline and the experimental control. The baseline is our normal experimental model, trained with experimental data from all ‘areas’ of chemical space that we are interested in. It shows that small ring molecules tend to be slightly more difficult to predict than big ring molecules, averaging 0.247 ppm error, as opposed to the big ring MAE of 0.205 ppm. The experimental control is a model trained with only experimental big ring data, so that it has never seen any molecules with small rings. It has a nearly identical big ring MAE, but its small ring MAE increases significantly to 1.248 ppm. This demonstrates that small rings operate as an area of chemical space that requires training data to predict accurately. However, as we know, we cannot always get experimental data from all areas of chemical space. Our goal then is to construct a model which performs well on the experimental small rings without having seen any such experimental training data.
Table 5 Proton shift MAE (ppm) on big ring and small ring molecules, with the experimental and ab initio training data provided to each model

Model | Exp. big rings | Exp. small rings | Ab initio big rings | Ab initio small rings | MAE big rings (ppm) | MAE small rings (ppm)
---|---|---|---|---|---|---
Baseline | ✓ | ✓ | | | 0.205 | 0.247
Experimental control | ✓ | | | | 0.206 | 1.248
Ab initio control | ✓ | | | ✓ | 0.205 | 0.283
Disagreement (λ = 5) | ✓ | | ✓ | ✓ | 0.205 | 0.268
Disagreement (λ = 10) | ✓ | | ✓ | ✓ | 0.243 | 0.206
We start with a naive implementation of this, which is to add in all of the ab initio data from the small rings that we have and train jointly. This is referred to as the ab initio control model. Adding in the ab initio small ring data does not impact the big ring MAE significantly, however it does bring the small ring MAE back to a reasonable value at 0.283 ppm. This is still worse than the baseline, which is to be expected given that we know ab initio data has systematic error when evaluated on experimental data. Therefore, training directly to ab initio data (including models trained with only ab initio data) will not be able to recover the baseline performance.
This brings us to our disagreement regularization models, which we designed in the hopes of tackling these sorts of tasks. In this case, we provide all of the ab initio data that we have, for both big and small rings, along with the experimental big rings. Using disagreement regularization, we are able to recover the big ring baseline while also reducing small ring error to 0.268 ppm, a significant improvement over the controls. We also explored the effect of our tuning parameter, λ, which balances how much to weight between ab initio and experimental loss. We find that some values of λ can improve small ring performance beyond even the baseline, although this comes at the cost of performance on the big rings. This tradeoff is explored more in Fig. 7.
In the first step, we create per-vertex features using basic properties of atoms. For each atom, we generate a tensor which includes the atomic number, the default and total valence of the atom, partial and formal charge of the atom, and other atom specific properties. These properties are all obtained directly from the RDKit22 Molecule object and the periodic table. They are collected into a tensor that we will refer to as x, which has shape N × fv, for N atoms in the molecule and fv per-vertex features.
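As a rough sketch of this step, the properties below can be read directly from RDKit; the exact feature list used by FullSSPrUCe is given in the ESI,† so this is a representative subset and the function name is ours.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdPartialCharges

def vertex_features(mol: Chem.Mol) -> np.ndarray:
    """Build an N x f_v per-vertex feature tensor x from basic atomic properties."""
    pt = Chem.GetPeriodicTable()
    rdPartialCharges.ComputeGasteigerCharges(mol)  # one choice of partial charge
    rows = []
    for atom in mol.GetAtoms():
        rows.append([
            atom.GetAtomicNum(),
            pt.GetDefaultValence(atom.GetAtomicNum()),  # default valence from the periodic table
            atom.GetTotalValence(),
            atom.GetFormalCharge(),
            atom.GetDoubleProp("_GasteigerCharge"),
            float(atom.GetIsAromatic()),
        ])
    return np.array(rows, dtype=np.float32)
```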
In the second step, we create a set of adjacency matrices directly from the graph. There are five total adjacency matrices, each of shape N × N and each symmetric. The first has a 1 wherever two vertices have any bond between them and 0s everywhere else. The remaining only place 1s where there is a specific bond type between the two vertices, with one adjacency matrix for each of single, double, triple, and bond order 1.5. This set of adjacency matrices is collected into a tensor that we will refer to as Gadj which has shape N × N × 5.
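The adjacency tensor Gadj can then be assembled as below; the ordering of the bond-type channels is our own choice.

```python
import numpy as np
from rdkit import Chem

BOND_ORDERS = [1.0, 2.0, 3.0, 1.5]  # single, double, triple, bond order 1.5

def adjacency_tensor(mol: Chem.Mol) -> np.ndarray:
    """Build the symmetric N x N x 5 adjacency tensor Gadj described above."""
    n = mol.GetNumAtoms()
    g = np.zeros((n, n, 5), dtype=np.float32)
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        g[i, j, 0] = g[j, i, 0] = 1.0  # channel 0: any bond
        for k, order in enumerate(BOND_ORDERS):
            if bond.GetBondTypeAsDouble() == order:
                g[i, j, k + 1] = g[j, i, k + 1] = 1.0
    return g
```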
Finally, we create a set of per-vertex-pair properties using geometric properties of the molecule. Specifically, we examine the distances between pairs of atoms and the angles between atoms which share a neighbor. These properties are calculated from an ensemble of molecular conformers that specify the 3D structure (position of each atom) of the molecule, allowing us to easily compute distances and angles between pairs of atoms. In experimental NMR, conformational variability leads to the observation of a weighted average of the parameters of each possible conformer, so we aim to capture more than a single value for the distance and angle features. We include the (Boltzmann weighted) mean over the conformers for some features, but also use a collection of Gaussians to parameterize the distribution over the conformers more thoroughly. We collect these features as Gfeat, which has a shape N × N × fe, for fe per-vertex-pair features.
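A sketch of one such per-pair feature, a Boltzmann-weighted Gaussian expansion of interatomic distances, is shown below; the Gaussian centres and width are illustrative values and the angle features are omitted for brevity.

```python
import numpy as np

def pair_distance_features(conformer_coords, weights,
                           mus=np.linspace(0.0, 8.0, 16), gamma=4.0):
    """Boltzmann-weighted Gaussian expansion of interatomic distances.

    conformer_coords: (C, N, 3) positions for C conformers; weights: (C,) Boltzmann weights.
    mus, gamma: Gaussian centres (angstrom) and width parameter (illustrative values).
    Returns an (N, N, len(mus)) slice of Gfeat.
    """
    diffs = conformer_coords[:, :, None, :] - conformer_coords[:, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)                  # (C, N, N) pairwise distances
    gauss = np.exp(-gamma * (dists[..., None] - mus) ** 2)  # (C, N, N, K) Gaussian basis
    return np.tensordot(weights, gauss, axes=1)             # weighted mean over conformers
```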
Generating these conformers has thus far been the slowest part of the prediction pipeline, but has shown promise in increasing accuracy and provides the only features that are capable of distinguishing stereoisomers. As a proof of concept, we started out using an ensemble of conformers drawn from the Boltzmann distribution through parallel tempering (PT). These PT conformers provide us with highly accurate distances and angles as well as an accurate sampling of the distribution of conformers, which is important for capturing the conformational variability. However, we are aiming to do rapid predictions of NMR properties, so we turn to a conformer generation method which, while less accurate, is much faster. To that end, we use ETKDG,11 a conformer generation method which uses distance geometry along with experimental torsion-angle preferences and other basic chemical knowledge. We then apply a single MMFF94 optimization step, from which we calculate the Boltzmann weight of each conformer; these weights are used when averaging the features across conformers. Through the use of the ETKDG conformers and basic molecular properties, we are able to rapidly generate feature vectors for each atom and each atom pair (see ESI† for full details of the features used). These vectors are fed as the inputs x, Gadj and Gfeat to the GNN, as shown in Fig. 8.
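A sketch of this conformer generation step with RDKit is shown below; the use of ETKDGv3, the conformer count, and the temperature are our own choices, and the Boltzmann weights are derived from the MMFF94 energies returned by the optimizer.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def etkdg_conformers(mol: Chem.Mol, n_conf: int = 50, temperature_k: float = 298.0):
    """Embed conformers with ETKDG, apply an MMFF94 optimization pass, and
    return the molecule with conformers plus their Boltzmann weights."""
    mol = Chem.AddHs(mol)
    AllChem.EmbedMultipleConfs(mol, numConfs=n_conf, params=AllChem.ETKDGv3())
    # one MMFF94 optimization pass per conformer; returns (converged, energy) pairs
    results = AllChem.MMFFOptimizeMoleculeConfs(mol)
    energies = np.array([energy for _, energy in results])  # kcal/mol
    kT = 0.0019872 * temperature_k  # Boltzmann constant in kcal/(mol K)
    weights = np.exp(-(energies - energies.min()) / kT)
    return mol, weights / weights.sum()
```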
The message passing scheme can easily be augmented to pass messages beyond a single step. To do so, we add layers to Gadj which correspond to powers of Gadj. For example, the third power of Gadj allows us to pass messages between vertices connected by three bonds. We can also add the identity matrix to Gadj to add self-loops. All of this can be done within the context of matrix multiplication and the stacking of additional adjacency matrices to create layers with multiple types of message passing occurring simultaneously. Using this basic GNN message passing algorithm through matrix multiplication, these layers update per-vertex properties in a manner that can be useful for predicting shift values, but they do not incorporate geometric properties efficiently.
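A minimal sketch of this matrix-multiplication message passing in PyTorch is given below; the layer sizes, activation, and the specific set of adjacency powers are illustrative rather than the exact FullSSPrUCe architecture.

```python
import torch
import torch.nn as nn

def expand_adjacency(g_adj: torch.Tensor) -> torch.Tensor:
    """Append self-loops plus the square and cube of the 'any bond' channel,
    so messages can travel up to three bonds in a single layer."""
    a = g_adj[..., 0]
    eye = torch.eye(a.shape[0], device=a.device)
    return torch.cat([g_adj, torch.stack([eye, a @ a, a @ a @ a], dim=-1)], dim=-1)

class MessagePassingLayer(nn.Module):
    """One message passing step over a stack of adjacency matrices."""
    def __init__(self, f_in: int, f_out: int, n_adj: int):
        super().__init__()
        self.linear = nn.Linear(n_adj * f_in, f_out)

    def forward(self, x: torch.Tensor, g_adj: torch.Tensor) -> torch.Tensor:
        # x: (N, f_in) per-vertex features; g_adj: (N, N, n_adj) adjacency stack
        messages = torch.einsum("ijk,jf->ikf", g_adj, x)  # aggregate neighbours per channel
        return torch.relu(self.linear(messages.reshape(x.shape[0], -1)))
```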
In the decode layers, we begin by updating the per-vertex properties, using x as the hidden state of a GRU and G as its input. G needs to be reshaped first, which is done using an aggregation function over one of the N dimensions. This results in the update xout, which is then expanded to be used as the input to a second GRU, whose hidden state is the original G. This creates the update Gout, which, along with xout, can then be passed through a non-linear layer and normalization before being passed to the next layer.
This repeated updating structure allows us to use the geometric features more naturally by updating them progressively using the relevant atomic features. Since both x and G are updated, we can use these outputs to predict both shift and coupling values. However, even in scenarios where we are only predicting shifts, these layers can be valuable in using the geometric features, and so we combine the updated per-vertex features from decode and message passing layers into a single per-vertex tensor that is then subjected to final layers for shift predictions. Through this combination of layers, we can consider atomic and interatomic features to produce full spin system predictions.
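The decode update can be sketched with two GRU cells as below; the aggregation (a mean over one N dimension), the expansion of xout, and the normalization are illustrative choices consistent with the description above rather than the exact FullSSPrUCe layer.

```python
import torch
import torch.nn as nn

class DecodeLayer(nn.Module):
    """Update per-vertex features x from the per-pair features G, then update G
    from the expanded x update, as described above (sizes are illustrative)."""
    def __init__(self, f: int):
        super().__init__()
        self.vertex_gru = nn.GRUCell(input_size=f, hidden_size=f)
        self.pair_gru = nn.GRUCell(input_size=f, hidden_size=f)
        self.norm_x, self.norm_g = nn.LayerNorm(f), nn.LayerNorm(f)

    def forward(self, x: torch.Tensor, g: torch.Tensor):
        # x: (N, f) per-vertex features; g: (N, N, f) per-vertex-pair features
        n, f = x.shape
        g_agg = g.mean(dim=1)                        # aggregate G over one N dimension
        x_out = self.vertex_gru(g_agg, x)            # hidden state x, input aggregated G
        x_pair = x_out.unsqueeze(1).expand(n, n, f)  # expand x_out to per-pair shape
        g_out = self.pair_gru(x_pair.reshape(n * n, f),
                              g.reshape(n * n, f)).reshape(n, n, f)
        return self.norm_x(torch.relu(x_out)), self.norm_g(torch.relu(g_out))
```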
Our bootstrapping method, as shown in Fig. 9, creates an ensemble of multiple FullSSPrUCe models which share most of their parameters. This differs from most ensemble methods, where each inducer is trained completely separately, sometimes with entirely different types of models.25 Our models differ only in their final layers (those after the message passing and decode layers), simple residual and linear layers directly modifying xout and Gout, which we call the bootstraps. At train time, the error for a given molecule is determined by considering the predictions from only a subset of the bootstraps, taking the mean over that subset. This allows each bootstrap to learn from a different subset of the training data without duplicating the training of the shared parameters.
Fig. 9 Each bootstrap layer sees a different subset of the data at train time. Then, at test time, we use the mean and standard deviation across the layers to make predictions and quantify uncertainty.
At test time, we generate our predictions using the mean of all of the models' predictions, but we can examine their differences in predictions to quantify the uncertainty. We use the standard deviation of the models' individual predictions as the uncertainty in our ensembled prediction, which has proven to correlate quite strongly with accuracy. Through our bootstrapping method, we are provided with a quantification of the uncertainty in a prediction, an important step towards better understanding and using the model.
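A sketch of the bootstrap heads and their train/test behaviour is given below; the number of heads, the per-molecule mask construction, and the use of simple linear heads are illustrative.

```python
import torch
import torch.nn as nn

class BootstrapHeads(nn.Module):
    """K lightweight output heads on top of the shared trunk. Each molecule's
    training loss uses only a subset of heads; at test time the mean over all
    heads is the prediction and their standard deviation is the uncertainty."""
    def __init__(self, f: int, n_heads: int = 10):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(f, 1) for _ in range(n_heads))

    def forward(self, x_out: torch.Tensor) -> torch.Tensor:
        # x_out: (N, f) shared per-vertex features -> (N, n_heads) predictions
        return torch.cat([head(x_out) for head in self.heads], dim=-1)

def bootstrap_loss(preds, target, head_mask):
    # preds: (N, K); head_mask: (K,) 0/1 subset of heads assigned to this molecule
    subset_mean = (preds * head_mask).sum(dim=-1) / head_mask.sum()
    return torch.abs(subset_mean - target).mean()

def predict_with_uncertainty(preds):
    # test time: combine every head
    return preds.mean(dim=-1), preds.std(dim=-1)
```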
The naive way to use all of our data would be to simply augment our experimental training set with as much ab initio data as we can obtain. The main issue is that ab initio data is known to have systematic errors, and so we need to make some distinction between ab initio and experimental data. The classic ML method for doing so is transfer learning, typically pretraining a model on ab initio data and then ‘fine-tuning’ it on the experimental data. Guan et al. use this method with their GNN to moderate success. However, this assumes that experimental values are always the gold standard for our training data, which we know they are not. Misassigned peaks, missing values, and experimental noise can degrade this data.
Our approach seeks to learn from ab initio and experimental data jointly, as in the naive approach, while favoring the experimental values, as in the transfer learning approach, but with an awareness of the flaws experimental data presents. We do so using a technique termed disagreement regularization, which compares ab initio and experimental values for a given molecule to decide how to assign the loss. Specifically, we create two output ‘channels’, one which seeks to predict the ab initio values for a molecule and one which seeks to predict experimental values. Creating these channels is as simple as having the final linear layers produce a result which is N × 2 rather than N × 1 for shift predictions. When we have only experimental or only ab initio data for a particular atom, we simply set the loss for that value as the loss along the appropriate channel. For the remaining values, the loss function is then a weighted combination of the loss on these two channels, where the weight is determined by how much the ab initio and experimental ground truths agree.
How should we choose the weight between our two channels? Our observation is that when ab initio and experimental data agree, both are likely to be good. When both are good, we are more interested in predicting experimental values. When the data disagree, it is most likely a result of noise or error in the experimental data, since ab initio data is relatively noise-free and does not suffer from human error. So, when the agreement is strong, we should weight towards experimental data, and when agreement is weak, we should weight towards ab initio. This is reflected in the loss function in eqn (1), where La and Le are the losses on the ab initio and experimental channels, respectively, ϕa and ϕe are the ab initio and experimental values given in the data, and λ is a tunable hyperparameter. We have chosen to leave the 1 ppm offset in the denominator fixed, but future work may explore the effects of adjusting this value as well.
(1)
Our hope is that this loss function will combat problems in both data sources. Because the two channels share inputs and parameters up to the very final layer, we believe that this loss function encourages the model to learn the difference between ab initio and experimental data to correct the systematic error in ab initio data. By weighting down the loss when experimental data is likely to be incorrect, as identified by its disagreement with ab initio data, the model can ignore potentially noisy or erroneous data. With the main goal of predicting experimental results, our disagreement regularization loss intelligently combines ab initio and experimental data.
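For illustration only, the sketch below implements a disagreement-weighted loss of this kind. It assumes one plausible functional form (the experimental-channel loss weighted by λ divided by the absolute ab initio/experimental disagreement plus the fixed 1 ppm offset), chosen to capture the qualitative behaviour described above rather than to reproduce eqn (1) exactly.

```python
import torch

def disagreement_regularized_loss(pred_ab, pred_exp, phi_a, phi_e,
                                  lam=5.0, offset_ppm=1.0):
    """Illustrative two-channel loss (one plausible form, not necessarily eqn (1)).

    pred_ab, pred_exp: model outputs on the ab initio and experimental channels.
    phi_a, phi_e: ab initio and experimental ground truths (NaN where missing).
    lam: tunable hyperparameter weighting the experimental channel.
    """
    loss_a = torch.abs(pred_ab - phi_a)
    loss_e = torch.abs(pred_exp - phi_e)

    both = torch.isfinite(phi_a) & torch.isfinite(phi_e)
    only_a = torch.isfinite(phi_a) & ~torch.isfinite(phi_e)
    only_e = ~torch.isfinite(phi_a) & torch.isfinite(phi_e)

    # weight towards the experimental channel when the two ground truths agree
    weight = lam / (torch.abs(phi_a - phi_e) + offset_ppm)

    total = (loss_a[both] + weight[both] * loss_e[both]).sum()
    total = total + loss_a[only_a].sum() + loss_e[only_e].sum()
    count = both.sum() + only_a.sum() + only_e.sum()
    return total / count.clamp(min=1)
```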
As shown in Section 3.1, our model has lower errors on its testing set than do the models to which we compared. Each previous model has made its own different choices, and we believe that our combination of choices has led to this improved performance. IMPRESSION is the only non-GNN based method. The Jonas and Kuhn GNN is the only method to use exclusively topological information. CASCADE intentionally restricted its dataset, particularly for experimental data, due to the potential unreliability of NMRShiftDB data, and did not train any models exclusively on experimental data. Unlike all previous methods, we do not draw features from a single conformer, but rather from an ensemble of conformers, which seems to make the features more useful. We combine the largest amount of data with the most consistently successful aspects of each previous model, and as a result we have seen the best performance. This performance is a strong step forward, but there are remaining areas that deserve further exploration.
One such area is solvent and temperature effects, which we ignore because user-contributed NMRShiftDB17 spectra often do not have solvent or temperature labels. This omission may contribute to error in multiple ways. Primarily, it will increase noise through unaccounted-for experimental variability. We suspect it also contributes to a noticeable difference between error rates for protons bonded to carbons and those bonded to other atoms. Across models, we see much better accuracy on predictions of 1H shifts for protons bonded to carbons, while those bonded to nitrogens, oxygens and other heavy atoms have significantly higher error (full details in ESI†). This is consistent with ab initio results, which tend to be further from experimental values for protons not bonded to carbons. These effects can often be corrected by accounting for solvent and temperature,30 which provides an interesting avenue for future work.
Recall that we use ETKDG conformers to create geometric features of molecules as inputs to our network. We showed a tradeoff between accuracy and number of conformers, which serves as a proxy for time spent generating conformers. This tradeoff is certainly not exclusive to our methods, as all computational techniques make some tradeoff between level of theory and time (e.g. molecular dynamics vs. distance geometry, etc.). Even other neural network models have this tradeoff.10 However, since our model takes particular metrics of the distribution of properties across conformers, we believe that we could improve on the tradeoff by calculating these metrics directly, leveraging machine learning techniques rather than generating a set of conformers to measure them. We would also like to improve on the stereoisomer identification task that relies on these conformers. Whether this improvement is through faster, more accurate geometry features or better training on molecules with stereoisomers, we believe that we can achieve DFT-level accuracy in this task.
Our disagreement regularization tool should also continue to be developed, as it provides us with an interesting method for combining experimental and ab initio data. We believe that the inclusion of the full ab initio dataset allows the model to learn the differences between ab initio and experimental values, and thus correct its small ring predictions accordingly. The noiseless ab initio data may also allow us to correct errors in experimental data, especially misassignments of spectra to their correct atoms, allowing us to surpass our baseline in some cases. However, the effects of certain hyperparameters present some unusual results and need further study to be fully understood.
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3sc01930f