Marcus
Wieder
*a,
Josh
Fass‡
*ab and
John D.
Chodera
a
aComputational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA. E-mail: marcus.wieder@choderalab.org
bTri-Institutional PhD Program in Computational Biology and Medicine, Weill Cornell Graduate School of Medical Sciences, New York, NY 10065, USA
First published on 19th July 2021
The computation of tautomer ratios of druglike molecules is enormously important in computer-aided drug discovery, as over a quarter of all approved drugs can populate multiple tautomeric species in solution. Unfortunately, accurate calculations of aqueous tautomer ratios—the degree to which these species must be penalized in order to correctly account for tautomers in modeling binding for computer-aided drug discovery—is surprisingly difficult. While quantum chemical approaches to computing aqueous tautomer ratios using continuum solvent models and rigid-rotor harmonic-oscillator thermochemistry are currently state of the art, these methods are still surprisingly inaccurate despite their enormous computational expense. Here, we show that a major source of this inaccuracy lies in the breakdown of the standard approach to accounting for quantum chemical thermochemistry using rigid rotor harmonic oscillator (RRHO) approximations, which are frustrated by the complex conformational landscape introduced by the migration of double bonds, creation of stereocenters, and introduction of multiple conformations separated by low energetic barriers induced by migration of a single proton. Using quantum machine learning (QML) methods that allow us to compute potential energies with quantum chemical accuracy at a fraction of the cost, we show how rigorous relative alchemical free energy calculations can be used to compute tautomer ratios in vacuum free from the limitations introduced by RRHO approximations. Furthermore, since the parameters of QML methods are tunable, we show how we can train these models to correct limitations in the underlying learned quantum chemical potential energy surface using free energies, enabling these methods to learn to generalize tautomer free energies across a broader range of predictions.
Tautomerism can also alter molecular recognition, making it an important consideration for supramolecular chemistry. To optimizing hydrogen bond patterns between a ligand and a binding site tautomerism has to be considered. In a theoretical study of all synthetic, oral drugs approved and/or marketed since 1937, it has been found that 26% exist as an average of three tautomers.5 While tautomerism remains an important phenomena in organic chemistry, it has not gained much appreciation in other scientific fields.
The typical small free energy difference between tautomers poses additional challenges for protein-ligand recognition. Local charged or polar groups in the protein binding pocket can shift the tautomer ratio and result in dominant tautomers in complex otherwise not present in solution.6 For this reason the elucidation of the dominant tautomer in each environment is not enough—without the knowledge about the ratio (i.e. the free energy difference) between tautomeric forms in the corresponding phase a correct description of the experimental (i.e. macroscopic) binding affinity might not be possible. To illustrate this, one might think about two extreme cases: one in which the tautomeric free energy difference in solution is 10 kcal mol−1 and one in which it is 1.0 kcal mol−1. It seems unlikely that the free energy difference of 10 kcal mol−1 will be compensated by the protein binding event (therefore one could ignore the unlikely other tautomer form), but a tautomeric free energy difference of 1 kcal mol−1 could easily be compensated. Examples for this effect—changing tautomer ratios between environments—are numerous for vacuum and solvent phase. One example is the neutral 2-hydroxypyridine (2-HPY)/2-pyridone (2-PY) tautomer. The gas phase internal energy difference between the two tautomers is ∼0.7 kcal mol−1 in favor of the enol form (2-HPY), while in water an internal energy difference of 2.8 kcal mol−1 was reported in favour of 2-PY. In cyclohexane, both tautomer coexist in comparable concentration while the 2-PY is dominant in solid state and polar solvents.7
The example above assumed two tautomeric forms—if there were multiple tautomeric forms, each with small free energy differences, using a single dominant tautomer in solvent and complex may still represent only a minor fraction of the true equilibrium tautomer distribution.
In the absence of a reliable, cheap, and fast experimental protocol to characterize tautomeric ratios for molecules, there is a need for computational methods to fill the gap. And—even if such a method exists—predicting tautomer ratios of molecules that are not yet synthesized motivates the development of theoretical models. Computational methods themselves have a great need for defined tautomer ratios. Most computational methods use data structures in which bond types and/or hydrogen positions have to be assigned a priori and remain static during the calculation, introducing significant errors if an incorrect dominant tautomer is chosen.11
The third round of the Statistical Assessment of the Modeling of Proteins and Ligands (SAMPL2) challenge included the blind prediction of tautomer ratios and provided an interesting comparison between different computational methods.12 Most of the 20 submissions were using implicit solvent models and ab initio or DFT methods in combination with a thermodynamic cycle (as shown in Fig. 1) to evaluate tautomeric free energy difference.12 However, as stated in Martin,13 “In summary, although quantum chemical calculations provide much insight into the relative energies of tautomers, there appears to be no consensus on the optimal method”. For the 20 tautomer pairs investigated in the SAMPL2 challenge (this includes 13 tautomer pairs for which tautomer ratios were provided beforehand) the three best performing methods reported an root mean squared error (RMSE) of 2.0,14 2.5 (ref. 15) and 2.9 kcal mol−1.16 While these results are impressive it also shows that there is clearly room for improvement. And, maybe even more importantly, the SAMPL2 challenge showed the need for investigating a wider variety of tautomer transitions to draw general conclusions about best practices and methods for tautomer predictions.12 The excellent review of Nagy concluded that tautomer relative free energies are sensitive to “the applied level of theory, the basis set used both in geometry optimization and […] single point calculations, consideration of the thermal corrections […] and the way of calculating the relative solvation free energy”.17 We'd like to add “quality of 3D coordinates” to these issues and discuss them in the following.
The standard-state Gibbs free energy is calculated using a quantum chemistry estimate for the electronic energy and, based on its potential energy surface, a statistical mechanics approximation to estimate its thermodynamic properties. While the accuracy of the electronic energy and the transfer free energy ΔGS is dependent on the chosen method and can introduce significant errors in the following we want to concentrate on the approximations made by the thermochemical correction to obtain the gas phase free energy: the rigid rotor harmonic oscillator (RRHO) approximation and the use of single or multiple minimum conformations to generate a discrete partition function (as written in eqn (1)). In the following the standard-state of all the thermodynamic properties should be implied and will not be added to the notation.
A commonly used approach for the zero point energy (ZPE) and thermal contributions is based on the ideal gas RRHO, assuming that the molecule is basically rigid and its internal motions comprise only vibrations of small amplitude where the potential energy surface can be approximated as harmonic around a local energy minimum. This assumption leads to an analytical approximation that describes the local configuration space around a single minimum based on the local curvature of the potential energy surface (Fig. 2).18
Errors are introduced because the harmonic oscillator approximation assumes that, for each normal mode, the potential energy associated with the molecule's distortion from the equilibrium geometry has a harmonic form. Especially low-frequency torsional modes would be more appropriately treated as hindered internal rotations at higher temperatures. This can lead to a significant underestimate of the configurational entropy, ignoring the contributions of multiple energy wells.19 The correct treatment of such low-frequency modes in the analytical rotational entropy part of the RRHO partition function can add considerable computational cost, since the numerical solution requires the calculation of the full torsion potential (periodicity and barrier heights).20
Another, related shortcoming of the RRHO approach is the focus on a single minimum conformation. Assuming that the correct global minimum has been identified this approach ignores the configurational entropy of all other conformations to the partition function. Methods like the ‘mining minima’ approach can help to mitigate this problem and construct a partition function using local configurational integrals from multiple minimum energy wells.19 The standard state free energy G° of a molecule can then be calculated for each of the minimum conformations k separately and combined as follows
(1) |
One of the most exciting developments in recent years has been the introduction of fast, efficient, accurate and transferable quantum machine learning (QML) potentials (e.g. ANI,23 PhysNet,24 and SchNet25) to model organic molecules. QML potentials can be used to compute QM-based Hamiltonians and—given sufficient and appropriate training data—are able to reproduce electronic energies and forces without loss of accuracy but with orders of magnitude less computational cost than the QM methods they aim to reproduce. QML potentials have been successfully applied to molecular dynamics and Monte Carlo (MC) simulations.26 Here, we present the application of a machine learning potential for the calculation of alchemical free energies for tautomer pairs in the gas phase (Fig. 3).
We begin by investigating the accuracy of a current, state of the art approach to calculate tautomer ratios on a set of 468 tautomer pairs selected from the publicly available Tautobase dataset9 spanning different tautomer reactions, number of atoms and functional groups. We are using a popular DFT functional (B3LYP) and basis sets (aug-cc-pVTZ and 6-31G(d))27–31 for the calculation of the electronic energy and a continuum solvation model (SMD32) to model the transfer free energy.
Furthermore we are investigating the effect of a more rigorous statistical mechanics treatment of the gas phase free energies. To investigate this effect we calculate tautomeric free energy difference in vacuum ΔtGcalcvac using (1) alchemical relative free energy calculations and (2) multiple minimum conformations in combination with the RRHO approximation (as used in the quantum chemistry approach). To enable a direct comparison between the two approaches we are using a QML potential in both calculations.
Since we are interested in tautomeric free energy differences in solution we investigate the possibility to optimize the QML parameters to include crucial solvent effects. The framework we have developed to perform alchemical relative free energy calculations enables us to obtain a relative free energy estimate that can be optimized with respect to the QML parameters. We are using a small training set of experimentally obtained tautomer free energies in solution ΔtGexpsolv and importance sampling to obtain reweighted with the optimized parameters.
Here we report the first large scale investigation of tautomer ratios using quantum chemical calculations and machine learning potentials in combination with alchemical relative free energy calculations spanning a large chemical space and ΔtGexpsolv values.
We calculated the tautomeric free energy difference ΔtGcalcsolv for 468 tautomer pairs (460 if the basis set 6-31G(d) was used) and compared the results with the experimentally obtained tautomer ratios in solution (expressed as free energy difference in solution ΔtGexpsolv) in Fig. 4.
Fig. 4 State of the art quantum chemistry calculations are able to calculate tautomeric free energy differences ΔtGcalcsolv with a RMSE of 3.1 kcal mol−1. The direction of the tautomer reaction is chosen so that the experimentally obtained tautomeric free energy difference ΔtGexpsolv is always positive. Panel (A) shows ΔtGcalcsolv as the difference between the sum of the gas phase free energy and transfer free energy for each tautomer pair plotted against the experimental tautomeric free energy difference in solution ΔtGexpsolv. B3LYP/aug-cc-pVTZ is used for the gas phase geometry optimization and single point energy calculation, the ideal gas RRHO approximation is used to calculate the thermal corrections. The transfer free energy is calculated on B3LYP/aug-cc-pVTZ/SMD optimized geometries using B3LYP/6-31G(d) and SMD. Values in quadrant II (x-axis entries positive, y-axis entries negative) indicate calculations that assigned the wrong dominant tautomer species (different sign of ΔtGcalcsolv and ΔtGexpsolv). The dashed line indicates the ideal behavior of the calculated and experimental values, the grey lines mark the ±1 kcal mol−1 interval. Red dots indicate tautomer pairs with more than 10 kcal mol−1 absolute error between ΔtGcalcsolv and ΔtGexpsolv. These tautomer pairs are separately shown in Table S.I.1.† In panel (B), the top panel shows the kernel density estimate (KDE) and histogram of ΔtGcalcsolv and ΔtGexpsolv. The red line indicates zero free energy difference (equipopulated free energies). In the lower panel the KDE of the difference between ΔtGexpsolv and ΔtGcalcsolv is shown. MAE and RMSE are reported in units of kcal mol−1. Quantities in brackets [X;Y] denote 95% confidence intervals. The Kullback–Leibler divergence (KL) was calculated using KL(ΔtGexpsolv‖ΔtGcalcsolv). |
Three quantum chemistry approaches were used to calculate these values:
The protocol used to obtain the results described above did not explicitly account for changes in internal rotors (only changes in the point group were considered). The results including changes in internal rotors are shown in Fig. S.I.5.† Including internal symmetry numbers did not improve the results.
The following discussion will concentrate on the results obtained with the best performing method (B3LYP/aug-cc-pVTZ/B3LYP/6-31G(d)/SMD). The results for all other methods are shown in the ESI Section in Fig. S.I.4.†
The tautomeric free energy difference in solution ΔtGcalcsolv calculated with B3LYP/aug-cc-pVTZ/B3LYP/6-31G(d)/SMD are shown in Fig. 4. The RMSE for the calculated values is 3.1 [2.7;3.4] kcal mol−1. Overall, the method tends to overestimate the tautomeric free energy difference in solution, as can be seen from Fig. 4B (lower panel). For 59 tautomer pairs (13% of the dataset) the method was not able to correctly determine the dominant tautomer species. For these 59 tautomer pairs the mean absolute error (MAE) between predicted and experimental value was 2.9 kcal mol−1.
5 out of the 6 tautomer pairs with highest absolute error (above 10 kcal mol−1) between experimental and calculated free energy difference have common scaffolds (values and molecules shown in the ESI Table S.I.1†). tp_1668, 1669, 1670 are all based on 5-iminopyrrolidin-2-one with either nitrogen, oxygen or sulfur on position 4 of the ring. tp_1559 and tp_853 both have methoxyethylpiperidine as a common substructure.
For the three analogues based on 5-iminopyrrolidin-2-one, we note that the experimental values are estimated results based on a three-way tautomeric reaction.34 While this might point to the unreliability of ΔtGexpsolv, we can not draw any definitive conclusions here without repeating the underlying experiment.
Since all four mentioned participants employed different methods, we will briefly describe the best performing methods of these publications for which results are shown in Table 1. The references to the methods used below are not cited explicitly, these can be found in the publications cited at the beginning of the following paragraphs.
Name | ΔtGexpsolv [kcal mol−1] | ΔtGcalcsolv [kcal mol−1] | ΔtGcalcsolv [kcal mol−1] | ΔtGcalcsolv [kcal mol−1] | ΔtGcalcsolv [kcal mol−1] | ΔtGcalcsolv [kcal mol−1] |
---|---|---|---|---|---|---|
This work | 16 | 15 | 14 | 35 | ||
1A_1B (tp_982) | −4.8 | −4.7 | −4.0 | −3.0 | −7.7 | −4.6 |
2A_2B | −6.1 | −6.8 | −5.7 | −5.7 | −9.7 | −6.3 |
3A_3B (tp_999) | −7.2 | −8.4 | −7.7 | −6.7 | −11.2 | −7.7 |
4A_4B | −4.8 | −0.4 | 0.5 | 0.8 | −4.6 | 0.6 |
5A_5B (tp_1614) | −4.8 | −4.7 | −3.9 | −4.4 | −6.2 | −5.6 |
6A_6B (tp_1005) | −9.3 | −11.4 | −7.6 | −9.7 | −11.2 | −10.0 |
RMSE | 1.3 [0.7,1.7] | 1.4 [0.6,2.1] | 1.5 [0.4,2.3] | 2.8 [2.0,3.5] | 1.3 [0.3,2.1] | |
7A_7B (tp_141) | 6.6 | 4.9 | 5.3 | 6.5 | 5.1 | 5.5 |
10B_10C (tp_1058) | −2.9 | −5.3 | 1.7 | 0.0 | −2.8 | 2.2 |
10D_10C (tp_1059) | −1.2 | −1.7 | 3.8 | 2.6 | −0.6 | 5.0 |
12D_12C (tp_1514) | −1.8 | −2.1 | 3.3 | 3.1 | −0.8 | 3.0 |
14D_14C (tp_1072) | 0.3 | −1.6 | 1.9 | 0.8 | 0.2 | 4.0 |
15A_15B (tp_504) | 0.9 | 6.1 | −3.0 | 3.6 | 0.0 | 0.9 |
15A_15C (tp_505) | −1.2 | 0.7 | −0.8 | 2.3 | −1.9 | 1.4 |
15B_15C (tp_506) | −2.2 | −5.0 | 1.8 | −1.2 | −1.9 | 0.5 |
RMSE | 2.5 [1.5,3.6] | 3.6 [2.5,4.5] | 2.9 [1.8,3.8] | 0.8 [0.5,1.1] | 3.8 [2.5,4.9] | |
Total RMSE | 2.2 [1.4,2.9] | 2.9 [2.0,3.6] | 2.4 [1.6,3.1] | 1.9 [1.2,2.6] | 3.0 [1.9,4.0] |
Klamt and Diedenhofen16 used BP86/TZVP DFT geometry optimization in vacuum and the COSMO solvation model. Free energy in solution was calculated from COSMO-BP86/TZVP solvation energies and MP2/QZVPP gas-phase energies. Thermal corrections (including ZPE) were obtained using BP86/TZVP gas phase frequencies.
Ribeiro et al.15 calculated the free energy in solution as the sum of the gas-phase free energy and the transfer free energy. The gas phase free energy was calculated using M06-2X/MG3S level of theory and the molecular geometries optimized with the same method. The corresponding transfer free energy were computed at the M06-2X/6-31G(d) level of theory with the M06-2X/MG3S gas-phase geometries using the SM8, SM8AD, and SMD continuum solvation models (Table 1 shows only the results with SM8AD, which performed best).
Kast et al.14 optimized geometries in gas and solution phase (using the polarizable continuum solvation model PCM) using B3LYP/6-311++G(d,p). Energies were calculated with EC-RISM-MP2/aug-cc-pVDZ on the optimized geometries in the corresponding phase. The Lennard-Jones parameters of the general Amber force field (GAFF) were used.
Soteras et al.35 used the IEF-MST solvation model parameterized for HF/6-31G(G) to obtain transfer free energy values. Gas phase free energy differences were obtained by MP2 basis set extrapolation using the aug-cc-pVTZ basis set at MP2/6-31+G(d) optimized geometries. Correlation effects were computed from the CCSD-MP2/6-31+G(d) energy difference.
The approach used in this work (B3LYP/aug-cc-pVTZ/B3LYP/6-31G(d)/SMD) performs well compared to the four approaches described above. For the total set of investigated tautomer pairs our approach has a RMSE of 2.2 [1.4,2.9] kcal mol−1, making it the second best performing approach only outperformed by Kast et al.14
The difference in RMSE between the explanatory and blind dataset is noteworthy. Approaches that perform well on the blind data set perform worse on the explanatory set and vice versa. This is to a lesser extent also true for our chosen approach—B3LYP/aug-cc-pVTZ/B3LYP/6-31G(d)/SMD performs worse for the explanatory tautomer set (RMSE of 2.5 [1.5,3.6] kcal mol−1) than on the blind tautomer set (RMSE of 1.3 [0.7,1.7]), but in comparison with the other four approaches, it is consistently the second best approach.
Interesting to note are the three tautomer pairs 15A_15B, 15A_15C and 15B_15C from the explanatory data set. The absolute error for these three pairs are 5.24, 1.9, and 2.8 kcal mol−1, respectively. It appears that the used approach has difficulty to model 15B correctly, showing larger than average absolute errors whenever 15B is part of the tautomer reaction. Most likely the hydroxyl group in 15B is critically positioned and sensitive to partial solvent shielding by the phenyl ring, something that has been noted before.14
The discrepancy between the different approaches (ours included) for the tautomer set shows that it seems highly difficult to propose a single method for different tautomer pairs that performs consistently with a RMSE below 2.0 kcal mol−1. This issue is made substantially worse by the many different ways methods can be used/combined and errors can be propagated/compensated during tautomeric free energy difference ΔtGcalcsolv calculations. While we believe that a RMSE of 3.1 kcal mol−1 is a good value for the chosen approach, especially when compared to the results of the SAMPL2 challenge, it is by far not a satisfying result. The accuracy, compared to the cost of the approach, is not justifiable and there is still a dire need for more accurate and cheaper methods to obtain relative solvation free energies for tautomer pairs.
With this we close the investigation of the state of the art QM free energy calculations and will investigate some of the shortcomings of the RRHO thermochemistry calculations. Some of the insight gained above could have been concluded from the work of Kast et al.,14 Ribeiro et al.,15 Klamt and Diedenhofen,16 Soteras et al.,35 but the small number of chemical species investigated there and the often contradicting results reported elsewhere (e.g. ref. 36 and 37) made a closer investigation of QM free energy calculations for a large dataset necessary. The reported results will be of importance for further method development for tautomeric free energies.
While the scientific community agrees on the importance of the global minimum for property calculations, the importance of considering multiple conformations for quantum chemistry free energy calculations has not been well established (e.g. ref. 38). Using multiple minimum energy conformations can add substantial computational cost to the free energy calculations. Often, the global minimum conformation search can be performed with a lower level of theory than the single point energy calculation, making only a single high level electronic energy calculation necessary. Frequency calculations can also add considerably to the computational cost of the free energy calculation. If a single, minimum energy conformation is sufficient to obtain a good estimate for the free energy in gas phase substantial amount of computational time could be saved.
Fig. 6 compares the tautomeric free energy difference obtained using the weighted average over multiple minimum conformations and the single global minimum conformation. The results indicate that using multiple minimum conformations does not significantly improve the final result compared to using a single, global minimum structure. Only 12 tautomer pairs show an absolute deviation larger than 1 kcal mol−1 in the tautomeric free energy difference ΔtGcalcsolv between the two approaches. Much more relevant than including multiple conformations is locating the global minimum conformation (as clearly shown by Fig. 5).
Alchemical relative free energy calculations were performed for 354 tautomer pairs using 11 alchemical λ states in vacuum. In the following, we will compare the tautomeric free energy difference obtained using alchemical relative free energy calculations to the multiple minima, RRHO approximation using the same potential energy function (ANI-1ccx) to assess potential errors in the thermochemistry corrections. We will also show how a small number of experimentally obtained tautomer ratios in solution can be used to incorporate crucial solvent effects and recover tautomer free energies in solution by QML parameter optimization and importance weighting.
While ANI-1x was trained on energies/forces calculated with ωB97x/6-31G*(d), ANI-1ccx was retrained from the same level of theory to include energies recalculated with coupled cluster considering single, double, and perturbative triple excitations (CCSD(T)) with an extrapolation to the complete basis set limit (CBS).41 CCSD(T)/CBS is the gold standard for electronic energy prediction and calculates thermochemical properties to within the limit of chemical accuracy (e.g. ref. 42).
Since the simulation time of the individual lambda states for the alchemical relative free energy calculations were relatively short (200 ps) we repeated the calculations multiple times (5) with randomly seeded starting conformations and velocities to detect systems for which the simulation time was clearly insufficient. In the following we will only use systems that had a standard deviation of less than 0.3 kcal mol−1 for 5 independent alchemical free energy calculations. Applying this filter resulted in the removal of 65 tautomer pairs for which the free energy calculation had not converged.
Results shown in Fig. 7 indicate the average deviation that every relative free energy calculation based on the RRHO approximation introduces, regardless of the accuracy of the actual potential to model electronic energies. The mean absolute deviation of 0.9 kcal mol−1 should not be underestimated. Results shown in Table 1 would be significantly improved if an error of 0.9 kcal mol−1 could be compensated by running a protocol that samples the relevant conformational degrees of freedom to obtain an exact partition function.
For 12 (out of 289) tautomer pairs the multiple minima RRHO approximation deviated by more than 3 kcal mol−1 (molecules are shown in Fig. S.I.1†). Most of these molecules have high conformational degrees of freedom and it seems unlikely that a naive enumeration of relevant conformations (e.g. with a conformer generator) will detect all of them—this might contribute to the observed error. That is certainly true for tp_113, 116, 565, 554, 403, 1674.
In the following we want to investigate if thermodynamic observable—in this specific case tautomer ratios—can be used to retrain a neural net potential derived from QM calculations. Further, we want to test how such an optimized parameter set would perform on the original dataset used to train the neural net potential and if it is possible to use the difference between the energies of the original and optimized parameter set for regularization.
Defining a loss function L as the error between the calculated ΔtG(θ)calcvac and experimental ΔtGexpsolv free energies it is possible to optimize the loss with respect to the neural net parameters θ defining the ANI potential. Using importance weighting, a new free energy estimate can be calculated with the optimized QML parameters (θ*) using the modified potential energy function without resampling the equilibrium distributions. To ensure that the quality of the neural net parameters does not deteriorate beyond a reasonable threshold a regularization term was added that acts on the energy difference for each snapshot calculated with the endstate potential energy functions with the original and optimized parameter set (ΔE(θ, θ*)). The regularization term was included in the molecular loss function acting on the snapshots used for the free energy calculation for the 212 tautomer pairs in the training set—in the course of a single epoch ΔE(θ, θ*) is evaluated on a total of 699600 snapshots (212 tautomer pairs × 11 lambda states × 300 conformations).
Initial results led to the introduction of scaling factors that allows to slowly increase the contribution of the tautomeric free energy difference in the loss function (and/or the regularization term) during the training. The protocol is described in more details in the Methods section.
It was possible to overfit the parameters on the training set within 50 epochs if no regularization term was used (training/validation performance shown in Fig. S.I.7I and M†). Without a regularization term the RMSE for ΔE(θ, θ*) on the training/validation/test snapshots rises to 40–100 kcal mol−1. While the training set performance improves the validation set performance reaches a plateau around 2.2–2.6 kcal mol−1. The red dotted line in Fig. S.I.7† top panels indicates the performance of the reported results in Fig. 8, indicating that even without regularization this performance can not be improved significantly.
The results shown in Fig. 8 and S.I.7† indicate that there is a trade off between the accuracy of the ANI-1ccx parameters on its original training set (either directly shown in Fig. S.I.7† or indirectly through ΔE(θ, θ*) in Fig. 8) and the improvement for the prediction of tautomeric free energy differences. This is best exemplified in the training/validation set performance shown in Fig. S.I.7F and X.† For both training runs the scaling factor of the regularization term was increased throughout the epochs, shifting the focus from optimizing the free energy to keeping ΔE(θ, θ*) as small as possible. In Fig. S.I.7X† the ΔE(θ, θ*) term in the loss was kept constant throughout the first 100 epochs and then raised approximately tenfold during the next 50 epochs. Looking at the training/validation set performance it becomes evident that starting with epoch 100 the initial trend towards improved performance starts to reverse while the MAE for ΔE(θ, θ*) decreases. After 400 epochs the MAE for ΔE(θ, θ*) is near zero and there is no improvement in the free energy estimate for the training/validation set. This indicates that the parameter set approaches the same minimum that was occupied before training started.
Whenever both weights and biases are trained for the QML potential it often takes a few hundred epochs before ΔE(θ, θ*) approaches reasonable values (∼2 kcal mol−1), shown e.g. in Fig. S.I.7L, J, G, and A.† Similar behavior was observed even if the learning rate for the biases was reduced to 1 × 10−6 as shown in Fig. S.I.7T.† In some instances it was not possible to converge and successfully minimize both terms of the loss function in 400 epochs. One such case is shown in Fig. S.I.7E.†
In Fig. 8A the training/validation set performance of a training run with 400 epochs is shown. Only weights were optimized with a learning rate of 1 × 10−4, regularization was used (the scaling factors are plotted in Fig. S.I.9†). Fig. 8B shows the obtained free energy estimates with the best performing parameter set (calculated on the validation set) on an independent test set (71 tautomer pairs). The optimized parameter set (θ*) was able to improve the prediction of tautomeric free energies on the test set from initial 6.7 [5.7,7.7] kcal mol−1 with the original parameter set (θ) to 2.8 [2.2,3.3] kcal mol−1, comparable to the performance of B3LYP/aug-cc-pVTZ/B3LYP/6-31G(d)/SMD. Fig. 8B shows the distribution of . The Kullback–Leibler divergence (KL) value for indicates that the optimized parameter set is able to reproduce the distribution of the experimental free energies better than the initial parameter set.
Fig. 8A shows the performance of the parameter set on the ANI-1ccx training dataset.§ The RMSE of the parameter set for the ∼500000 data points of the dataset increases from initial 1.7 kcal mol−1 to 5.5 kcal mol−1. This increase in RMSE can be partially explained by the target of the retraining: tautomeric free energies in solution. Including solvation effects necessarily decreases the performance on a gas phase dataset.
There is a limit to the reweighting workflow, as indicated in Fig. S.I.8,† that requires to re-sample the equilibrium distributions with the perturbed parameter set. The uncertainty of the perturbed free energy estimate increases as the weights become more concentrated on fewer observations and the effective sample size shrinks.43 Using an arbitrary cutoff of 1 kT as acceptable for the perturbed free energy uncertainty Fig. S.I.8† indicates that conformations should be resampled after around 400 epochs with the perturbed parameter set.
Coming back to the question posed in the beginning of this section: yes, it appears to be possible to use experimental values of a thermodynamic observable to retrain an already optimised QML potential to improve its performance for predicting free energies. This is astonishing, considering the small set of experimental measurements used to retrain the potential (≈250 data points) that provides an unambiguous and substantial improvement in performance on the held-out test set. The learning curve shown in Fig. S.I.12† indicates that using 5% of the training data already improves the MAE by 2 kcal mol−1.
The regularization term restraining the QML parameter set so that it is still able to reproduce single point energies with comparable quality than the original parameter set (with an increase in RMSE of ∼3 kcal mol−1) offers an interesting trade off between high quality QM single point energies and highly improved tautomeric free energy differences. It appears as if the regularization term is not necessary to improve the tautomeric free energy difference estimate, but introducing such a regularization term in the loss function does also not significantly hinder the parameter set optimization (if the scaling factors for the two terms in the loss function are chosen reasonably). These results highlight the incredible potential of QML energy functions that are differentiable with respect to its parameters in addition to coordinates, enabling model tuning based on experimental and quantum chemical data to be both facile and incredibly powerful.
One possible source of error—independent of the method used to calculate the electronic energy and model the continuum electrostatics—are the thermochemical corrections used to obtain the standard state free energy. Typically, an analytic expression is used to approximate the partition function which is based on the rigid rotor harmonic oscillator approximation. To obtain the correct and unbiased partition function and compute rigorous free energy estimates we implemented an alchemical relative free energy workflow using the ANI family of quantum machine learning (QML) potentials.23 The method was implemented as a python package and is available here: https://github.com/choderalab/neutromeratio.
Using the same potential for the calculations based on the RRHO and performing alchemical relative free energy calculations we are able to show that the RRHO approximation introduces a mean absolute error of ∼1 kcal mol−1 in the calculation of the investigated relative free energies. These errors can be attributed to anharmonicity in bonded terms, difficulties to enumerate relevant minimum conformations and in combining shallow local energy wells as well as the inconsistent treatment of internal and external symmetry numbers.
The ANI family of QML and comparable QML potentials have opened the possibility to investigate tautomer ratios using relative free energy calculations without prohibitive expensive MM/QM schemes or ab initio simulations. The calculated alchemical free energies obtained using the methods implemented in the “Neutromeratio” package can be used to optimized the parameters of the QML potential. Using a small set of experimentally obtained tautomer ratios we were able to optimize the QML parameters on a training set and significantly improve the accuracy of the calculated free energies on an independent test set. The tautomeric free energies obtained with the optimized parameter set improved the RMSE from initial 6.7 kcal mol−1 to 2.8 kcal mol−1 on a hold-out test set of 71 tautomer pairs.
What should be noted here: the experimental values are relative solvation free energies, while we calculate relative gas phase free energies. To calculate correct tautomeric free energies in solution, solvent effects need to be included. In this work solvent effects are modeled by fitting the ANI QML parameters to experimental relative solvation free energies. The use of explicit solvent molecules is the preferable and recommended solution. The optimization of QML parameters on a small set of experimental free energies can be extended easily for explicit solvent simulations (in Table S.I.2† we show results for alchemical free energy calculations for 6 of the previously investigated tautomer systems in a water droplet).
Obtaining accurate free energy differences between tautomer pairs in solvent remains an elusive task. The subtle changes and typically small difference in internal energies between tautomer pairs require an accurate description of electronic structures. Furthermore, solvent effects have a substantial effect on tautomer ratios; consequently, a proper descriptor of solvation is essential. The change in double bond pattern typically also induce a change in the conformational degrees of freedom and, related, in the conformation and rotational entropy. But—despite all of these challenges—we remain optimistic that further developments in fast and accurate neural net potentials will enable improved and more robust protocols to use relative free energy calculations to address these issues.
From the dataset a subset of tautomer pairs were considered that
(1) were measured/calculated/estimated in aqueous solution
(2) had a numeric logK value between ±10
(3) had no charged species
(4) did not contain iodine
(5) only a single hydrogen and double bond change its position.
476 of the 1680 deposited tautomer pairs had these properties. We added two tautomer pairs from the SAMPL2 challenge (tautomer pair 2A_2B and 4A_4B).12 478 unique tautomer pairs were considered for further analysis. The term ‘unique’ tautomer pair was defined in the following as containing a unique combination of two molecules. We only consider tautomer pairs, not tautomer equilibria with multiple tautomer forms. In the following we will use an identifier containing the row number entry from the DataWarrier file to identify tautomer pairs in the dataset, e.g. tp_200 describing the tautomer pair (tp) at row number 200 in the original DataWarrior file.
The logK value was converted to free energies with ΔtGexpsolv = −RTlnK. The free energy difference of the tautomer pairs obtained from the Tautobase are subsequently referenced as ΔtGexpsolv in contrast to the calculated values which are called ΔtGcalc. While we call all values deposited in the Tautobase ΔtGexpsolv we want to point out that some of these values are estimated or calculated.
Unfortunately, experimental error estimates are not deposited in the Tautobase and subsequently also not modeled in this manuscript. This highlights the need for further experimental measurements of tautomer ratios.
A closer inspection of some of the outliers from preliminary calculations identified molecules with incorrect structures in the database (row entry: 1260, 1261, 1262, 1263, 1264, 514, 515, 516, 517, 1587)—these 10 tautomer pairs were subsequently removed from all QM calculations (simulations using ANI included the corrected tp_1260, tp_1261 and tp_1587). These structures are corrected in the current version of the Tautobase. Additionally, 8 tautomer pairs containing bromide (row entry: 989, 581, 582, 617, 618, 83, 952, 988) were removed from calculations performed with the basis set 6-31G(d) due to the lack of adequate parameters.
ANI-1 is parameterized on molecules containing only carbon, nitrogen, hydrogen and oxygen, making a further removal of tautomer pairs containing elements beside the aforementioned necessary.23 This resulted in 369 tautomer pairs.
The tautomer pair set used for relative alchemical free energy calculations needed one additional filter. Molecules with a stereobond that changes its position between tautomers had to be removed (this affected 15 tautomer pairs: tp_1637, tp_510, tp_513, tp_515, tp_517, tp_518, tp_787, tp_788, tp_789, tp_810, tp_811, tp_812, tp_865, tp_866, tp_867). Such stereobonds introduce additional complexity since it would be necessary to introduce restraints to define the stereochemistry during the lambda protocol.
The subset of the Tautobase used for the QM calculations can be obtained here as list of SMILES (468 tautomer pairs): https://github.com/choderalab/neutromeratio/blob/master/data/b3lyp_tautobase_subset.txt. The subset of the Tautobase used for the QML calculations can be found here as list of SMILES (354 tautomer pairs): https://github.com/choderalab/neutromeratio/blob/master/data/ani_tautobase_subset.txt. The distribution of the experimental tautomer free energies for both datasets is shown in Fig. 9.
Fig. 9 The tautomer dataset shows a wide variety of solvation free energies ΔGexpsolv. (A) shows the ANI-Tautobase subset that was used for the QML calculations and (B) shows the QM-Tautobase subset used for the QM calculations. The full dataset considered for this study was obtained from the DataWorrier File deposited at https://github.com/WahlOya/Tautobase (commit of Jul 23, 2019), described in detail in ref. 9. The selection criteria for both datasets are described in detail in Detailed methods section. |
(2) |
(3) |
The Gibbs free energy in gas phase for a given coordinate set (k) was obtained by adding thermal corrections for conformer k at temperature T and zero point energy (ZPE) contributions εZPE,K to the electronic energy Ek.22 The degeneracy D describes the entropy contribution of internal rotors and is not added for the calculations shown in the main text (results including degeneracy D are shown in the ESI†).
(4) |
The degeneracy D was estimated by calculating the graph automorphism of the molecule. The implementation of the VF2 algorithm for graph isomorphism of networkx was used.50 Nodes were defined to match if element and hybridization matched, edges were identical if bond order matched.
The tautomeric free energy difference in solution ΔtGcalcsolv can be calculated from the standard-state Gibbs free energy in aqueous phase of the product and educt of the corresponding tautomer reaction, which itself is calculated as the sum of the gas-phase standard-state free energy and the standard-state transfer free energy expressed as
(5) |
An alternative way to calculate the free energy in aqueous phase used in the manuscript is
(6) |
By default, the coordinates of the hybrid topology were generated by using the coordinates of ‘Tautomer1’ (as defined in the Tautobase database). If a tautomer isomerism created or removed a cis/trans stereobond, the initial coordinates were taken from the topology with the stereobond present (therefore sometimes changing the direction of the tautomer reaction).
The coordinates of the added, non-interacting hydrogen were obtained by randomly sampling 100 positions on the surface of a sphere (with radius of 1.02 Å) defined around its new bonded heavy atom and subsequently using the lowest energy position as the starting conformation for each lambda window for the free energy calculation. The physical endstates (representing the two tautomer states, each with an additional non-interacting (dummy) hydrogen) were connected via 11 equidistant intermediate (lambda) states.
Energy and forces were calculated using ANI1-ccx as implemented in torchani https://github.com/aiqm/torchani). The energy and force was linearly scaled along the alchemical path as a function of lambda with
E = (1 − λ)E1 + λE2 | (7) |
Because nuclei can rearrange to form distinct chemical species in highly perturbed simulations, we applied a flat-bottom harmonic restraint to covalent bonds to ensure we sampled the desired chemical species in initial and final states of the free energy calculation. The restraint was defined as
(8) |
Samples were obtained from 200 ps simulations for each lambda state. For each tautomer pair calculations were repeated 5 times with randomly seeded initial velocities (and coordinates). Relative alchemical free energies ΔGcalcvac were calculated using MBAR as implemented in the pymbar package.43 300 uncorrelated snapshots were considered for the MBAR analysis from each lambda state.
The model was trained by minimizing the mean squared error (MSE) loss between calculated and experimental relative free energies. If regularization was used, the mean absolute error (MAE) between the energy for the parameter set θ* and the original parameter set θ was calculated on the individual snapshots used for the free energy calculation (11 × 300 snapshots) with the potential energy function at λ = 0 and λ = 1. The regularization term was normalized using the number of atoms of the tautomer system. The per molecule pair (m) loss function l is defined as
(9) |
The overall loss is then
(10) |
The perturbed relative free energy ΔG(θ*)calcvac,m required for computing l(m; θ*) is calculated by importance sampling, or “reweighting” the original MBAR estimate from the original pre-trained parameters θ to the current parameters θ*. To compute this estimate efficiently at arbitrary θ*, we first collect configuration samples at a reference value θ (corresponding to the original parameters of the pretrained ANI-1ccx model) for each intermediate value λ. For each configuration sample x, we compute the reduced potential u(x, λ; θ), to form the N × M matrix of inputs to MBAR (where N is the total number of snapshots, M is number of λ windows). MBAR equations are solved to yield a vector of reduced free energies , where m indexes the intermediate λ values. The relative free energy prediction implied by the model parameters θ* is then .
To compute relative free energies at a new value of the parameters θ*, we need to compute u(x, λ = 1; θ*) and u(x, λ = 0; θ*) for all configurations x. The optimization routine evaluates the gradient of the loss function L w.r.t. θ* by automatic differentiation and updates the parameters.
300 uncorrelated snapshots per λ state were used for the MBAR estimate and importance sampling for the vacuum simulations.
• Data and notebooks to reproduce the plots/figures (release v0.2): https://zenodo.org/record/4562426
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d1sc01185e |
‡ Current address: Relay Therapeutics, Cambridge, MA 02139, USA. |
§ https://qcarchive.molssi.org/apps/ml_datasets/ |
¶ https://aiqm.github.io/torchani/examples/nnp_training.html |
This journal is © The Royal Society of Chemistry 2021 |