Kai
Riedmiller
a,
Patrick
Reiser
bc,
Elizaveta
Bobkova
a,
Kiril
Maltsev
a,
Ganna
Gryn'ova
ad,
Pascal
Friederich
*bc and
Frauke
Gräter
*ad
aHeidelberg Institute for Theoretical Studies, Heidelberg, Germany. E-mail: frauke.graeter@h-its.org
bInstitute of Theoretical Informatics, Karlsruhe Institute of Technology, Engler-Bunte-Ring 8, Karlsruhe 76131, Germany. E-mail: pascal.friederich@kit.edu
cInstitute of Nanotechnology, Karlsruhe Institute of Technology, Hermann-von-Helmholtz-Platz 1: 76344 Eggenstein-Leopoldshafen, Germany
dInterdisciplinary Center for Scientific Computing, Heidelberg University, Heidelberg, Germany
First published on 16th January 2024
Hydrogen atom transfer (HAT) reactions are important in many biological systems. As these reactions are hard to observe experimentally, it is of high interest to shed light on them using simulations. Here, we present a machine learning model based on graph neural networks for the prediction of energy barriers of HAT reactions in proteins. As input, the model uses exclusively non-optimized structures as obtained from classical simulations. It was trained on more than 17000 energy barriers calculated using hybrid density functional theory. We built and evaluated the model in the context of HAT in collagen, but we show that the same workflow can easily be applied to HAT reactions in other biological or synthetic polymers. We obtain for relevant reactions (small reaction distances) a model with good predictive power (R2 ∼ 0.9 and mean absolute error of <3 kcal mol−1). As the inference speed is high, this model enables evaluations of dozens of chemical situations within seconds. When combined with molecular dynamics in a kinetic Monte-Carlo scheme, the model paves the way toward reactive simulations.
A major challenge in predicting chemical reactivity in proteins, such as unspecific radical transfer reactions, is the molecular environment of the reaction: it determines the reactivity but is both chemically very diverse and highly dynamic. This leads to a virtually infinite number of possible reaction scenarios, in which reactants represent instances within a vast chemical and conformational space. As a consequence, directly computing this amount of radical reactions by ab initio calculations is computationally not feasible. Instead, machine learning can leverage quantum chemical calculations by predicting reactivity based on an initial quantum chemical data set. We here set out to predict the energy barriers of hydrogen atom transfer reactions in proteins using graph neural networks that are trained on computed energy barriers.
Machine learning has been successfully applied to predict structures, energies, and properties of molecules.7–11 However, for the prediction of kinetic quantities, machine learning only gained traction in the last years.12 Previous works often reported respectable accuracies below 1 kcal mol−1, but relied on inputs derived from DFT calculations.13–15 While a low number of single point energy calculations are still less expensive than the otherwise required optimizations, ideally one would want to replace all DFT calculations at inference time, be it at the loss of some accuracy. And indeed, this has been attempted for, e.g., catalysis on metal surfaces and reactions between small molecules, reaching an MAE of 5 and 2.6 kcal mol−1, respectively.16,17 One notable exception is the prediction of dihydrogen activation with Vaska's complex by co-author Friederich et al., where accuracies below 1 kcal mol−1 were achieved without the need for DFT calculated inputs, however over a very narrow range of 0–25 kcal mol−1. Taken together, predicting reaction barriers by a surrogate model which does not use DFT data as input has remained a challenge, in particular in case of a complex chemical environment and a large variation in barriers.
For the prediction of HAT energy barriers in this work, we decided to use equivariant graph neural networks, as they performed best on previous tests compared to random forest regressors, dense NNs and non-equivariant NNs. This is representative of the overall development in representations of chemical systems for machine learning. Representations such as the Coulomb matrix18 perform well on simpler tasks, but are typically outperformed by more expressive representations, e.g., atom-centered symmetry functions,7 smooth overlap of atomic positions (SOAP),19 and, in the last years, graph neural networks (GNNs).9,20 The latest accuracy increases can be partially attributed to moving from invariant to equivariant models, where we arrive at the GNN used in this work, namely PaiNN.21,22 Originally, PaiNN was designed to predict global quantities like total energy, but as the reaction barrier does not scale with system size, we adjusted the architecture for the prediction of local quantities.
In this work, we focus on predicting HAT reaction barriers within one particular protein system, collagen (Fig. 1). As shown earlier by some authors of this article, stretching collagen generates mechanoradicals within the protein.23 These radicals rapidly localize on specific protein residues, dihydroxyphenylalanine (DOPA), plausibly through a sequence of HAT reactions. Insights into the reaction pathway of radicals might help in the design of similarly durable polymer systems as collagen but are experimentally challenging to obtain. Our work is an important step towards predicting radical reactions in collagen. However, as we show in addition, the model is also applicable to proteins of different composition. We also suggest the developed workflow to be straightforwardly applicable to other, not necessarily biological, polymers.
We built thousands of molecular fragments as they occur in collagen and calculated HAT energy barriers on the level of hybrid density functional theory (DFT). The computed reaction barriers range between 0 kcal mol−1 and 175 kcal mol−1 and are highly dependent on the local environment, rationalizing the machine learning approach. We used our quantum chemical data to train the GNN, which is able to predict barriers approaching DFT accuracy.
Our machine learning model predicts the energy barrier for one selected reaction at a time while taking the chemical environment around the radical in the reactant state as input (Fig. 1). Importantly, the reactant state is directly cut out from a molecular dynamics (MD) simulation, and no optimization is needed prior to the barrier prediction, neither a single step energy calculation as input. It thus can be used as a surrogate model of hydrogen atom transfer within classical MD simulations to model radical propagation within collagen or other (bio)materials on the fly, e.g., by using hybrid kinetic Monte-Carlo and MD simulations.24 Our GNN-based approach tackles the challenge of predicting reaction barriers in a heterogeneous and dynamic chemical setting, and will likely prove useful for other complex soft matter systems.
Synthetic systems are pairs of amino acids arranged in a way that two hydrogen atoms are in a defined position to one another. As shown in Fig. 2A, the translation distance between the hydrogens, the rotation, and the tilt angle are varied. The positions of these two central hydrogen atoms represent the start and end positions of a single hydrogen atom undergoing the HAT reaction. Furthermore, intramolecular reactions are generated from within single amino acids. Combinations of hydrogen atoms with less than 4 Å distance are considered. Systems with atoms closer than 0.8 Å to the transition path are removed.
Fig. 2 (A) Build process of reactive HAT systems for the synthetic data set. The HAT reaction between two example molecules is shown. The distance between the start and end position of the transferring hydrogen (translation), the angle formed by the transferring hydrogen with the donating and accepting heavy atoms (tilt), and the dihedral angle around the hydrogen atom transfer axis (rotation) were varied to construct the synthetic systems. (B) Data set of optimized structures, built from synthetic and trajectory systems. The optimized transition state is shown, alongside the interpolated reaction path of the hydrogen in green and the start position in orange. The non-optimized structure is shown translucently. (C) A trajectory system with its environment shown translucently. The radical heavy atom is highlighted in green and the reacting hydrogen in yellow. The solid-drawn atoms at the border to the translucent environment are used in the construction of the capping groups, the translucent atoms are discarded. (D) Data distribution of the synthetic and trajectory data sets. (E) Architecture of the used graph neural network, based on PaiNN.22 (F) Rolling average of the calculated energy barriers of HAT reactions in the data set vs. the distance the hydrogen has to move during the reaction. The shaded area corresponds to ± standard deviation, dashed lines indicate bond dissociation energies for context.25 |
The generation of reactive systems from MD trajectories starts from a collagen model obtained from Colbuilder.26 The model is simulated using GROMACS 2020. In the resulting trajectory, possible reaction sites are identified by monitoring H–H distances. As energy barriers, Ea, heavily depend on the translation distance, an emphasis is put on smaller translations when sampling. The HAT candidates are cut out together with their close surrounding from the bigger system, excluding solvent molecules. To generate chemically meaningful systems and to allow reference DFT calculations, the cut-out sections of the protein are capped using N-methyl and acetyl groups. In Fig. 2C, the capping procedure is visualized (also see Fig. S3B†). For a given set of selected atoms in a trajectory, only the system with the smallest translation distance is kept, as otherwise a large amount of highly correlated systems would be generated.
At this point, the reactive systems have been defined. Further preparation steps are applied to synthetic as well as trajectory systems. One of the two central hydrogen atoms is removed and therefore acts as the starting location of the radical. Then, the reaction path is estimated by interpolating the position of the remaining hydrogen atom from its starting position to the position of the removed hydrogen atom. We originally used 11 equally spaced interpolation increments, but after the first set of calculations of 1600 HAT reactions, all barriers were found to be located within steps 4–8, 1 or 11, so steps 2, 3, 9, and 10 are omitted in all later calculations and the data used here. This means, we define the reaction paths using seven structures. Along the reaction path, the energy of the system is calculated using the hybrid functional BMK27 together with the 6-31+G(2df,p) basis set, using Gaussian 09 (rev D.01).28 Electronic energies were used throughout the study. The BMK functional was chosen, because it was designed for kinetic studies, and it has proven highly capable of accurately calculating HAT reaction barriers.29 More details on the building of the systems and the calculations are given in the ESI.†
In many cases, the guess of the transition path by interpolating between the start and end position of the hydrogen whilst keeping the rest of the system frozen is reasonable, however, it introduces a significant error in conformationally flexible systems. Furthermore, since no free radical parameters are available in the MD simulations, all resulting geometries correspond to closed-shell species (Fig. S4†). To address this, a subset of the combined synthetic and trajectory data set is optimized at the same QM level as used for the energy calculations. More precisely, as a reaction is defined by its reactants, the transition state, and the end products, those three structures are being optimized to their energy minima or saddle point, respectively. For transition states, frequency calculations were performed to ensure the existence of a single imaginary frequency corresponding to the correct reaction. During optimization, we freeze all atom positions except the donor and acceptor atoms and directly bonded hydrogens, to reflect the embedding of the reactants into the structure of the material, here the protein backbone, and to prevent contributions to the calculated energy barrier from rearrangements in the reactants unrelated to the HAT. As the model is intended to be used on many MD trajectory snapshots, we want to restrict the DFT optimization to degrees of freedom unable to be sampled in MD, namely geometry changes due to closed-to open-shell electronic structure transition. Degrees of freedom that can be sampled in MD, like rearranging side chains, shall not be sampled during the DFT optimization, but in the preceding MD simulation. To achieve this, we tested differently sized optimization regions, details are described in the ESI,† including illustrations and statistics in Fig. S5†
In this work, 4393 synthetic structures and 5261 structures from trajectories are generated. Of these structures, 10% are set aside randomly for testing. 803 structures form the optimized data set, most of them (725) originate from the trajectory data set, 78 are synthetic systems. They inherit the test/training set membership from their non-optimized versions, to ensure no structures from the training leak into the test set. Note that each structure has two associated energy barriers (one forward, one backward reaction), resulting in twice the amount of energy barriers for training/testing. In summary, 17370 energy barriers are used for training, 1938 for testing. Fig. 2D shows the distribution between the data sets.
Unsurprisingly, the calculated barriers show a strong dependence on the distance the hydrogen atom has to travel during the reaction, as can be seen in Fig. 2F. However, barriers vary significantly for a given translation, substantiating the need to use more complex system descriptions than just translation.
The synthetic data, by construction, includes data at translations down to 0.3 Å and with barriers smaller than 20 kcal mol−1, cases not covered in the trajectories, where small interatomic distances are disfavored and thus rare. Translations found in trajectories start from 0.7 Å, but the vast majority are larger than 1.2 Å (Fig. S1B†). Out of the 10522 reactions from trajectory systems, only 24 reactions were found with barriers below 20 kcal mol−1. In 21 of these systems, the reaction involves at least one hetero atom. Also, the translations are comparatively short, with an average distance of 1.2 Å. Examples of the lowest and highest barriers can be found in Fig. S2.†
An ensemble of models was trained on all available training data without a cutoff applied, and evaluated on the whole test data, and only on the trajectory systems of the test data, each evaluation with a cutoff applied. We chose to evaluate on the trajectory data as a major application is the prediction of HAT barriers for conformations of proteins (here collagen) encountered during MD simulations.
Performance metrics are summarized in Fig. 3. Panels A and B show the performance for trajectory systems with translations below 2 Å, i.e., focus on the most feasible HAT reactions, while panels D to F show measurements using all available trajectory data. For completeness, evaluations on synthetic data alone are shown in Fig. S6.†
As can be seen from Fig. 3A, we achieve MAEs of 2.4 ± 2.5 kcal mol−1 using the ensemble model. Individual models only achieve 2.7 ± 2.7 kcal mol−1 on average on the trajectory data with the translation cutoff in place (Fig. 3B).
The prediction quality heavily depends on the amount of training data available, as shown in Fig. 3C. Adding training data improves the model up to ∼90% of the available data, which corresponds to 14068 individual barriers. Thus, the amount of training data generated at the BMK/6-31+G(2df,p) level is approximately required to reach this accuracy, but also appears to suffice, as the learning curve flattens towards the end. Training curves are shown in Fig. S7.†
The accuracy decreases for systems with bigger translations and energy barriers, as shown in Fig. 3D and E: increasing the translation cutoff from 2 Å to 3 Å introduces more high-barrier systems to the test set, which seem to be harder to predict exactly. For the prediction of the propagation pathway of a radical in a complex environment, this might be acceptable though, as reactions with high energy barriers are unlikely to occur under ambient conditions.
As mentioned, the use of an ensemble model also brings the advantage of an uncertainty measure: the standard deviation between the models. In Fig. 3F, the absolute ensemble error is plotted against the ensemble standard deviation together with a rolling average. For a low standard deviation (smaller than 1.7 kcal mol−1), one can assume a low prediction error (<3 kcal mol−1) quite confidently. On the other hand, higher standard deviations no longer scale reliably with the error.
Notably, the energy barriers of synthetic systems change less when optimized compared to barriers from trajectory systems. This is likely due to higher atom density in trajectory systems: it is more likely in trajectory systems that atoms interfere with the transition path. In other words, the lowest energy reaction path changes more in trajectory systems relative to the interpolated path. This analysis also serves as a validation for the procedural structure building process. If the built structures were unreasonable, they would change more drastically during optimization compared to the structures produced by MD simulation. At a given translation, the optimized reactions generally show lower barriers (see also Fig. 2F).
The ensemble model for predicting optimized barriers achieves an MAE of 3.6 ± 3.2 kcal mol−1 on trajectory data with translations of less than 2 Å and 4.9 ± 4.0 kcal mol−1 on translations less than 3 Å (Fig. 5A and B, respectively). The learning curve of the transfer learning procedure in Fig. 5C suggests that the model is data limited, as the accuracy increases in particular from 90% over 95% to 100% of the optimized test data. Training curves are provided in Fig. S8.†
For predicting the barriers, the ensemble model trained on collagen trajectories and synthetic systems was used. As shown in Fig. 5D, the model does not perform as good as on its training target collagen, but still achieves an MAE of 4.6 ± 4.8 kcal mol−1. We note that collagen only consists of triple helices and the model has not seen other secondary structures, let alone intrinsically disordered proteins. On top, the amino acid distribution varies significantly between the training data set and the testing data used here (Fig. S9†). Therefore, this is close to the worst case scenario, and yet the model still delivers useful results.
More importantly, starting from the pretrained model, the performance can be improved at a low cost in a transfer learning scheme as outlined earlier. Using only 500 new training structures from the FERM F0F1 domain, the MAE decreases already to 3.7 ± 4.2 kcal mol−1 (Fig. 5E).
A limitation of the depicted method is that the type of reaction one can predict is predefined. The model predicts hydrogen atom transfer only, while we can not rule out proton coupled electron transfer to play a role in this system.
Further, tunneling effects are ignored, and all QM calculations are performed in the gas phase. However, these last two points are limitations of the dataset, not the underlying method. As soon as higher quality data is available, the network can utilize these by, e.g., transfer learning. We thus propose our GNN-based model as a starting point to build more refined models such as one taking quantum effects into account.
Another limitation of the input data set is the use of MD structures without force field parameters for the radicals. We correct these structures implicitly in the GNN by training with barriers obtained from DFT optimization in the transfer learning step. While DFT would still be more accurate than MD alone, having proper parameters for the radical available would lower the complexity of the task the model is facing, since the input structures would be more similar to the DFT structures. We would expect an acceptable accuracy even when omitting the transfer learning step, and also an increase in data efficiency. Methods like Espaloma37 could be used in the future to obtain such force field parameters on the fly after every HAT reaction.
To justify the use of an arguably complex graph neural network, we compared it to two simpler methods, a densely connected feed-forward neural network and a random forest model (as implemented in scikit-learn,38 see ESI†). For both methods, we used the L-MBTR33 descriptor as input and the barriers computed for non-optimized reactions as targets. Details on the network architecture and input generation are given in the Methods section. Using an ensemble of ten models, the feed-forward neural network accomplishes an MAE of 4.8 kcal mol−1, and single models achieve 5.1 kcal mol−1 on average over trajectory data (see Fig. S10†). The random forest model achieved an MAE of 6.6 kcal mol−1 (Fig. S11†). Taken together, these results highlight the need for more sophisticated representations and models to capture subtle structural differences. However, it is important to note that our model significantly outperforms semi-empirical methods, in both accuracy and speed. More specifically, we compared barriers calculated with the cheaper GFN2-xTB method to BMK-calculated ones on 100 randomly chose structures.39 xTB calculations of the barrier do not reach the needed quality as they differ to BMK calculated barriers by an MAE of 20 kcal mol−1 (Fig. S12†).
Our ultimate aim is to model the chemistry of radical-induced damage to collagen, whilst simultaneously capturing the dynamic nature of this system. Kinetic Monte-Carlo (KMC) method enables incorporating reactions into MD on timescales beyond those covered by conventional MD simulations. A hybrid KMC-MD approach models reactions in a Markov-process, allowing arbitrarily big time jumps between reaction steps.40 Previously, we coupled KMC with MD to simulate homolytic bond rupture in stretched collagen fibrils in a method called KIMMDY.24 Our GNN-based model for predicting reaction barriers allows applying the KIMMDY approach to radical transfer reactions. Inferring an energy barrier of a reaction from a trained neural network within a reactive MD simulation substitutes the otherwise computationally costly quantum chemical calculation, and only marginally compromises the efficiency of standard MD simulations.
In the past, several methodologies were developed to achieve reactive MD, including reactive force fields, such as ReaxFF41 and AIREBO,42 hybrid quantum mechanical/molecular mechanical (QM/MM)43 simulations, and, more recently, molecular dynamics simulations paired with machine-learned force fields (MLFF).44,45 However, all these methods are slower compared to regular MD46 and are by default restricted to reactions on the timescale of the simulation. KIMMDY overcomes these drawbacks but relies on the availability of reaction rates, which can now be provided with the model introduced here. Implementation and results of the extended KIMMDY software are beyond the scope of this paper, and will be published later.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3sc03922f |
This journal is © The Royal Society of Chemistry 2024 |