Neil R.
McFarlane
and
Jeremy N.
Harvey
*
Department of Chemistry, KU Leuven, B-3001 Leuven, Celestijnenlaan 200f, 2404, Belgium. E-mail: jeremy.harvey@kuleuven.be
First published on 26th January 2024
In this work, we have implemented the single-ended growing string method using a hybrid internal/Cartesian coordinate scheme within our in-house QM/MM package, QoMMMa, representing the first implementation of the growing string method in the QM/MM framework. The goal of the implementation was to facilitate generation of QM/MM reaction pathways with minimal user input, and also to improve the quality of the pathways generated as compared to the widely used adiabatic mapping approach. We have validated the algorithm against a reaction which has been studied extensively in previous computational investigations – the Claisen rearrangement catalysed by chorismate mutase. The nature of the transition state and the height of the barrier was predicted well using our algorithm, where more than 88% of the pathways generated were deemed to be of production quality. Directly compared to using adiabatic mapping, we found that while our QM/MM single-ended growing string method is slightly less efficient, it readily produces reaction pathways with fewer discontinuites and thus minimises the need for involved remapping of unsatisfactory energy profiles.
In studying these reaction pathways, it is often desirable to include in the model not only the substrates and the catalytic site amino acid residues, but also the wider protein and solvent environment. Hybrid quantum mechanics/molecular mechanics (QM/MM) is one of the commonly used methods that can satisfy this.10–14 However, by its very nature, a hybrid QM/MM model invokes a significant number of degrees of freedom, and thus finding the minimum energy pathway (MEP) between reactant and product is seldom straightforward.
In the QM/MM formalism, one may obtain the MEP as either a free energy surface by using dynamic techniques, or as a potential energy surface through the use of static techniques.15 While dynamic techniques such as umbrella sampling,16–18 free energy perturbation methods,19–22 and thermodynamic integration23–27 provide an attractively systematic approach to describing reactivity, they are also quite demanding in a QM/MM context, since they require sampling of the potential energy surface at the QM/MM level. Doing so adequately in the context of QM/MM simulations may become computationally unfeasible when using more accurate QM methods.23 While various techniques exist to minimise this computational expense, static techniques are attractive in that they do not require simulation at the QM/MM level of theory. Instead, starting from one or more initial structures, a MEP of discrete structures leading from reactants to products is obtained, and the potential energy barrier along this path is used to estimate the activation free energy for the reaction.28 This estimate of the free energy barrier is not exact, and, as will be discussed below, meaningful estimation of the barrier often requires obtaining multiple MEPs, somewhat restricting the relative computational efficiency compared to dynamic approaches. Nevertheless, these static approaches are valuable, and improved algorithms for identifying MEPs are hence desirable.
Broadly, two classes of static technique to find a MEP exist: double- and single-ended methodologies. Double-ended methodologies start from a pair of structures, often corresponding typically to reactant and product minima, and then iteratively connect them via a MEP.29 Single-ended methods only require a single starting structure, typically a reactant or product minimum. One also needs to provide some definition of the desired reaction coordinate.7,29
With a double-ended methodology, the reactant and product structures need to be configurationally coupled, i.e. the two input structures need to be as geometrically similar as possible given the required changes in chemical bonding occurring during the reaction. This can be viewed as an application of the principle of least motion.30 Rather problematically, due to the very large number of degrees of freedom, independently-generated reactant and product structures are very likely to differ in many ‘spectator’ degrees of freedom.31
On the other hand, careful application of single-ended methodologies has the advantage that the obtained MEP is more likely to connect two structures that are configurationally coupled in the sense discussed above. However, the large number of degrees of freedom to be optimised means that during optimisation of the MEP with single-ended methods, structural changes in ‘spectator’ degrees of freedom are frequently observed, leading to MEPs that include one or several discontinuities in the energy profile.7 One contributing factor to this behaviour is the definition of a reaction coordinate once and only once at the start of the pathway optimisation. In cases with conformationally flexible groups or solvent molecules near the reacting site, many degrees of freedom can couple strongly to the bonding change and accurately defining the reaction coordinate can be challenging, and less accurate reaction pathways may be obtained.7 Hence, there is scope for a single-ended QM/MM technique which improves upon the method for construction of a reaction coordinate, and this would be expected to improve MEP quality.
The growing string method (GSM) was first proposed in 2004 by Peters et al.32 as an improvement upon other double-ended techniques such as the nudged elastic band (NEB)33,34 and other string methods.35,36 This double-ended variant of GSM (DE-GSM) was developed further, and used in a number of computational investigations.37–40 However, the need for configurationally coupled reactant and product structures means that double-ended reaction pathway optimisation techniques are not always readily applicable to large systems. In 2015, Zimmerman developed a single-ended variant of GSM (SE-GSM).41 It is broadly similar to the DE-GSM approach, with both algorithms having a two-stage methodology in which the pathway is first ‘grown’ by adding discrete structures along the pathway, and then optimised with an additional reparameterization step between optimisation cycles.39,41 The method uses a delocalised internal coordinate (DLC) scheme, which allows for the reaction coordinate constraints to be expressed in a flexible way as a combination of all relevant internal coordinates.42–44
In this work, we describe what is, to the best of our knowledge, the first implementation of SE-GSM within the QM/MM framework. We use a hybrid coordinate system, where the QM region is described by DLCs, and the MM region by Cartesian coordinates.45 As mentioned above, the SE-GSM algorithm has two stages, where the initial reaction coordinate used during the growth phase is user-defined (this phase is in many ways similar to the adiabatic mapping method14), and the reaction coordinates used in the subsequent optimisation phase are allowed to vary along the pathway. This variable reaction coordinate should, in principle, have an improved propensity to obtain continuous MEPs.
As an initial test and validate of our implementation, we have selected a well-understood biomolecular reaction, the Claisen rearrangement catalysed by chorismate mutase (CM), where the substrate chorismate is converted to prephenate in a single step.46–49 The choice was based upon a number of factors, including ease of system preparation, wealth of computational investigations already performed, and system size. The mechanism of catalysis by CM does not involve any covalent bond formation between the substrate and the enzyme, as the enzyme is a member of the rare class of pericyclic catalysing enzymes.50,51 In bacterial systems, CM plays a part in the shikimate metabolic pathway, ultimately leading to the production of aromatic amino acids phenylalanine and tyrosine.52,53 The reaction mechanism is illustrated in Fig. 1.
Perhaps the most computationally investigated form of chorismate mutase is that from Bacillus subtilis (BsCM),54 presumably due to availability of high-resolution crystal structures containing transition state analogues (TSA). Consequently, BsCM was the enzyme used in this work. Extensive QM/MM studies have been carried out on this reaction using BsCM as the starting structure, so the TS relative energy and structure are well known,55–66 hence providing an excellent benchmark for our study.
In this paper, we describe a new implementation of the SE-GSM methodology within the QM/MM framework (in our in-house QM/MM code QoMMMa).67 We find that the dynamic definition of the reaction coordinate during pathway optimisation improves reaction pathway smoothness, at least for the case of BsCM. We also find that the SE-GSM approach provides an easier and more robust single-ended method for evaluating reaction pathways compared to the adiabatic mapping approach. We describe below the algorithmic details of this implementation, including the coordinate system used and the SE-GSM methodology, as well as the details of the computations on the BsCM test case, and evaluate the quality of the reaction pathways obtained in that case.
The process of generating and constraining a set of DLCs has been described previously,42–45 here only a brief outline is given. To generate a set of DLCs, primitive internal coordinates are first generated completely automatically. This set of nq primitive internal coordinates, q, fully describe the internal degrees of freedom of a system of n atoms and nx = 3n Cartesian coordinates. Our implementation allows for q either a complete set of interatomic distances (a ‘total connectivity’ scheme) or a covalently defined set of bonds, angles, and dihedral torsions. To describe the system accurately, nq must be at least equal to nx − 6, or nx − 5 for linear molecules. It may, however, be greater, and the precise choice made is not critical for the optimisation algorithm.42
In our implementation, q and the subsequently formed DLCs are only generated for the QM region – a previously validated partitioning scheme for QM/MM calculations.45 In an additive scheme, the QM and MM regions are independently optimised, but are polarised by one another via point charges. This independent optimisation means that treating the two regions with different coordinate types is straightforward. Any link atoms introduced by scission of a covalent bond across the QM-MM boundary can be seamlessly included in the generation of q and hence DLCs. Moreover, the microiterative scheme as implemented in QoMMMa where the entire MM region is optimised at each macrocyle67 is preserved whilst still optimising the QM region with DLCs, thereby improving performance at low computational expense.70,71 Optimisation steps involve updating initially the DLCs, which can be seamlessly converted to a step in q space. The resulting updated structures then need to be converted back to Cartesian coordinates, as will be discussed below. Coordinate interconversion steps make use of the Wilson B matrix (see eqn (1)), whose elements are structure-dependent.72
(1) |
Forming the DLCs uses the symmetric matrix, G, the product of B with its transpose (see eqn (2)).72 Diagonalisation of G (eqn (3)) yields nx − 6 non-zero eigenvalues, and corresponding eigenvectors, which form a sub-matrix U of P, and describe the non-redundant combinations of q. The remaining eigenvectors spanning the redundant space are discarded.
G = BBT | (2) |
G = P−1ΛP | (3) |
The DLCs, s, of which there are nx − 6, are formed by linear combinations of q as described by U as shown in eqn (4).
s = UTq | (4) |
This coordinate set, s, can be taken as the working set for any optimisation procedure. The gradient and (optionally) Hessian that are needed for optimisation in DLC coordinates are obtained as previously described.42 In our implementation, optimisation of the QM region can be carried out using either first-derivative based steepest descent (SD), or the second-derivative based Broyden–Fletcher–Goldfarb–Shanno (BFGS) scheme.73,74 A combination of a few steps of SD then switching to BFGS is currently found to work well, though implementation of other often more efficient algorithms74–77 is also possible. After each structure update for the QM region, the DLCs must be back-transformed to Cartesian coordinates, and problems in this step are avoided by limiting the magnitude of optimisation steps, in a manner that has been previously recommended and applied successfully.78
For reaction pathway optimisation, constraints are needed, and their introduction in a DLC-based algorithm is relatively straightforward. The constraint imposed can, in principle, be applied to any linear combination of several primitive internal coordinates,42,43 and can be imposed exactly. The constraint is expressed initially as a linear combination of primitive internal coordinates in terms of the unit length vector C, which is then is projected onto DLC subspace as shown in eqn (5).
(5) |
The orthogonal complement of the (nx − 6) vectors of the DLC transformation set, U, with respect to the normalised Cproj then form an nx − 7 space in which to perform optimisation. This exact projection approach is typically more accurate than alternative approaches based on applying geometrical restraints in terms of Cartesian coordinates.79
In the growth phase, the user defines an initial, guess reaction coordinate known as the primitive driving coordinate, qinit. qinit may be a combination of any number of primitive internal coordinates, and it does not need to describe the reaction perfectly. It should, however, capture the anticipated changes in chemical bonds, angles, and dihedral torsions. Unlike in usual single-ended methodologies, the range over which the reaction coordinate should be incremented is not required as input but handled internally. In order to generate a new node (i.e., to ‘grow’ the string), for each element of qinit, linear steps of 0.1 Å and 5° are taken for bonds and valence and/or dihedral angles, respectively. These steps may be either in the positive or negative direction depending on the exact definition of qinit. While the size of these steps is the default, we note that for cases where qinit contains many components, it may be desirable to reduce the magnitude of the overall step taken so that the structure of the newly generated node is not severely different from that of the previous one. For the system studied here, we have verified the magnitude of these steps by performing five pathway optimisations using half the step-size. We found that, in this particular case, the five pathways obtained were similar in shape and showed the same barrier height – see Fig. S7 (ESI†). However, we note that for cases where the TS is structurally very similar to either reactant or product, it may be preferable to take smaller steps.
In detail, the growth phase operates as follows: starting with primitive internal coordinate structure qi, structure qi+1 is formed by taking a linear displacement along qinit. Subsequently, qi+1 is converted to DLCs (forming si+1), constrained using eqn (5) (where, C = qinit), and QM/MM optimisation cycles are performed, where the convergence criteria used here are relatively ‘loose’ (see the ESI† for these criteria), and a maximum of 5 optimisation steps per node is taken. This cycle is repeated until the QM/MM energy of structure qi+1 is lower than the QM/MM energy of structure qi – hence a barrier has been surmounted. In this way, it is worth mentioning that this algorithm is limited to reactions with a measurable potential energy barrier. The growth phase is completed by first adding an equal number of nodes on the other side of the barrier and performing a constrained optimisation on each of these nodes, also following qinit. However, as postulated by Hammond,80 other than for thermoneutral reactions, barriers are not symmetric. Therefore, to ensure that the product has been reached, an unconstrained optimisation is performed on the final node in the string. In the following optimisation phase, this final node is fixed, meaning that the node density along the string may change in the following stages of the algorithm. In the description thus far, SE-GSM bears resemblance to adiabatic mapping, but the pathway obtained at this stage is much more approximate than with adiabatic mapping,14 and the pathway will subsequently be refined in the optimisation phase.
In the optimisation phase, qinit is discarded and, instead, the change in q for each pair of neighbouring nodes qi and qi+1 is taken and used to define the reaction coordinate, henceforth τi,i+1, as shown in eqn (6).
(6) |
Not every single element of q will be included in τi,i+1, but only those which exceed an internally-defined cut-off threshold going from structure qi to qi+1. By default, these cut-offs are defined as 0.05 Å and 2.5° for bonds and valence and/or dihedral angles, respectively. To minimise any possibility of kinks in the string, Henkelman's upwinding principle is obeyed, wherebyτi,i+1 is defined to point in the direction of increasing energy,81 as indicated in eqn (6). This definition of the reaction coordinate is dynamic, and captures aspects of the reaction coordinate which were not predicted initially in qinit. In this phase, no new nodes are added, but each node is constrained using eqn (5) (where now, C = τi,i+1), and optimised this time with usual convergence criteria (see the ESI† for these criteria) and now a maximum of 25 optimisation steps. Since the reaction coordinate is no longer equivalent at each node, the spacing of nodes will not be even, so a reparameterization step is taken. During this step, the overall difference in q between the reactant and (now fixed) product node is calculated, and each node is linearly spaced along this tangent. The cycle of optimisation phase and reparameterization step is repeated until the constrained QM/MM optimisation of every node on the string is converged, and SE-GSM algorithm is complete.
With the algorithm complete and a MEP obtained, the activation energy can be easily approximated. It is worth noting that, in its current implementation, the activation energy and hence the TS geometry are strictly-speaking an approximation as an eigenvector following TS search algorithm has not yet been implemented.40,45,82 This is something we bear in mind for future development, but the primary goal of this work was to improve the quality of reaction pathways generated.
Given the algorithmic similarity,32,39,40 we have also implemented DE-GSM into QoMMMa. However, due to the aforementioned reasons, it is not necessarily useful for biomolecular QM/MM reaction pathway calculations. It may, however, be useful in other QM/MM reaction pathway applications. QoMMMa has been made publicly available on GitHub with all that has been described in this work.‡
In preparation for the QM/MM calculations, water molecules further than 15 Å from the centre-of-mass of the chorismate co-factor were deleted giving a total of 6010 atoms. An energy minimisation (with an RMS gradient criterion of 0.001 kcal mol−1 Å−1) was performed on the resulting large cluster model using the same CHARMM36/CGenFF93,99 forcefield in the Tinker program package,100 with atoms > 10 Å from the centre-of-mass of the chorismate frozen. For all QM/MM calculations, the calculations for QM and MM regions of the system were performed with two different packages – Gaussian 09101 and Tinker,100 respectively. The QM and MM parts of the system are combined as a QM/MM calculation using our in-house package QoMMMa.67 The chorismate co-factor was chosen as the QM region, and the remaining protein and solvent was the MM region – resulting in no link atoms between the QM and MM regions, and a QM region size of 24 atoms. For the QM region, all calculations used B3LYP/6-31G(d)85–89 and Grimme's D3 dispersion correction with the BJ damping function.90,91 The MM region was described using the CHARMM36 forcefield.99 For each snapshot, an unconstrained optimisation using DLC for the QM region with usual convergence criteria was performed (see the ESI† for these criteria) was performed to ensure a meaningful starting structure for the pathway evaluations. The primitive driving coordinate, qinit selected was the difference between the forming bond and the breaking bond, i.e. qinit = d(C3–O7) − d(C1–C9) (illustrated in Fig. 1). To minimise any structural changes in distant MM atoms, and ultimately improve reaction pathway continuity, any atoms which were > 10 Å from the centre-of-mass of chorismate were frozen.7
To understand the accuracy and efficiency of the QM/MM SE-GSM versus adiabatic mapping, we have calculated five reaction pathways using adiabatic mapping and compared the number of gradient evaluations to SE-GSM reaction pathways which used the same starting geometry. Looking at the barrier heights, both techniques predict approximately the same barrier height, with average difference in barrier heights across the five pathways being 0.6 kcal mol−1. Turning to the question of efficiency, the average number of gradient calls required to reach a smooth pathway were 745 and 627 for SE-GSM and adiabatic mapping, respectively, so that the latter is slightly more efficient. However, SE-GSM is a more streamlined procedure as one does not need to manually define a reaction coordinate with appropriate range, but instead one provides a guess of the reaction coordinate components. Also, the five pathways generated from SE-GSM were already smooth and therefore meaningful, whereas two of the pathways generated using adiabatic mapping were initially not continuous and required refining. This is a common experience with adiabatic mapping studies of reaction pathways and may be due to the fact that the constraint is applied along the same pre-assumed reaction coordinate along the whole reaction pathway. In any case, discontinuities in QM/MM energy profiles are quite common when using adiabatic mapping, so that to achieve a smooth profile, one needs to perform quite tedious reverse mapping.7 Hence, even though the total number of gradient calls required to generate smooth pathways was slightly lower with adiabatic mapping, the ease of generating smooth pathways with SE-GSM is a considerable benefit. In general, we attribute the quality of pathways generated by SE-GSM to the relatively ‘loose’ convergence criteria in the growth phase (see the ESI† for these criteria). This means that discontinuities in the QM energy profile are less likely to occur as compared to adiabatic mapping. Moreover, we expect that any discontinuities in the QM energy profile which emerge following the growth phase should be smoothed-out by the flexibility of the optimisation phase reaction coordinate. The primary limitation in generating quality pathways lies in the choice of qinit, as it must represent the main changes in chemical bonding for the reaction under study – something which is not always trivial to understand. However, the selection of an appropriate qinit can in principle be done automatically.102–104
We have used two commonly used methods to evaluate the average barrier height across all 117 snapshots: the arithmetic average of all barrier heights 16.8 kcal mol−1, while exponential averaging yields a value of 12.0 kcal mol−1. The exponential averaging technique has been shown to be preferable for approximating the free energy of the barrier, as lower energy barriers contribute more to the average barrier height,28 hence our discussion below will focus on exponential averaging (even though this does not yield a rigorous estimate of the activation free energy). The calculated barrier height of 12.0 kcal mol−1 calculated herein agrees well with both the average of previous DFT-based computational investigations of 12.7 kcal mol−1, and with the experimental barrier of 12.7 kcal mol−1 – as summarised in Table S1 (ESI†) and the accompanying text. The key bond lengths (including their standard deviations) for the reactant, TS, and product are summarised in Fig. 2.
The predicted TS structures are rather similar to those reported previously, e.g. q1 = 2.14 ± 0.05 Å and q2 = 2.64 ± 0.05 Å from ref. 64 or q1 = 2.28 Å and q2 = 2.58 Å from ref. 61. Given this favourable agreement, and the rather low standard deviations of the key bond lengths around the TS, we have some confidence that the QM/MM SE-GSM with hybrid DLCs approximates the TS well for this reaction. With reference to Fig. 2, one may note that the standard deviation of q2 for the reactant structures is larger than that of any other primitive internal coordinate, indicating that some reactant structures are considerably ‘closer’ to the TS than others, reflecting also the unbiased nature of the sampling of reactant structure from the MD trajectory. Indeed, the variance of the obtained barrier heights and reactant structures is greater than in previous investigations of this system, which typically involved some degree of manual selection of initial structures.
For all pathways, we have also calculated the so-called near-attack conformation (‘NAC’) angle105 for the reactant geometry and plotted this against the barrier height – see Fig. S2 (ESI†). The activation energy has been suggested to be directly dependent on this ‘NAC’ angle, where a lower ‘NAC’ angle suggests that the reactant geometry is predisposed to form the TS, thus lowering the barrier height. In this work, we note some (albeit quite small) correlation between the ’NAC’ angle and the barrier height, where a lower angle gives a lower barrier height. Since all the reactant conformations used in this work were in the pseudodiaxial form of chorismate, and therefore all more-or-less a ’NAC’, seeing only a minor correlation is not unexpected. An alternative theory argues that the primary effect in lowering the barrier height is TS stabilisation by the enzyme environment via electrostatic preorganisation.60–62,106 For the present purposes, whether the ‘NAC’ or electrostatic preorganisation theory is the main barrier lowering effect is perhaps less important given that the structural features leading to TS stabilisation are also expected to stabilise ‘NAC's and hence lead to their observation.
When compared to adiabatic mapping, our implementation of SE-GSM offers a dynamic definition of reaction coordinate during the optimisation phase, and the improvement over adiabatic mapping can largely be ascribed to this. This benefit is less important in a simple case study as described herein, but for calculations on more complex biochemical systems, this may be critical to achieve a smooth MEP. The effect of the dynamic reaction coordinate can be clearly seen in the components of the reaction coordinate as the optimisation phase proceeded, where a number of additional interatomic distances emerged as important. These are summarised alongside the original primitive internal coordinates within qinit in Table 1.
Internal coordinate | Definition | Distance (Å) | ||
---|---|---|---|---|
Reactant | TS | Product | ||
q 1 | d(C3–O7) | 1.49 ± 0.01 | 2.22 ± 0.09 | 3.63 ± 0.08 |
q 2 | d(C1–C9) | 3.54 ± 0.12 | 2.62 ± 0.06 | 1.60 ± 0.00 |
q 3 | d(C1–C11) | 1.35 ± 0.00 | 1.40 ± 0.01 | 1.51 ± 0.00 |
q 4 | d(C2–C3) | 1.50 ± 0.01 | 1.38 ± 0.01 | 1.34 ± 0.00 |
q 5 | d(O7–C8) | 1.39 ± 0.01 | 1.28 ± 0.01 | 1.22 ± 0.00 |
q 6 | d(C8–C9) | 1.34 ± 0.00 | 1.39 ± 0.01 | 1.53 ± 0.01 |
From Table 1 it can be seen that not only do the components of the initial driving coordinate play a part in the reaction, but so do some other relevant distances, albeit to a lesser extent. In this way, we have included other reaction coordinate components entirely automatically. The components of the initial primitive driving coordinate (q1 and q2) continue to have the largest weight in the reaction coordinate, but this is to be expected given the relatively simple nature of this reaction. However, for more complex biochemical reactions in which one is unsure of which internal coordinates aid in reaction catalysis, their contribution could be captured during pathway optimisation. There are no discontinuities in q1 and q2, and the same is true for all other interatomic distances shown in Table 1 (see Fig. S3 and S5, ESI†). We have also plotted the variation of q1 and q2 for four sample pathways, highlighting the smoothness of the reaction coordinate for individual pathways – see Fig. S4 (ESI†). Given that this was generated only from an initial guess of the driving coordinate, not a specific constrained range nor any biased selection of MD snapshots, the result is quite pleasing. Besides these distances, a large number or valence and dihedral angles also played a part in the reaction coordinate. Regardless of the reaction, this will, always happen as there are many valence and dihedral angles dependent on the position of a given atom. Due to their large number, we do not report all of them here, but we have plotted the two known important dihedral angles found in this reaction (see Fig. S6, ESI†), aligning well with previous investigations.64
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3cp05772k |
‡ The QM/MM program described herein, QoMMMa, has been made publicly available on GitHub: https://github.com/NeilMcFarlane-gh/QoMMMa. |
This journal is © the Owner Societies 2024 |