Anjana
Puliyanda
,
Arul Mozhi
Devan Padmanathan
,
Samir H.
Mushrif
and
Vinay
Prasad
*
Department of Chemical and Materials Engineering, 9211 116 Street NW, Edmonton, Alberta T6G 1H9, Canada. E-mail: vprasad@ualberta.ca
First published on 17th April 2024
Configuration changes in the solvent or melt-phase (condensed phase reactions) molecules impact reaction thermodynamics and kinetics, making it vital to assess if solvent/melt-phase molecules need to be considered explicitly in first principles-based reactive molecular simulations. A basis for these configuration changes is established using MD simulations of melt-phase cellobiose decomposition. A 3d CNN autoencoder is trained to extract spatio-temporal features from coordinates of atomic positions in the MD trajectories of cellobiose decomposition. The differences between the encoded reactant and product features were fit to probability distributions, where larger configuration changes were found to be more probable at lower temperatures. The machine learning model then predicts changes in solvent orientation by using a distance-based classifier to assess the closeness between encoded features from reactant trajectories of cellobiose systems with larger configuration changes and those from the following systems: (i) fructose protonation in water–DMSO and, (ii) glucose isomerization via hydride transfer in water and methanol. The extent of solvent configuration changes in the fructose systems was predicted to increase with DMSO concentrations and was validated using trends in the difference between reaction free energies. For glucose isomerization, configurational changes in pure methanol were predicted to be higher than that in water, consistent with the high polarizability of methanol due to which the reaction free energy barrier is ∼50 kJ mol−1 higher than that in water. This work demonstrates a machine learning framework that has the potential to limit the computational cost and accelerate the deployment of molecular simulations in screening solvents for reactive chemical transformations.
In heterogeneous catalysis, solvent effects arise from competitive adsorption of solvents and reactants, changes in internal diffusion of porous catalysts, and entropic effects due to confinement or increasing catalyst stability/durability.2 Similarly, in the condensed phase with homogeneous catalysis, various chemical mechanisms are involved in changing the activation barrier by affecting the relative stabilization of the reactant and/or TS. The specific mechanism involved depends both on the type of solvents and the reaction. For example, the cracking of polystyrene between 350 °C and 400 °C depends heavily on the hydrogen donating ability of the solvent.5 Protic solvents inhibit chain-end scission while phenols promote random scission of polystyrene. Such solvent effects are also measured in thermochemical melting of cellulose6 above 450–500 K.7 High speed photography of cellulose pyrolysis on a catalytic surface captured a short-lived liquid intermediate before complete volatilization.8 In this “liquid cellulose”, cellulose is both the solvent and reactant–solute. Depending on the thermal expansion of the cellulose lattice (solvent), different mechanisms are promoted, as measured using millisecond-scale reaction kinetics,6,9 and supported by the calculated free energy barrier.10 It has been suggested that cellulose decomposes via chain-end scission at low temperatures and random scissions at high temperatures. Similarly, the cellulose lattice can also be penetrated by solvent induced chemical effects that could either be static or dynamic. Solvent statics refers to the de-/stabilization effect on reacting species along reaction coordinates. The free energy of activation is affected, and the reacting species (activated complex) are in equilibrium with the solvent. Here, either the reactant activation does not induce solvent reorganization or reactant activation involves conformational change and is slower than the induced solvent relaxation.11 On the other hand, solvent dynamics refers to the effects due to time-dependent solvent behavior including structural changes that occur in the solvent and the surrounding environment and also non-equilibrium solvation for reactions in a strong dipole or relatively slow relaxing solvent environment.12–14 Reactant activation has significant redistribution of charges and is faster than induced solvent relaxation.2 Glucose isomerization to fructose was more reactive in water compared to methanol because of the slow reorganization of larger and more polarizable methanol molecules during the rate determining hydride transfer step.15 Furthermore, acid catalyzed conversion of fructose to HMF involves various solvent (water) mediated hydride transfer and proton transfer steps. The microkinetic model was able to reproduce experimental kinetics only when the Marcus reorganization energy was included.16
Traditionally, macroscopic approaches have been used to study these solvent effects measured using their impact on reaction rates,17,18 reaction pathways, product selectivity,19–23 and catalyst durability.2,11 Parameters such as viscosity, dielectric constant,24,25 hydrogen (H) bond donor ability, H bond acceptor ability, polarity, polarizability, dipole moment, Lewis acidity and basicity4,26 are regressed against measured effects on reaction rates, product selectivity, or rate constants.27 However, microscopic parameters such as water enrichment in local domains, average H bond lifetime and fraction of the reaction surface occupied by functional groups also play a significant role in altering the relative stability of the reactant, TS, and products.27,28 This favors molecular simulations to derive kinetic and thermodynamic insights into solvent effects that may not be fully explained by macroscopic empirical parametrization that is limited by costly experiments. However, these simulations involve obtaining the energy and forces of atomic configurations achieved by computationally expensive biasing techniques and quantum mechanical (QM) calculations, making them intractable for systems with large numbers of atoms and longer time scales.29 This can be overcome either by coarse-graining atomic calculation methods using empiricism-based classical molecular mechanics, or by using machine learning (ML) to accelerate QM based (ab initio molecular dynamics (AIMD)) simulations.30 The acceleration of QM simulations encompasses: (a) ML for estimating the potential energy surface (PES) based on atomic coordinates, which speeds up MD simulations of chemical structures,31 (b) predictive ML using AIMD simulation data to develop computationally efficient property prediction models of molecular systems,32 and, (c) generative ML using AIMD data to learn probability distributions of molecular representations, given the macroscopic properties for advancing computer-aided molecular design,33 though it is still in its infancy.34 Hence, we proceed to discuss studies from the first two classes of ML models with respect to solvent effects and contextualize this paper and its novelty.
ML to accelerate QM calculations uses neural networks and regression techniques to extract features from the PES before fitting them to predict energy (machine learning potentials (MLPs)) or force fields (machine learning force fields (MLFFs)).32 MLPs as computationally tractable surrogates for ab initio calculations using neural network architectures to capture atomic interactions in Cartesian space via convolutions (SchNet) and physical symmetries via local frame coordinates (DeepPMD) are seen to speed up catalyst screening by efficiently computing reaction activation energies35 and modeling solid–liquid interfaces36 in heterogeneous catalysis. DeepPMD has demonstrated how ML can scale the accuracy of ab initio calculations in surface chemistry, from a system with 1000 atoms to that with 100 million atoms. This is achieved by expressing the total potential energy as a sum in parts of that of the local atomic environments using their extracted symmetrical spatio-temporal features, to assess whether or not solvent molecules dissociate at the interface.37 The high dimensionality of the data, typically 3N atomic coordinates for a system with N atoms, not only limits the use of ML to sample from QM simulations to construct the PES, but also causes the computational cost of reference data from density functional theory (DFT) calculations to scale cubically as the number of electrons in the system.38 This has been surmounted by using time-lagged autoencoders to learn low dimensional manifolds (collective variables) that reproduce the conformational dynamics by maximizing time-lagged autocorrelation within the original space.39
Deploying ML to derive interpretable insights from AIMD simulations by predicting the mechanism, rate and yield of chemical systems as functions of reaction thermodynamic properties has been recognized as one of the six grand challenges of the 21st century.40 Preserving physico-chemical intuition by supplying physically meaningful data representations such as molecular fingerprints or local environment descriptors29 facilitates the ML model to recognize meaningful correlations between the system properties and the features extracted from data. ML regression models trained on fingerprints extracted from MD simulations are shown to predict solvation free energy and partition coefficients that have been experimentally validated.41 However, such frameworks perform poorly when they encounter an atomic configuration not present in the training data. Hence, adaptive ML regression frameworks that query new configurations to retrain ML models on the fly have been shown to result in more reliable predictions.42 Aside from regression models, ML classifiers have been trained to extract features from AIMD data of the decomposition of dioxetane that correspond to either successful or frustrated dissociations.43 The emphasis on retaining physical intuition such as symmetry or translation invariance of atomic configurations in ML models has popularized the use of convolutional neural networks (CNNs) that capture spatio-temporal patterns via parameter sharing, thereby making predictive ML models more efficient.44 It can be seen that the density of water molecules stacked over time (3d grids), or the time-averaged density maps of water (2d grids) when fed into 3d CNNs and 2d CNNs respectively, efficiently capture spatio-temporal variation in water density while predicting the hydration free energy that rationalizes interfacial hydrophobicity in protein folding.45 However, catalytic biomass conversion involves polar aprotic cosolvents in addition to water, the volume ratio of which impacts the rate and yield of the reaction, that is regressed against voxel representations (density of water, the cosolvent and the reactant in discrete volume elements of the simulation box, across 3 separate channels) input to a CNN from MD simulation data to facilitate rapid high throughput screening via a ML surrogate.46 This pre-trained model (SolventNet) has also been used to predict reaction rates in mixed-solvent environments, with just 4 ns of classical MD (force-field-based) trajectories, thereby speeding up the screening of solvent compositions.47
The focus of this manuscript is not to accelerate AIMD directly, but instead to use a predictive ML model to inform simulations about the possible solvent reorganization/reorientation during reactive events. The computational cost involved in generating training data from MD simulations with their associated labels, and the cost of training a ML model itself, is projected to be significantly lower than having to explicitly perform uninformed first principles-based MD simulations on newer systems. Generating labels/targets to train predictive ML models such as reaction rates46 and dissociation time48 involves experiments or indirect sampling calculations from simulation data. In this work, we seek to overcome the cost of assigning labels using experiments or sampling calculations by proposing a self-supervised 3d CNN autoencoder to extract spatio-temporal features of trajectories of solvent molecules around the reactant and product that are supplied from multiple classical/AIMD simulations. The model can screen several new solvent systems by assessing the extent to which the solvent molecules may reorient by using only reactant trajectory simulations. Finally, should first principles simulation be performed on the product profiles of a system found to have lower solvent configuration changes, one could eliminate simulating explicit dynamics of solvent molecules. Reactant/product trajectories were generated for three case studies covering the direct effect of static and dynamic solvent reorganization in cellobiose and in water-DMSO systems, respectively, in addition to the indirect effect of solvent on catalyst–reactant interaction:
1. Glycosidic bond cleavage in ‘liquid cellulose’ during pyrolysis: cellulose kinetics are affected10,49 by the thermal change in the condensed phase.7 The equilibrium trajectories of the cellobiose high temperature melt (solvent) along with the reactant/product are generated using classical MD simulations, in combination with the thermodynamic integration method.10 These are equilibrium solvent trajectories and the corresponding activation barriers are calculated. Therefore, this is an example of solvent statics where the thermal change in hydrogen bonding has a destabilizing effect on the reactant.
2. Glucose to fructose isomerization in water/methanol: hydride transfer involving large asymmetric redistribution of charges is the rate-limiting step of this Lewis/Brønsted acid catalyzed reaction15 comprising non-equilibrium charge transfer, due to which solvent dynamics significantly impact reaction kinetics. Since the solvent is polarizable and the trajectories are generated using Car Parinello MD with metadynamics implementation, the slower solvent relaxation dynamics is captured and the TS ends up being in non-equilibrium solvation.
3. Dehydration of fructose to HMF in water/DMSO: conversion of sugars (glucose and fructose) to furanic compounds (HMF) is widely studied in condensed phase biomass processing.50,51 The solvent alters the interaction between the proton (catalyst) and fructose/HMF (reactant). Reactant/product trajectories are generated using classical MD simulations with metadynamics implementation. In this example the solvent dynamics are captured and the free energy landscape is evaluated as the proton moves towards and away from fructose/HMF.
This paper presents a self-supervised machine learning model by way of using a 3d convolutional neural network (CNN) autoencoder for spatio-temporal feature extraction from the molecular simulation trajectories of the configuration of solvent arrangement around reactant/product systems, the difference between the root mean square deviation (RMSD) of which is fit to a probability distribution via kernel density estimation (KDE) to assess solvent configuration changes. This subsequently informs the development of a Mahalanobis distance-based classifier to predict the extent of solvent reorganization in newer systems by assessing the distance of its reactant features from the distribution of those encoded in systems where the reorganization extent has already been quantified. The rationale behind the choice of this distance-based classifier was to avoid the training costs, as with neural network-based classifiers.52 This framework has the potential to reduce the cost of MD simulations and that of training ML models by predicting the extent of solvent configuration changes, as a basis to inform the consideration of solvent molecules explicitly or not, while simulating the final product configuration.
Fig. 1 The atomic coordinates from MD simulation data are, (a) spatially represented as voxels at each time step, and (b) spatio-temporally represented as time-stacked voxels. |
The x–y–z atomic positions of cellobiose with respect to the simulation box of dimensions Lx × Ly × Lz is represented as a probability density distribution of the atoms existing in a certain discrete volume element of dimensions , where Nx, Ny, and Nz (all chosen to be 20 in the present work) are the number of grid elements that the simulation box is discretized into along each axis, as illustrated in Fig. 1(a). Since the voxels are discrete representations of point clouds of molecules in the simulation box, each of them must be of a size not smaller than the order of magnitude of the atomic radius, to avoid violating the continuum assumption and counterproductively increasing the computational costs of training a machine learning model. At the same time, choosing a voxel size that is larger than the average length-scale of molecular orientation changes for a given solute–solvent system may lead to the desired phenomena not being captured when training the machine learning model. Additionally, the size of the voxel is a hyperparameter that impacts how coarse-grained or fine-grained the data presented to the machine learning model would be, and therein impacts the network architecture of the 3d-CNN autoencoder. The voxel size can be varied independently of the solute–solvent system, with the aforementioned criteria in mind. Each simulation of the reactant and product configurations for the transglycosylation of cellobiose at four different temperatures (100 K, 500 K, 900 K, and 1200 K) has been performed over 8 ns using GROMACS,57 and the coordinate positions have been recorded every 1 ps. The positional voxel density representations of the atomic coordinates are stacked across NT = (100 ps) simulation time steps as shown in Fig. 1(b), to generate T = 80 spatiotemporal tensor samples , from the simulation trajectory of either the reactant or product, for a given system. The molecular reorientation in time is captured by the spatio-temporal convolution across time-stacked voxels, without the need to rearrange system coordinates or fix the center of mass as a reference.
For the total number of N (=2T × number of systems modeled) input tensor samples from both the reactant and product trajectories of all systems, X(i) for i = {1, 2,⋯N}, a 3d CNN autoencoder is trained as a hierarchical model that uses a sequence of convolutional, activation, pooling, flattening and fully connected layers in the encoder (E), before symmetrically unrolling the sequence in the decoder (D) to reconstruct the input as (i) = fD(fE(X(i),θE),θD). The parameters of the encoder and decoder functions (θ = {θE, θD}) of the self-supervised autoencoder network are learned by minimizing the following loss function:
(1) |
The number of parameters in the 3d CNN autoencoder scales with the choice of hyperparameters that govern the network architecture in the hierarchy of operations. The convolutional operation given by comprises a 3d kernel of dimensions (ncx × ncy × ncz) that performs convolutions across a stride of s voxels in each dimension over the NT time slices to produce q feature maps in the output , as given by the following equation, where x1 ∈ {1, 2,…, nϕx}, x2 ∈ {1, 2,…, nϕy}, x3 ∈ {1, 2,…, nϕz}, j ∈ {1, 2⋯, q} and is the bias term.
(2) |
(3) |
The dimensions of the output (nϕ) across any specific axis are impacted by the respective kernel dimension (nu), stride (s) and padding (P), if any, given an input of size N, as indicated by eqn (3). The purpose of padding is to preserve the input dimensions in the convolved output.58 However, since the convolutional operation is used to down sample the inputs for feature extraction, zero padding has been used in this work. The convolved features are then passed through a nonlinear activation function that does not modify the dimensions.
f(ϕ) = max (0,ϕ) | (4) |
v′ = f(Wfcv + bfc) | (5) |
As compared to activation functions such as the tanh and sigmoid, the rectified linear unit, given by eqn (4), is preferred as it does not suffer from gradient saturation in the event of large magnitude inputs, thereby increasing the sensitivity of the model to input representations.59 Following this, the pooling operation is used to down sample the activated output, to make the encoded representations invariant to minor translations in the input,60 resulting in a pooled output . This follows the same lines as eqn (2) and (3), except that there is no bias translation and the 3d max pooling kernel of dimensions (npx × npy × npz) merely outputs a maximum valued scalar as it strides over s voxels at a time along the axes, for all the input feature maps. Several units comprising the aforementioned convolutional, activation and pooling operations can be hierarchically stacked to transform the input sample X(i) into a tensor , of p feature maps, before finally flattening it to result in a vector that is fed into a fully connected layer to result in an output feature vector , given in eqn (5), where and are the weights and biases, parametrizing the fully connected layer, respectively. There can be many such fully connected layers as indicated by the schematic in Fig. 2, to finally obtain f′ latent features in the bottleneck layer of the encoder. The structure of the decoder is seen to mirror that of the encoder in reconstructing the input from the features of the bottleneck layer via a series of upsampling operations such as deconvolution and unpooling. If theconvolutional operation is expressed as the multiplication of the Toeplitz block of strided kernel coefficients with the input, then the deconvolution can be expressed as its inverse, where upsampling is achieved by multiplication with the transpose Toeplitz block.61 Similarly, unpooling is performed by inserting the maximum values into their index positions, cached during the pooling operation.
Once trained, the bottleneck layer of the encoder is used to extract latent features from the MD trajectories of the reactant that are plugged into the quadratic distance-based classifier, to predict whether or not the configurations of solvent molecules change significantly in the product profile. This is based on the key assumption that samples with the same label should have similar latent features extracted by the autoencoder.62 However, developing a classifier to discriminate between latent features is supervised, in that there is a requirement for ground truth labels, the generation of which is expensive and time consuming.63 This is surmounted by calculating the root mean square deviation between features of the product and reactant trajectory for a system, followed by using KDE to probabilistically assess systems with a higher extent of reorganization using a threshold, based on which labels are assigned to the features extracted from the reactant trajectory samples to develop the Mahalanobis classifier, a choice deliberately made to also eliminate the cost of training neural network classifiers.52
(6) |
(7) |
(8) |
For a particular system, the , pointing to the deviation of features of the product trajectory samples from the time average of features across the reactant trajectory samples, is given in eqn (6). The probability distribution of x ∈ [min (RMSD), max (RMSD)], for each of the systems is fit using kernel density estimation (eqn (7)), where the choice of the kernel function (K) and bandwidth (b) are guided by grid search optimization.64 The systems with the least and highest probabilistic mode of RMSD from the distributions are designated labels, y = {0, 1}, corresponding to a large and small extent of solvent reorganization, respectively. The posterior probability of the other systems being labelled 1, given their RMSDs and the assumption of equally likely priors, is determined using eqn (8).65 These encoded reactant and product trajectories in question are calculated for the cellobiose system, which is also the system on which the 3d CNN autoencoder is trained to learn the feature encodings. Hence, there are no assigned labels for the cellobiose systems until we proceed to look at the distribution describing the RMSD (eqn (7)), based on which the posterior probabilities of label assignment are computed using eqn (8). If the posterior probability of the cellobiose system exceeds a certain threshold, all the features corresponding to the samples in the reaction trajectory are designated a label 1, else 0, giving rise to labeled samples {fE(X(i)reactant),y(i)} for all i ∈ {1, 2,⋯N} supplied as reference data to a quadratic classifier based on the Mahalanobis distance. The classifier is trained to detect solvent reorganization patterns from a reduced set of encoded features of the reactant trajectories in a kernel space for a new system, by assessing their closeness to positively labelled trajectories of the cellobiose system, using the mean and covariances of their kernelized features as follows:
(9) |
The loss function of the 3d CNN autoencoder is minimized by stochastic gradient descent, implemented using the Adam optimizer on PyTorch with a learning rate of 10−3. The process of gradient descent involves computing the gradient of the loss function with respect to the weights of the layers and is efficiently performed via the backpropagation algorithm. The distance-based classifier is then implemented to assess the extent of solvent reorganization in newer systems by measuring the distance between their features and the distribution of features extracted from the systems labelled as 1 (where larger RMSDs are more probable).
The saliency of the highest magnitude latent feature obtained by using the encoder to project the time aggregate of the data voxels from the reactant trajectory for a given system is calculated from eqn (10). The closeness of the encoded features of the reactant trajectory of a new system, to that of the low temperature cellobiose systems, is used as a basis to assess solvent configuration changes in this manuscript. Hence, saliency maps are an important tool to validate if the latent features of the reactant trajectory for the cellobiose systems duly sensitize the region of the simulation box containing the solvent molecules.
(10) |
Fig. 3 Glycosidic bond cleavage in the pyrolytic decomposition of cellobiose.66 |
The voxels of atomic densities stacked over every 100 ps result in 80 samples each, from the reactant and product profiles, at each finite temperature simulation of cellobiose. This leads to a total of 640 temporal voxels of data samples across all 4 temperatures that are used to train the 3d CNN, with 80% used for training while the remaining 20% is used as a validation set for early stopping. A 3d CNN autoencoder with a structure as given in Fig. 2 is trained on these samples to extract encoded feature representations in the bottleneck layer. The root mean square deviation (RMSD) between these features of the samples in the product profiles and the average of the encoded features across all samples in the reactant profiles are indicated in Fig. 4(a)–(d), across the four temperatures. Since the encoded features are extracted from equilibrium simulations of the cellobiose systems, it is permissible to average the encoded features across all samples in the reactant trajectory, as a reference against which the encoded features of the product trajectory are compared, when defining the RMSD. The use of RMSD as a descriptor of the extent of solvent reorganization circumvents the need for expensive sampling strategies45 to compute metrics from the simulation data. It can be seen from Fig. 4(c) and (d) that solvent configuration changes are less prominent at higher temperatures. A kernel density estimation is used to quantify the probability distributions of the RMSDs in accordance with eqn (7) using a Gaussian kernel (K). The bandwidth chosen by grid search cross validation is found to be optimal at 0.1 and results in cumulative density distributions given in Fig. 4(e), from which the cellobiose 100 K system and the cellobiose 900 K system are seen to have the most (class 1) and least probable (class 0) solvent configuration changes, respectively. The density distributions of the cellobiose 100 K and 900 K systems are then used to quantify the posterior probability of a system being recognized as significantly reorganized, conditioned on its RMSD, as outlined in eqn (8). The average of the posterior probabilities across all the cellobiose systems is then used as a threshold as shown in Fig. 5, to recognize if significant solvent reorganization has been observed in a system or not. This threshold is used to assign labels only to the cellobiose systems, on the basis of which solvent configuration changes in newer test systems will later be assessed.
A probability distribution is fit to the RMSD fluctuations at each temperature shown in Fig. 4(a)–(d) using kernel density estimation (KDE). A kernel can be understood as a function describing a smooth symmetric curve characterized by a shape factor, for instance, a Gaussian bell curve is a kernel, and its shape factor is the standard deviation that characterizes the spread of the probability distribution that the curve is an estimate of. In KDE, each data point is modeled using a kernel, and the contribution to the estimated probability density at a certain RMSD is the weighted contribution from the kernels fit to all datapoints within a window of that specific RMSD value. Hence, KDE helps in estimating the probability distribution of data when the true underlying distribution is unknown. The cumulative density function from the probability distributions fit to the RMSDs is shown in Fig. 4(e). The solvent configuration changes are then quantified on the basis of the RMSD distributions. The posterior probability of a system with large reorganization calculated from the RMSD distribution is presented in Fig. 5. At 100 K and 500 K, there is a higher probability for temperature-induced orientation of the condensed cellobiose molecules (i.e. the melt-phase analogous to a solvent). In the earlier MD work published by the authors,10 free energy barriers for glycosidic bond cleavage in cellobiose were calculated. These reaction barriers include the enthalpic activation of the condensed phase owing to static solvent reorganization and the finite temperature entropic contributions. The trend of the probability for solvent reorganization calculated here using the 3d convolutional neural network autoencoder seems to be consistent with trends in the Gibbs free energy barrier for the transglycosylation mechanism of the reacting cellobiose molecules in the melt phase across the four different temperatures, as given by Fig. 5. The free energy barrier (FEB) decreases almost linearly with increasing temperatures and asymptotes above 900 K, at a constant value of ∼105 kJ mol−1. The reduction in the FEB going from 100 K to 900 K is 267.76 kJ mol−1 and suggests a strong impact of the temperature of the cellobiose melt environment on the glycosidic bond cleavage. The slope and the y-intercept of the free energy vs. temperature plot give the entropic and enthalpic contributions, respectively. The constant slope of the FEB curve at low temperature is indicative of the constant gain in entropy for the decomposition of the cellobiose melt to LGA. At higher temperatures the FEB flattens indicating that the entropic contribution to the barrier is zero, making it an enthalpy-controlled regime. Hence, the linear slopes of the low temperature and the high temperature curves form distinct decomposition regimes, also measured using millisecond scale kinetics experiments.71 The temperature-induced conformational changes in the molecules going from the crystal to the melt-phase influence cellulose chemistry. At low temperatures, cellulose exists in its crystalline state with an ordered intermolecular H bonding network which makes it energy intensive for the twisting and breaking of cellulose chains. At higher temperatures, the cellulose matrix expands taking a low-density state and disrupts the ordered H bonding network. As seen in Fig. 5, the free energy barrier is high which decreases with temperatures to eventually settle above 900 K. This is in coherence with the condensed phase reorganization probability predictions. At 100 K/500 K, the probability for reorganization is high as the neighboring molecules are close by and they would have to reorient breaking other H bonds to accommodate the reaction, whereas, at 900 K/1200 K, the temperature-induced shift in crystallinity pushes the neighboring molecules away cleaving H bonds and reducing the extent of their involvement in the reaction. Therefore, the free-energy barrier calculated using molecular modelling, supports and validates the qualitative trends in the RMSD probability distributions when it comes to using ML to decipher the extents of melt configuration changes in the decomposition of cellobiose across different temperatures (Fig. 5), from which it can be concluded that the barrier for glycosidic cleavage in the condensed phase is heavily influenced by the reorganization energy of the solvent.
The ratio of probabilities of an observed RMSD (change in molecular configurations in the product trajectory from the average of the configuration changes in the reactant profiles), evaluated from the kernel density estimates of the two cellobiose systems at the extremes of melt configuration changes (100 K and 900 K), is called the posterior (eqn (8)). It quantifies the probability of a cellobiose system having a similar extent of solvent/condensed phase reorganization in the product (after the conformational changes have equilibrated) as compared to the reactant, at a temperature of 100 K. Its trend is similar to the activation free energy barrier with temperature, indicating that the reorganization of the melt-phase from the reactant to the product governs the reaction kinetics. At lower temperatures, the cellobiose melt is more ordered and interacts with the reactant cellobiose via H bonds and hence reorients largely to accommodate the structural changes when the reactant goes to its TS. However, at high temperatures the condensed phase matrix is already broken and its density is lower, because of which the melt-phase molecules need not reorient significantly between the reactant and its TS. Hence, prominent reorganization of solvent molecules at lower temperature results in larger entropy differences, as compared to higher temperatures.
Thus far, the predictions from the ML model have been rationalized by the physics of the reactive systems. However, it is difficult to physically interpret the bottleneck features of the 3d CNN autoencoder that inherently capture signatures mapping to solvent configuration changes. Explainability in CNNs that capture spatial information has been demonstrated using saliency maps that comprise gradient information of the loss function with respect to the input data and hence hold information about regions of the input space that are sensitized while minimizing a certain loss, in making target predictions for supervised machine learning tasks.72 Since we developed a 3d-CNN autoencoder as a self-supervised ML model in this work, the saliency maps have been analyzed as gradients of the strongest activated neuron in the bottleneck layer with respect to the averaged input data voxels. Inspection of the time-averaged saliency maps reveals that the 3d-CNN autoencoder is sensitized by a locally centered domain of the simulation voxel (Fig. 6(a)) of the trajectories of the solvent molecules around the reactant cellobiose, whose 2d contours are also shown in Fig. 6(b). The activated intensities of the 3d-CNN in the central region of the simulation box do not overlap with the reactant cellobiose (Fig. 6(c) and (d)), but coincide with the regions occupied by the condensed phase (Fig. 6(e) and (f)). It can be deduced that the 3d-CNN autoencoder is sensitized by the melt-phase configuration changes in the bulk, rather than in the vicinity of the reactant because the saliency maps corresponding to the strongly activated feature in the bottleneck layer that is used as a basis for distance-based classification are seen to build on the spatio-temporal features extracted from the Cartesian coordinates of atoms in the condensed melt-phase (solvent). Ultimately, the reorganization of the bulk melt-phase impacts the temperature-induced conformational changes in the reactant cellobiose, and it has been validated that the ML model captures the same by means of the feature saliency maps.
Simulations for hydride transfer in glucose isomerization were performed using two different solvents viz., water and methanol and have been recorded over 33000 and 45000 simulation time steps, respectively of 0.0964 fs each. The solvation dynamics of these non-equilibrium simulations indicate an increase in the carbon–hydrogen bond length due to hydride transfer, commencing after ∼8000 and 10000 time steps, for the water and methanol systems, respectively. For the purpose of prediction using our trained ML model, these initial time steps will be treated as the reactant configurations.15 A voxel sample over every 100 time steps i.e. 9.64 fs would lead to 80 and 100 samples, for the reactant trajectories in water and methanol, respectively. The spatio-temporal 3d CNN provides encoded latent features for these samples from the reactant trajectories. When we are interested in quantifying the overall solvent configuration changes of the system as a whole using these simulations, it must be noted that even highly reoriented systems may have points in their trajectories where configurations do not change much and vice versa for lower reoriented systems. Therefore, the average of the Mahalanobis distance of each of the sample encoded features, from the distributions of features of the low temperature cellobiose systems at 100 K and 500 K, is used to assess solvent configuration changes, as shown in Fig. 7(a). This indicates that methanol reorients to a greater extent than water during the hydride transfer reaction in glucose isomerization. This is substantiated by the fact that a change in the charge structure of glucose during the hydride shift has the tendency to polarize methanol to a larger extent than water because of which its activation free energy barrier (ΔG (kJ mol−1)) for the hydride transfer is 50 kJ mol−1 higher15 (Fig. 7(a)). The large electronic polarization of methanol results in its significant reorientation during the hydride transfer and much slower relaxation dynamics resulting in the non-equilibrium solvation of the transition state, while water undergoes lower electronic polarization and subsequently reorients to a lesser extent because of which its relaxation dynamics is near equilibrium solvation in going from the transition state to the product, both of which are better solvated by water and result in lower activation energy of hydride transfer.
For the fructose systems, MD simulations have been carried out at different solvent compositions and have been recorded every 200 time steps for a duration of 10 ns. Hence, a voxel sample over every 100 ps would lead to 100 samples from each of these trajectories. The average of the Mahalanobis distance of each of these 100 fructose features from the distributions of features of the cellobiose systems at 100 K and 500 K is shown in Fig. 7(b). The results are seen to concur with the trends of the relative stability of hydronium ions at different DMSO concentrations given by the difference (ΔG) in the free energy surface (FES) minimum corresponding to the hydronium ion in the first solvation shell of fructose and that of the FES minimum corresponding to the hydronium ion in the bulk solvent,53 as shown in Fig. 7(b). The increase in ΔG when DMSO goes from 0 to 5% wt can be attributed to the instability of DMSO molecules in the bulk solvent, generating a rich local domain of DMSO molecules near fructose while the water (solvent) molecules in the bulk stabilize the hydronium ion. However, as the DMSO concentrations increase from 5 to 80% wt a clear descending trend is observed in ΔG and the average distance from the strongly reoriented cellobiose systems (class 1), suggesting that the relative stability of hydronium ions in the first solvation shell of fructose increases. Brønsted acid catalysis of biomass components in water and co-solvent had a higher proportion of water in the first solvation shell compared to the bulk. Such water enriched local domains are formed with an increase in the co-solvent concentration around the hydroxyl groups of reactants, promoting proton transfer,18,73–77 making the reaction highly dependent on the water concentration in the first solvation shell. Aprotic solvents such as DMSO have been reported to drastically increase the HMF selectivity and fructose conversion78–82 owing to the furanose conformation83 that stabilizes the TS84,85 and induces hydroxyl (fructose) interactions with water.3 Moreover, with an increase in the DMSO concentration acid-catalysis is enhanced as the relative stability of the proton near fructose (reactant) and HMF (product), increases and decreases, respectively.77 In polar aprotic solvents, the free energy of H+ solvation decreases by 24 kJ mol−1 compared to water.11 Recombination of protons was promoted, leading to decreased dehydration rates suggesting that polar aprotic solvents significantly destabilize the proton. In the case of glucose dehydration to HMF, glucose hydroxyl groups and solvents compete for the proton. Addition of DMSO as co-solvent improved the conversion, because of its lower affinity for protons.86,87
It must be noted that the free energy barriers indicate different aspects in both of the systems discussed above. Hence, an identical trend in the free energy barrier is not observed across both the systems as they get closer in distance to the low temperature cellobiose systems where the melt phase is found to reorganize significantly. In the water and methanol systems, the charge distribution in the reactant glucose due to the hydride transfer polarizes the solvent molecules, and leads to higher reaction energy barrier in more polar solvents such as methanol that reorient to a greater extent by virtue of the reaction, therefore found to lie at a closer distance to the highly reorganizing lower temperature melt-phase cellobiose systems. However, in fructose systems, the relative concentrations of the protic solvent (water) and the aprotic co-solvent (DMSO) govern the instability of the hydronium ion in the bulk, thereby providing a driving force for it to migrate towards the first solvation shell of fructose in the presence of large concentrations of aprotic DMSO. The free energy difference in this case is a measure of the thermodynamic instability in the first solvation shell as opposed to the bulk. Hence, lower free energy corresponds to a decline in the relative instability, i.e., an increase in stability of the first solvation shell, prompting solvent reorientation that subsequently facilitates the reactive transformation of fructose into HMF, and is consistent with the decrease in distance from the lower temperature cellobiose systems (denoted as class 1 in ML classifier terminology).
A 3d-CNN autoencoder for spatio-temporal feature extraction is trained on both the simulation trajectories of the solvent molecules surrounding the reactant and product molecules in the condensed phase transglycosylation reaction of cellobiose at different temperatures. The probability distributions across the RMSD of the features between the product and the reactant profiles are seen to show a higher probability of solvent configuration changes for the lower temperature finite simulations at 100 K and 500 K. These findings are consistent with the linear decrease in the free energy barrier with increasing temperatures, supporting that the reaction kinetics is impacted by the reorganization of the melt-phase from the reactant to the product. To assess the extent of solvent configuration changes in newer systems, a quadratic classifier based on the Mahalanobis distance metric is then used to calculate the average distance between features from the trajectory of the solvent molecules around the reactant in test systems; and the distribution of features from the reactant trajectory of the strongly reorganizing low temperature cellobiose systems (100 K and 500 K). The ML model assesses methanol to reorient to a greater extent than water and is consistent with the prior findings in the literature where the changes in the charge structure due to hydride transfer in glucose have a tendency to polarize methanol to a greater extent than water, because of which the reaction activation free energy is ∼50 kJ mol−1 higher in methanol. For the fructose dehydration systems, the average Mahalanobis distance from the cellobiose systems at 100 K and 500 K is seen to increase at first and then decrease almost linearly with increasing concentrations of DMSO, consistent with the trends in the difference between the free energy surface minima that point to a larger impact of solvent configuration changes on reaction kinetics with increasing DMSO concentrations. It has been demonstrated that the ML framework can generalize well when it comes to predicting the extent of solvent reorganization across different systems by using their reactant trajectory simulations. Hence, it can be used to limit computational efforts when simulating product trajectories in systems where solvent configuration changes are found to be lower, by eliminating the dynamics of the said molecules and/or eliminating the necessity for an explicit condensed phase environment. Consequently, if a library of solvents is to be screened for their suitability in chemical reactions using molecular simulations, one could assess whether or not to explicitly simulate the solvent molecules with the aid of the ML framework presented in this paper.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00049h |
This journal is © The Royal Society of Chemistry 2024 |