Sukriti Singh*a and Raghavan B. Sunoj*ab
aDepartment of Chemistry, Indian Institute of Technology Bombay, Mumbai 400076, India. E-mail: sukriti243@gmail.com; sunoj@chem.iitb.ac.in
bCentre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Mumbai 400076, India
First published on 11th April 2022
Minimizing the time and material investments in discovering molecular catalysis would be immensely beneficial. The high contemporary importance of homogeneous catalysis in general, and asymmetric catalysis in particular, makes them some of the most compelling systems for leveraging the power of machine learning (ML). We see an overarching connection between powerful ML tools, such as the transfer learning (TL) used in natural language processing (NLP), and the chemical space, when the latter is described using SMILES strings conducive to representation learning. We developed a TL protocol, first trained on 1 million molecules, and exploited its ability to make accurate predictions of yield and enantiomeric excess for three diverse reaction classes, encompassing over 5000 transition metal- and organo-catalytic reactions. The TL-predicted yields for the Pd-catalyzed Buchwald–Hartwig cross-coupling reaction were the most accurate, with an impressive RMSE of 4.89, implying that 97% of the predicted yields were within 10 units of the actual experimental value. In the case of catalytic asymmetric reactions, such as the enantioselective N,S-acetal formation and asymmetric hydrogenation, RMSEs of 8.65 and 8.38, respectively, could be obtained, with the predicted enantioselectivities (%ee) lying within 10 units of their true values ∼90% of the time. The method is highly time-economic, as the workflow bypasses the collection of molecular descriptors, and is therefore of direct relevance to the high-throughput discovery of catalytic transformations.
From a sustainability standpoint, it is high time that we endeavor to develop faster, reliable, and less resource-intensive (e.g., in time and material) discovery workflows. To make this goal more realistic, ML-driven protocols have a highly promising role to play.16,17 The yield and enantiomeric excess are quantifiable indicators of how good an (asymmetric) catalytic method is, particularly in its developmental stage. One would inevitably encounter a high-dimensional chemical space, composed of relevant molecular features of catalysts, substrates, solvents, additives, etc., when training ML models to predict the yields/ee of catalytic reactions.18,19 For instance, Rothenberg et al. built a classification and regression model for predicting the turnover number (TON) and turnover frequency (TOF) of 412 Heck reactions.20 A total of 74 physical organic descriptors were employed for reaction components such as the substrate, ligand, and solvent, in addition to reaction conditions (time, temperature, catalyst loading, etc.). Artificial neural networks were found to perform better than linear regression techniques. The trained model was then utilized to predict the TON and TOF of a virtual (in silico) library of 60 000 Heck reactions. In recent years, there has been a visible increase in efforts to develop new molecular representations offering improved performance and generalizability. Several approaches, other than those relying on quantum chemically derived molecular descriptors, have emerged. These methods primarily involve the use of various genres of structure-based representations.21–23
Representation learning methods, such as deep neural networks (DNNs) built on engineered features, have found profound applications in the chemical space.24,25 DNNs can also be trained directly on molecular representations, such as SMILES (simplified molecular input line entry system) strings, to learn feature representations with minimal feature engineering.26 This approach can grasp the underlying patterns in atomic connectivity and capture the relationship between such features and molecular properties. During the developmental phase of catalysis research, only small or fragmented datasets are typically available, necessitating a work-around should one choose to deploy DNNs. Tools from seemingly disparate domains may become valuable for the task at hand, provided that shared or latent characteristics exist between them.
Natural language processing (NLP) is one of the most visible domains of artificial intelligence; it provides computers with the capability to generate and analyze text and speech.27 The large labeled-data requirements in NLP can be circumvented by using transfer learning methods,28 wherein the knowledge acquired on one task (the source task) is retained and subsequently utilized for related tasks (the target task). NLP methods can therefore be deployed for tasks that rely on language or similar textual data. The SMILES representation of molecules can be considered analogous to a natural language. In fact, interesting applications of NLP-based methods to chemical reactions are now becoming available.29–31 The use of NLP-based models for accurate prediction of various molecular properties is well known.32,33 On the other hand, predicting a reaction outcome, which is known to depend on the molecular attributes of catalysts, reactants, solvents, and several other factors, is challenging and has seldom been reported using language models.
Currently, most ML models for ee or yield prediction are custom-made for specific reactions, limiting their direct transferability to another reaction class. Further, such models are built on atomic/molecular descriptors obtained through workflows involving time-consuming quantum chemical computations on a large library of molecules. We envisaged that NLP methods, in conjunction with the SMILES representation of the molecular space, could offer learning tools suitable for chemical catalysis. Such approaches could be transferable and time-economic. Herein, we design ML models that can predict both the ee and the yield of catalytic reactions. To demonstrate our ML protocol, a repertoire of transition metal- as well as organo-catalytic transformations of high recent interest is chosen (Fig. 1a): the (1) Buchwald–Hartwig reaction,34 (2) enantioselective N,S-acetal formation,35 and (3) asymmetric hydrogenation of alkenes/imines36 (ESI Tables S1, S15, and S26†). It is worth noting that these reactions are of high contemporary importance, owing primarily to their practical utility. For instance, reaction-1 finds broad applicability in pharmaceutical synthesis.37 The BINOL-derived chiral phosphoric acid catalysts employed in reaction-2 are valuable in a number of synthetically useful reactions.38 Reaction-3 and the catalyst families therein are employed in the enantioselective synthesis of pharmaceuticals and agrochemicals.39 A trained ML model encompassing these scaffolds would therefore be beneficial for reaction development. The availability of performance indices of various ML models for these reactions makes them the most suitable candidates, as it enables us to place our NLP-based transfer learning results alongside the reported benchmarks. In addition, the data distributions shown in Fig. 1b are representative of several pragmatic scenarios, encompassing (a) a good number of samples (reaction-1) with a balanced distribution (reactions 1 and 2), or (b) fewer samples with an unequal distribution between the high and low ee/yield regions (reaction-3). Therefore, we believe that a unified TL model, applicable to all these types of data distribution, would find superior utility.
Fig. 1 (a) The choice of reactions. (b) Distribution of the samples across various yield/ee regimes.
(1) General domain language model (LM) pre-training: given a sequence, a trained LM can predict the probability of the next word. In the context of SMILES, a chemical LM is trained to predict the next character in a sequence of SMILES strings. To efficiently learn the characteristics of the SMILES representation, a large amount of data is required. For this purpose, we have pre-trained a general domain chemical LM using one million molecules from the ChEMBL database.40,41 This is known as a general domain LM, as it is trained on a large and diverse dataset to acquire a learned representation that captures the semantic and structural features of SMILES, beneficial for downstream tasks. Different strategies are available for SMILES pre-training.42–44 For instance, in the ULMFiT method, the prediction of the next character in the SMILES string is the key task. In masked language modeling, some of the input tokens are masked, and the model is trained to predict such masked tokens.43 In the SMILES equivalence approach, given two SMILES strings, the task is to predict whether they represent the same molecule.42 A generative strategy is also used for pre-training, wherein, from a given input SMILES representation, the model is trained to generate valid and equivalent SMILES.44
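As an illustration of the first of these strategies (next-character prediction), a minimal character-level sketch in PyTorch is given below; the toy corpus, tokenisation, model size, and single training step are simplified stand-ins and not the authors' exact ULMFiT implementation.

```python
# Minimal character-level SMILES language model: the network is trained to
# predict the next character of each SMILES string (illustrative sketch only).
import torch
import torch.nn as nn

class SmilesCharLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)  # logits over the next character

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h)

# toy stand-in for the ~1 million ChEMBL SMILES used for pre-training
corpus = ["CCO", "c1ccccc1Br", "CC(=O)Nc1ccc(O)cc1"]
vocab = sorted({ch for s in corpus for ch in s}) + ["<pad>"]
stoi = {ch: i for i, ch in enumerate(vocab)}

def encode(s, length=32):
    ids = [stoi[ch] for ch in s][:length]
    return ids + [stoi["<pad>"]] * (length - len(ids))

x = torch.tensor([encode(s) for s in corpus])
model = SmilesCharLM(len(vocab))
loss_fn = nn.CrossEntropyLoss(ignore_index=stoi["<pad>"])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# one illustrative training step: predict character t+1 from characters up to t
logits = model(x[:, :-1])
loss = loss_fn(logits.reshape(-1, len(vocab)), x[:, 1:].reshape(-1))
loss.backward()
opt.step()
```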
(2) Target task LM fine-tuning: in the target task, we have chosen three reactions (Fig. 1a), which form a labeled dataset with reaction yield or ee as the output. Following the standard protocol in transfer learning, the knowledge acquired in the previous pre-training step is utilized for the target task. Consequently, the LM is fine-tuned on the target-task data to learn and predict the next character in a SMILES sequence. The key difference from the previously trained general domain LM is that the model has now learned the task-specific features of the SMILES language.
(3) Target task regressor: since the goal of our machine learning model is to predict the yield/ee of the reaction of interest, in the third step, the pre-trained or fine-tuned LM is accordingly expanded for the required regression task.
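A minimal sketch of how such a regressor could be assembled is shown below, assuming a character-level encoder like the one sketched earlier; the mean pooling and head dimensions are illustrative choices, not the published architecture.

```python
# Attach a small regression head to the (pre-trained or fine-tuned) LM encoder
# so that a tokenised reaction SMILES maps to a single scalar (%yield or %ee).
import torch
import torch.nn as nn

class SmilesRegressor(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        # the encoder mirrors the language model so its weights can be copied in
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
        self.regressor = nn.Sequential(nn.Linear(hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.regressor(h.mean(dim=1)).squeeze(-1)  # simple mean pooling over the sequence

vocab_size = 40                                # assumed SMILES character-vocabulary size
reg = SmilesRegressor(vocab_size)
# reg.embed.load_state_dict(lm.embed.state_dict())  # hypothetical weight transfer from the LM
# reg.lstm.load_state_dict(lm.lstm.state_dict())

x = torch.randint(0, vocab_size, (8, 64))      # a batch of 8 tokenised reaction SMILES
y = torch.rand(8) * 100.0                      # experimental yields (in %) as training targets
loss = nn.MSELoss()(reg(x), y)
loss.backward()
```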
Fine-tuning the target regressor is crucial in transfer learning, as aggressive fine-tuning might even nullify the benefits of a trained LM. We examined two fine-tuning approaches, which differ in how the weights in the hidden layers are handled. In the first approach, the model is initialized with the pre-trained (or fine-tuned) weights and the full model is trained at once; in other words, the first protocol employs a fixed learning rate with no frozen weights. The second technique involves gradual unfreezing: training starts with frozen weights, and the layers are unfrozen step-by-step until the entire model is unfrozen and fine-tuned. The results presented in the manuscript are obtained using the first protocol, without gradual unfreezing; a performance comparison of both fine-tuning methods is also provided (ESI sections 2 and 11†). Both protocols are sketched below.
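The following hedged sketch contrasts the two protocols for a model with embed/lstm/regressor sub-modules as in the earlier sketch; the `train_step` and `make_optimiser` callables are placeholders for an ordinary training loop and optimiser factory, not functions from the authors' code.

```python
# Two regressor fine-tuning protocols (illustrative sketch).
def finetune_all_at_once(model, optimiser, train_step, n_epochs=10):
    # protocol 1: all weights trainable from the start, single fixed learning rate
    for p in model.parameters():
        p.requires_grad = True
    for _ in range(n_epochs):
        train_step(model, optimiser)

def finetune_gradual_unfreezing(model, make_optimiser, train_step, epochs_per_stage=2):
    # protocol 2: freeze everything, then unfreeze one block per stage
    # (head first, encoder last) until the whole model is being fine-tuned
    for p in model.parameters():
        p.requires_grad = False
    for block in (model.regressor, model.lstm, model.embed):
        for p in block.parameters():
            p.requires_grad = True
        optimiser = make_optimiser([p for p in model.parameters() if p.requires_grad])
        for _ in range(epochs_per_stage):
            train_step(model, optimiser)
```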
Fig. 2 Illustration of the conversion of a representative sample from its SMILES strings to a machine-readable input, and a schematic diagram of the model architecture. More details on the model architecture are described in section 1.2 of the ESI.†
Exact splits are not readily available for all the reactions considered here. In the earlier report on reaction-1, only one 70:30 train-test split was used, while for reaction-2, ten different 600:475 train-test splits were used. In the case of reaction-3, one hundred different 80:20 train-test splits were used. To ensure uniformity across all three reactions considered in this study, we have employed ten different 80:20 train-test splits; more precisely, the full set of samples of a given reaction type was randomly split into 70:10:20 train-validation-test sets. All hyperparameter tuning was performed on the validation set, and the best set of hyperparameters thus obtained was used for prediction on the test set (ESI sections 12 and 13†).
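A minimal sketch of this splitting protocol with scikit-learn is shown below; the dataset size and random seeds are placeholders.

```python
# Ten random 80:20 train-test splits, with the 80% training portion further
# divided so that the overall ratio is 70:10:20 train/validation/test.
import numpy as np
from sklearn.model_selection import train_test_split

def make_splits(n_samples, n_repeats=10):
    splits = []
    for seed in range(n_repeats):
        idx = np.arange(n_samples)
        train_val, test = train_test_split(idx, test_size=0.20, random_state=seed)
        # 0.125 of the remaining 80% corresponds to 10% of the full dataset
        train, val = train_test_split(train_val, test_size=0.125, random_state=seed)
        splits.append((train, val, test))
    return splits

splits = make_splits(n_samples=4000)   # placeholder dataset size
```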
In the present study, we endeavor to combine SMILES and ULMFiT to build a unified model that can predict (a) the enantiomeric excess (%ee) in asymmetric catalysis and (b) the yield (expressed as % of product formed) of catalytic reactions. Here, the source task is chemical language modeling, and the target task is a regression problem to predict %ee or %yield (ESI Fig. S4†). The potential of this concept is evaluated on the reactions shown in Fig. 1a. The reported root mean square errors (RMSEs, with respect to the corresponding experimental values) for reactions 1 and 3 were 7.8% and 8.4%, respectively, with the best-performing random forest (RF) algorithm built on quantum mechanically derived descriptors.34,36 In the case of reaction-2, a support vector machine (SVM) gave a mean absolute error (MAE, in terms of ΔG‡R − ΔG‡S) of 0.15 ± 0.005 kcal mol−1.35
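Because the reaction-2 benchmark is quoted as an error in the activation free-energy difference rather than in %ee, the two scales can be related through the standard Boltzmann expression |ΔΔG‡| = RT ln[(1 + ee)/(1 − ee)]. The sketch below, which assumes T = 298 K for illustration, is provided only for orientation and is not part of the reported workflow.

```python
# Interconversion between %ee and the activation free-energy difference of the
# two enantiomeric pathways, |ddG| = RT*ln[(1 + ee)/(1 - ee)]; T = 298.15 K is
# an assumption made here for illustration.
import math

R_KCAL = 1.987204e-3            # gas constant, kcal mol^-1 K^-1

def ee_to_ddg(ee_percent, T=298.15):
    ee = ee_percent / 100.0
    return R_KCAL * T * math.log((1 + ee) / (1 - ee))

def ddg_to_ee(ddg_kcal, T=298.15):
    r = math.exp(ddg_kcal / (R_KCAL * T))
    return 100.0 * (r - 1) / (r + 1)

print(round(ee_to_ddg(90.0), 2))   # 90% ee corresponds to ~1.74 kcal/mol at 298 K
print(round(ddg_to_ee(0.15), 1))   # a ddG of 0.15 kcal/mol corresponds to ~12.6% ee
```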
We have pre-trained a language model with the ULMFiT architecture on the SMILES strings of 1 million molecules (ESI section 1†). This model is henceforth denoted as TL-m1 (i.e., with pre-training, Fig. 3a) and is used for predicting %yield and %ee. The model performance, estimated as the RMSE (in %yield or %ee) averaged over 30 independent runs, is provided in Fig. 3b (ESI sections 5.8, 6.6, and 7.6†). In this approach, we noticed that varying levels of SMILES augmentation, in the range of 25 (reaction-1) to 100 (reaction-3), ensure improved performance (ESI section 8†).50 It is discernible from the summary of results that transfer learning returns better predictions. For instance, the most significant improvement is noted for reaction-1, with an RMSE of 4.89 ± 0.33 for our 80:20 train-test approach (ESI Table S13†). With a 70:30 split, we obtained an RMSE of 5.11 ± 0.47, compared with the earlier benchmark of 7.8. For reaction-3, however, TL-m1 yielded a result very similar to that of the reported RF model (ESI Tables S30 and S35†). RMSEs of 8.65 ± 0.80 and 8.38 ± 1.40 are noted for reactions 2 and 3, respectively (ESI Tables S19 and S25†).
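As an illustration of the SMILES augmentation mentioned above, a minimal sketch using RDKit's randomised (non-canonical) SMILES output is shown below; the molecule and the number of variants are arbitrary examples.

```python
# Generate several chemically equivalent but differently written SMILES strings
# for one molecule; in training, each entry can be replicated with such variants
# (e.g., 25- to 100-fold) to augment the data.
from rdkit import Chem

def augment_smiles(smiles, n_variants=25, max_tries=1000):
    mol = Chem.MolFromSmiles(smiles)
    variants = set()
    for _ in range(max_tries):
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
        if len(variants) >= n_variants:
            break
    return sorted(variants)

print(augment_smiles("CC(=O)Nc1ccc(O)cc1", n_variants=5))
```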
At this juncture, in keeping with standard practice in NLP, we wondered whether LM fine-tuning on the target task might be beneficial (TL-m2, i.e., with fine-tuning, Fig. 3a). For all three reactions, no improvement in the predictive capability of TL-m2 over TL-m1 is noticed (Fig. 3b). The advantage of the TL models can be evaluated by comparing the test performance of TL-m1 and TL-m2 with that of the ULMFiT architecture devoid of both pre-training and fine-tuning. An ablation study without TL is therefore carried out; such a model, denoted as TL-m0 (i.e., with no TL, Fig. 3a), is found to be inferior to TL-m1/m2. In addition, the TL architecture can be used across different reaction classes (ESI sections 5.2, 5.4, 6.2, 6.4, 7.2 and 7.4†).
As discussed in the previous section, a second protocol involving fine-tuning of the regressor using gradual unfreezing is also investigated. In this case, TL is found to be of no benefit for reaction-1, as can be seen from the corresponding performance metrics of TL-m1 (6.02 ± 0.29), TL-m2 (6.69 ± 0.27), and TL-m0 (5.84 ± 0.49) (ESI sections 5.3 and 5.4†). However, for reaction-3, TL is found to be more effective, as evident from the performance of the TL-m1 (8.56 ± 1.46), TL-m2 (8.61 ± 1.34), and TL-m0 (10.67 ± 2.54) models (ESI sections 7.3 and 7.4†). Similar results are obtained for reaction-2 as well, with TL-m1 (8.88 ± 0.96), TL-m2 (9.11 ± 1.15), and TL-m0 (11.83 ± 1.75) (ESI sections 6.3 and 6.4†). On comparing the results of fine-tuning the regressor with or without gradual unfreezing, some interesting observations could be made. The largest performance boost is obtained for reaction-1, where the test RMSE for TL-m1/m2 decreased from 6.02 ± 0.29/6.69 ± 0.27 to 4.89 ± 0.33/5.27 ± 0.34 upon removing gradual unfreezing. For reactions 2 and 3, no significant performance change is noted (ESI section 15†).
We noticed that there is no significant improvement with TL-m2, where the model is fine-tuned on a task-specific dataset (Fig. 3b). Fine-tuning is a delicate process, as it affects how much of the information learned from pre-training is retained. One way to maintain this balance is to use gradual unfreezing while fine-tuning. Therefore, fine-tuning with gradual unfreezing was also attempted, to check whether the additional task-specific information makes a notable difference. We noticed that the trend is the same across all three reactions whether the model is fine-tuned with or without gradual unfreezing (ESI Table S43†).
Several intriguing aspects emerged from a comparison between the predictive capabilities for each reaction class and the nature of the corresponding data distributions given in Fig. 3b and 1b. For instance, the best performance among all the reaction classes could be obtained with TL-m1 for reaction-1, which has abundant and uniformly distributed output values. On the other hand, the most difficult system to predict in Fig. 4a is reaction-3, where both the TL-m1/m2 results turned out to be very similar to the previously established RF performance. This can be attributed to the small and sparse dataset along with class imbalance, wherein the data are clustered around the high-ee region; this may lead to large deviations in the predictions for high-ee samples with similar features. Another general strength of TL-m1 can be gleaned from the overall quality of the %yield and %ee predictions, both across all three reactions (1107 predictions in a typical test run, Fig. 3c) and for the individual reaction classes (Fig. 3d). Of these, 95% of the predictions were found to be within 10 units of the actual experimental value.
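For clarity, the "within 10 units" figures quoted here and in the abstract are simply the fraction of absolute prediction errors not exceeding 10 units of %yield or %ee; a toy illustration with dummy numbers is given below.

```python
# Fraction of predictions within a given tolerance of the experimental value
# (dummy numbers, for illustration only).
import numpy as np

def fraction_within(y_true, y_pred, tolerance=10.0):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= tolerance))

y_true = [92.0, 45.0, 78.0, 5.0]        # dummy experimental yields/ee
y_pred = [88.0, 51.0, 80.0, 18.0]       # dummy model predictions
print(fraction_within(y_true, y_pred))  # 0.75 for this toy example
```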
Fig. 4 (a) The plots of predicted versus observed yields/ee for all three reactions. (b) Comparison of the performance of different machine learning models for yield prediction on the Buchwald–Hartwig cross-coupling dataset (reaction-1). The following acronyms are used: ml-QM-GNN: Quantum Mechanical descriptors and Graph Neural Network; DFT: Density Functional Theory; MFF: Multiple Fingerprint Features; BERT: Bidirectional Encoder Representations from Transformers; ULMFiT: Universal Language Model Fine-Tuning. It should be noted that the R2 values used here are exclusively for reaction-1, to enable a direct comparison with previous reports, as it is the only performance metric available for this reaction across different ML protocols (ref. 21–23).
To further analyze the model performance, we have plotted the predicted versus observed values for the best-performing test set as a representative example for all three reactions (Fig. 4a). The plot for reaction-1 is the most impressive, with an R2 of 0.97 corresponding to an RMSE of 4.3. In the case of reaction-3, the data are clustered around the high-ee region (Fig. 1b); therefore, R2 may not be a good metric of model performance for this reaction. Nevertheless, the plot for reaction-3 is provided in Fig. 4a, with an R2 of 0.80 corresponding to an RMSE of 7.0. In addition, we believe that the utility of TL-m0 should equally be acknowledged, as it offers the most time-economic performance for problems with a uniform distribution of samples (as in reaction-1). Beyond the comparison between different transfer learning models, we have also undertaken an explicit performance comparison of our protocol with earlier approaches, taking the most well-studied Buchwald–Hartwig cross-coupling reaction as the example.
Now, consider a practically likely situation wherein one looks for a high-yielding coupling partner in the late-stage functionalization of a potential drug candidate, say using Buchwald–Hartwig amination or asymmetric hydrogenation. Equipped with the once-and-for-all trained TL-m1 model for these reactions, it is straightforward to predict whether or not the intended substrate is likely to be successful. Here, the entire pipeline of feature extraction using quantum chemical computations can be bypassed by generating SMILES strings for the substrates concerned, in a matter of a few minutes, making our protocol highly time-economic (ESI section 9†). In essence, when the initial batch of results (say, yields or ee values) becomes available during the development of a catalytic reaction, our transfer learning approach can be trained on the new data, or previously trained models can be augmented with the new samples. The refined model thus developed would be able to quickly predict the outcome for unseen samples, i.e., those yet to be tested experimentally. This will help channel time and resources toward promising samples rather than testing many additional ones.
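A hedged usage sketch of this screening step is given below; the helper function, tokenisation, and candidate SMILES are hypothetical and reuse the illustrative names from the earlier sketches rather than the authors' released code.

```python
# Predict the outcome for a new, untested substrate combination directly from
# its SMILES string; "model" and "stoi" refer to a trained regressor and a
# character vocabulary as in the earlier sketches (illustrative names).
import torch

def predict_outcome(model, stoi, reaction_smiles, max_len=128):
    ids = [stoi[ch] for ch in reaction_smiles if ch in stoi][:max_len]
    ids += [stoi["<pad>"]] * (max_len - len(ids))
    with torch.no_grad():
        return model(torch.tensor([ids])).item()

# hypothetical late-stage candidate written as concatenated reactant SMILES:
# candidate = "Brc1ccc2ncccc2c1.NCc1ccccc1"
# print(predict_outcome(trained_regressor, stoi, candidate))
```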
In summary, our method directly uses SMILES strings as the input representation and thus does not require any additional step to convert SMILES into other frequently used representations. This bypasses all feature extraction steps and is therefore highly time-economic. Our method can be used for both yield and ee predictions. We have demonstrated the ability of our protocol on three different catalytic reactions with varying dataset sizes (<400 to >4000) and data distributions (rich/balanced to sparse/imbalanced). One limitation, in comparison to physical organic descriptors, is the difficulty of gathering additional chemical or mechanistic insight from the trained model. The fine-tuning strategy has an impact on model performance: one method of fine-tuning, say gradual unfreezing, may benefit a particular dataset but not another. A possible strategy would therefore be to investigate the effect of fine-tuning on datasets of different sizes, to understand how reaction-specific information can be made more useful for learning. This could be especially valuable for small datasets.
Footnote
† Electronic supplementary information (ESI) available. See https://doi.org/10.1039/d1dd00052g |