Sheng Li‡ab, Xinqiao Wang‡b, Yejian Wub, Hongliang Duanbc and Lan Tang*a
aCollege of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, P. R. China. E-mail: tanglan@zjut.edu.cn
bArtificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, P. R. China
cState Key Laboratory of Drug Research, Shanghai Institute of Materia Medica (SIMM), Chinese Academy of Sciences, Shanghai 201203, China
First published on 25th November 2022
Deep learning has enormous potential in the chemical and pharmaceutical fields, and generative adversarial networks (GANs) in particular have exhibited remarkable performance in the field of molecular generation as generative models. However, their application in the field of organic chemistry has been limited; thus, in this study, we attempt to utilize a GAN as a generative model for the generation of Diels–Alder reactions. A MaskGAN model was trained with 14092 Diels–Alder reactions, and 1441 novel Diels–Alder reactions were generated. Analysis of the generated reactions indicated that the model learned several reaction rules in-depth. Thus, the MaskGAN model can be used to generate organic reactions and aid chemists in the exploration of novel reactions.
In recent years, owing to the constant development of computational techniques, AI has achieved remarkable results in the fields of retrosynthesis and reaction prediction.3,4 Zheng et al. applied a transformer model to develop a template-free self-corrected retrosynthesis predictor and predicted retrosynthesis reactions with an accuracy of 59.0% on a standard benchmark data set.5 Wang et al. reported a method that utilized transfer learning to enhance the accuracy of a transformer model (94.9%), which is higher than the accuracy of a transformer-baseline model (66.3%).6 Recently, the application of deep neural networks in the field of pharmaceutical chemistry, particularly for drug molecular generation and toxicity risk assessment, has received extensive attention.7,8 Lee et al. applied a generative adversarial network (GAN) to de novo molecular design and demonstrated high performance in the five distribution learning benchmarks of the GuacaMol framework.9 The success of generative models in molecular generation inspired researchers such as Bort et al. and Wang et al. to further explore reaction generative models.10,11
Generative models are an important class of machine learning models capable of generating new data that are not included in the training dataset, and they have exhibited considerable potential in image,12 text,13 and sound generation14 in the past few years. Among generative deep learning models, the GAN has attracted considerable attention from researchers; it was first proposed by Goodfellow et al. as a novel framework for estimating generative models via an adversarial process.15 A GAN utilizes two adversarial networks: a generator that captures the distribution of the data and a discriminator that estimates the probability that a sample belongs to the training data. They compete until the discriminator is unable to distinguish between the real data and the data generated by the generator. This operating mechanism has allowed GANs to outperform other models by generating more realistic images.16 However, the application of GANs in the field of chemistry had been limited owing to the discreteness of simplified molecular-input line-entry system (SMILES) strings, which stand in for molecular structures as the input data. To overcome this issue, a policy gradient-based reinforcement learning approach, proposed by Sutton et al., was applied to GANs to provide a feedback signal to the generator and has been adopted by several researchers.17 Lin et al. used GANs for de novo molecular design, dimensionality reduction, and de novo peptide and protein design,18 while Maziarka et al. reported an improved CycleGAN-based molecular optimization model called mol-CycleGAN that could generate optimized compounds with the desired properties and structures similar to the originally provided molecules.19 Prykhodko et al. proposed a novel deep learning architecture called LatentGAN for de novo molecular design that combined autoencoders and GANs.20 In this study, we attempt to utilize GANs to generate novel reactions and expand the scope of chemical reactions.
Fig. 1 shows a flowchart for the generation of Diels–Alder reactions using a GAN, where “Samples” refers to the dataset and “Generated” refers to the novel reactions generated by the GAN. In this study, we converted the reactions of the dataset into SMILES strings prior to importing them into the GAN model. Subsequently, the discriminator and generator of the GAN were trained to generate novel Diels–Alder reactions.
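The conversion of a reaction into a token sequence that a sequence model can consume may be illustrated with a minimal sketch. The example reaction (1,3-butadiene and ethylene giving cyclohexene), the regular expression, and the function name `tokenize_reaction` are illustrative assumptions; the tokenization scheme actually used in this study is not specified here.

```python
import re

# Hypothetical token pattern (an assumption, not the authors' scheme):
# bracket atoms, two-letter halogens, the reaction arrow, two-digit ring
# closures, single-letter atoms, ring-closure digits, then bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|>>|%\d{2}|[A-Za-z]|\d|[=#\-+()./\\:~@*$>]"
)


def tokenize_reaction(reaction_smiles):
    """Split a reaction SMILES string into a list of tokens."""
    tokens = SMILES_TOKEN.findall(reaction_smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != reaction_smiles:
        raise ValueError(f"untokenizable characters in {reaction_smiles!r}")
    return tokens


# Illustrative Diels-Alder reaction written as diene.dienophile>>product
example = "C=CC=C.C=C>>C1=CCCCC1"
tokens = tokenize_reaction(example)
print(tokens)
```

Joining the tokens reproduces the input string, so no information is lost before the sequence is passed to a model.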
The dataset from which the GAN model learns has a significant effect on the model's performance, and thus an applicable dataset must be chosen. As the Diels–Alder reaction is one of the most effective and widely used organic reactions in drug and material synthesis, we chose Diels–Alder reactions for the training dataset in this study. The Diels–Alder cycloaddition reaction consists of the cyclization of a diene and an alkene to form a cyclohexene derivative and was discovered by O. Diels and K. Alder when they established the structure of the cycloadduct of p-quinone and cyclopentadiene in 1928.21 Since its discovery, extensive data on Diels–Alder reactions have been reported, which is another reason for choosing the Diels–Alder reaction dataset in this study. Boger et al. reported that a key step in the laboratory synthesis of the seven-membered C-ring of rubrolone aglycon was the intermolecular Diels–Alder reaction of an electron-rich diene with an extremely strained dienophile. Furthermore, they reported an excellent cycloaddition yield of 97% with the products exhibiting complete enantioselectivity.22 In a Diels–Alder reaction, the reactants move closer to each other and interact to form a cyclic transition state, which gradually transforms into a product molecule.
In this study, we utilized MaskGAN, which is composed of a generator, a discriminator, and a critic network, for the generation of novel Diels–Alder reactions.23 The generator network uses a sequence-to-sequence model with an attention mechanism. The Reaxys database was used to construct a training dataset of Diels–Alder reactions, which were converted to SMILES strings and then imported into the MaskGAN model for training and reaction generation. The model generated 1441 novel Diels–Alder reactions, which were compared with the Diels–Alder reaction dataset to verify their novelty. Combining the new reactions from this study with AI could accelerate the discovery of reactions and consequently enhance the accuracy of organic reaction prediction.
The generator network of MaskGAN uses a sequence-to-sequence (seq2seq) architecture26 with an attention mechanism.27 In MaskGAN, the seq2seq model is implemented as a long short-term memory (LSTM) network28 with an encoder and a decoder. The encoder processes every element in the input sequence and compiles the captured information into a context vector. The encoder then sends the context vector to the decoder, which produces the output sequence item by item until the entire sequence has been generated. The discriminator uses the same seq2seq structure as the generator, except that it outputs a scalar probability at every time step. In addition, the critic in MaskGAN helps the generator converge more rapidly by reducing the high variance of the gradient updates in an environment with a large action space, which enables a more stable training procedure.
To train the model, 80% of the downloaded Diels–Alder reactions were used for training, while the remainder were used for validation. The model was trained with a batch size of 512 for 300 epochs, the masking ratio of the input sequence was set to 0.1, and the network parameters were optimized using the Adam optimizer with a weight decay of 0.001. The base learning rate was set to 0.01 and was multiplied by 0.9 after every epoch. All experiments were implemented using PyTorch 1.7.0 (for the detailed package versions, see https://github.com/hongliangduan/Generation-of-novel-Diels-Alder-reaction-using-a-GAN-.git).
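The training settings above can be made concrete with a small, framework-free sketch. It assumes that the 0.9 factor is applied once per completed epoch, that the split is a random shuffle, and that masked positions are drawn uniformly at random; the helper names, the `<m>` mask token, and the fixed seeds are hypothetical and are not taken from the authors' repository.

```python
import random

BASE_LR = 0.01        # base learning rate stated in the text
DECAY = 0.9           # multiplicative factor applied each epoch
MASK_RATIO = 0.1      # fraction of input tokens that are masked
TRAIN_FRACTION = 0.8  # share of reactions used for training


def learning_rate(epoch):
    """Learning rate after `epoch` completed epochs (epoch 0 uses the base rate)."""
    return BASE_LR * DECAY ** epoch


def split_dataset(items, train_fraction=TRAIN_FRACTION, seed=0):
    """Shuffle the items and split them into training and validation lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def mask_tokens(tokens, ratio=MASK_RATIO, mask="<m>", seed=0):
    """Replace a fraction `ratio` of the tokens (at least one) with a mask token."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [mask if i in positions else tok for i, tok in enumerate(tokens)]


train, val = split_dataset(range(100))
print(len(train), len(val))
print([learning_rate(e) for e in range(3)])
print(mask_tokens(["C"] * 20))
```

Because the decay is multiplicative, the learning rate shrinks geometrically over the 300 epochs; in PyTorch the same schedule would typically be expressed with an exponential learning-rate scheduler rather than a hand-written function.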
To further investigate the generated novel reactions, they were analyzed at the molecular level. Table 1 shows the number of valid molecules and their proportion of the 10000 generated molecules for every component in the reactions. Table 2 shows the numbers and proportions of unique and novel reactants and products in the generated set.
Components in the reactions | Total | Valid molecules (amount) | Valid molecules (rate) |
---|---|---|---|
Dienes | 10000 | 7012 | 70.1% |
Dienophiles | 10000 | 7483 | 74.8% |
Products | 10000 | 3048 | 30.5% |
Components in the reactions | Unique molecules (amount) | Unique molecules (rate) | Novel molecules (amount) | Novel molecules (rate) |
---|---|---|---|---|
Dienes | 661 | 42.8% | 452 | 68.3% |
Dienophiles | 825 | 62.0% | 628 | 76.1% |
Products | 1394 | 97.0% | 1035 | 74.2% |
As shown in Fig. 4, when three carbon–carbon double bonds were present in the reactants, the two double bonds of the diene component reacted in the s-cis conformation in 98.4% of the reactions in the generated set. This indicates that our model can effectively learn this feature of the reaction mechanism. During a Diels–Alder reaction, the s-cis conformation is more favorable for the formation of the transition state. Therefore, dienes that are permanently in the s-trans conformation and cannot adopt the s-cis conformation will not undergo the Diels–Alder reaction. The two ends of such dienes cannot get close enough to the dienophile, and the reaction would require the formation of a six-membered ring containing a trans double bond. In contrast, dienes that are permanently in the s-cis conformation, such as cyclic dienes, significantly favor Diels–Alder reactions.
The generated reactions were further analyzed to establish a correlation between the generated reactions and the Diels–Alder reactions of the training set. The distribution of the reactants was visualized using MACCS29 molecular fingerprints and the t-distributed stochastic neighbor embedding (t-SNE)30 method. t-SNE is a variation of the stochastic neighbor embedding method proposed by van der Maaten et al.31 that is easier to optimize and reduces the tendency of points to cluster in the center of the map.
MACCS fingerprints are qualitative molecular descriptors comprising 166 bits that encode molecular features such as functional groups, plus one placeholder bit. t-SNE was used as a dimensionality reduction technique to visualize the MACCS molecular fingerprints of the reactants. Fig. 5(A) shows the t-SNE plot of the distribution of the MACCS fingerprints of the novel dienophile components in the generated set and the dienophiles in the training dataset. We observed that the distribution of the training set adequately covered the generated set, which indicates that while the dienophile components generated by the model are novel, they also satisfy the features of the reactants of Diels–Alder reactions. A similar observation was made for the diene components (Fig. 5(B)). These results indicate that the generated reactions follow the distribution of the features of the training dataset.
The generated reactions were further analyzed at the level of chemical transformation. Table 3 summarizes the numbers and proportions of chemically feasible, unique, and novel reactions among the 10000 generated valid reactions. Table 4 shows the proportion of reactions that conform to the regioselectivity and stereospecificity of the Diels–Alder reaction. A valid reaction is one in which the products and reactants are chemically valid. A chemically feasible reaction conforms to a particular reaction mechanism, which in this study is that of the Diels–Alder reaction. The chemical feasibility of the model was computed as the ratio of the number of chemically feasible reactions, screened using RDKit templates, to the number of valid reactions in the generated set. Uniqueness is the proportion of chemically feasible reactions that remain after duplicates are excluded, and novelty is the proportion of unique reactions that are not present in the training dataset. The chemical feasibility, uniqueness, and novelty of the model were estimated to be 50.4%, 40.6%, and 21.4%, respectively, which indicates that the model exhibits moderate novelty. We believe this is due to the limited chemical space covered by the small training set. Given that pre-training and data augmentation can improve training on small datasets, these methods can be explored in future work to overcome this issue. In addition, of the 10000 reactions generated with MaskGAN, 438 were chemically meaningful and novel, which corresponds to a success rate of 4.38%. Wang et al. described a transformer-based reaction generation strategy with a success rate of 2.86% after the same data processing; our method therefore improves on it by approximately 1.5 percentage points.11
Total generated valid reactions | Chemically feasible reactions (amount) | Chemically feasible reactions (rate) | Unique reactions (amount) | Unique reactions (rate) | Novel reactions (amount) | Novel reactions (rate) |
---|---|---|---|---|---|---|
10000 | 5042 | 50.4% | 2047 | 40.6% | 438 | 21.4% |
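The rates in Table 3 follow from the counts by simple division, with each rate taken relative to the preceding filtering stage. The short calculation below reproduces them together with the overall success rate; the function and variable names are illustrative, and the 2.86% baseline is the value quoted in the text for the transformer-based method.

```python
def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


valid = 10000     # generated valid reactions
feasible = 5042   # reactions passing the Diels-Alder template screen
unique = 2047     # feasible reactions left after removing duplicates
novel = 438       # unique reactions absent from the training set

feasibility = percent(feasible, valid)   # relative to valid reactions
uniqueness = percent(unique, feasible)   # relative to feasible reactions
novelty = percent(novel, unique)         # relative to unique reactions
success_rate = percent(novel, valid)     # relative to everything generated

baseline = 2.86                          # success rate quoted for the transformer
improvement = success_rate - baseline    # difference in percentage points

print(f"feasibility  {feasibility:.1f}%")      # 50.4%
print(f"uniqueness   {uniqueness:.1f}%")       # 40.6%
print(f"novelty      {novelty:.1f}%")          # 21.4%
print(f"success rate {success_rate:.2f}%")     # 4.38%
print(f"improvement  {improvement:.2f} points")  # 1.52 points
```

Note that the three stage-wise rates use different denominators, so they cannot be compared directly with the 4.38% overall success rate, which is measured against all 10000 generated reactions.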
Rule | Rate |
---|---|
Regioselectivity | 100% |
Stereospecificity | 76.6% |
As shown in Table 4, the model generated reactions with an ortho or para product in 100% of cases, which is consistent with the known regioselectivity of Diels–Alder reactions and indicates that our model learned this regioselectivity. Regioselectivity refers to the preference of a reagent to react at one position over another. Diels–Alder reactions are highly regioselective, as one of the carbon–carbon double bonds in a diene is more likely to be attacked by a dienophile at its electrophilic site rather than at its electron-donating group. As shown in Fig. 6(A) and (B), the presence of the electron-donating group at one end of the diene results in the other end being more electrophilic, resulting in a higher preference for dienophile components to attack the electrophilic site. When the electron-donating group is located at an end of the diene, the dienophile tends to attack the other end of the diene to produce ortho products, whereas when the electron-donating group is in the middle of the diene (Fig. 6(C) and (D)), the dienophile attacks the carbon–carbon double bond bearing the electron-donating group to produce para products. Therefore, the Diels–Alder reaction is a cycloaddition reaction that proceeds through an aromatic transition state and is ortho and para directing, and the reactions generated by our model conform to this mechanism.
Stereoselectivity refers to the preferential formation of one stereoisomeric product over another. Product formation in the Diels–Alder reaction follows the endo rule, in which the electron-withdrawing group of the dienophile and the newly formed carbon–carbon double bond in the middle of the old diene tend to be on the same side, forming an endo product. The bonding interaction between the electron-withdrawing group of the dienophile and the π bond formed at the back of the diene results in an increased rate of endo product formation. In irreversible Diels–Alder reactions, endo products are preferred as kinetic products, whereas in reversible Diels–Alder reactions, exo products are formed instead because they are more stable than endo products owing to their lower steric hindrance. As only irreversible Diels–Alder reactions were selected for the training set, the majority of the generated reactions contain only endo products. The stereoselectivity of one of the novel reactions generated by our model is depicted in Fig. 7(A). The asymmetric dienophile reacts with the cyclic diene to give a product in which the carbonyl groups of the dienophile and the newly formed double bond in the middle of the old diene lie on the same side, with the hydrogen atoms above the generated ring. This product is an endo product, which conforms to the stereoselectivity rule of Diels–Alder reactions and indicates that our model learned the stereoselectivity of Diels–Alder reactions.
Fig. 7 Analysis of the generated reactions. (A) Stereoselectivity of the generated reactions. (B) Stereospecificity of the generated reactions.
On further analysis of the generated reactions, we observed that the structure of the products mostly depended on the structure of the reactants, which indicates the stereospecificity of the generated Diels–Alder reactions. As shown in Fig. 7(B)(a), the product retains a cis configuration when a dienophile with a cis configuration reacts with the diene, and similarly, the trans configuration is retained from the dienophile reactant to the product (Fig. 7(B)(b)). In the transition state of the reaction of a dienophile with a trans configuration, one of the functional groups is tucked under the diene and then appears underneath the ring when the product molecule is formed, reproducing the trans configuration. The configuration of the diene component in Diels–Alder reactions also exhibits a significant influence on the configuration of the products. Fig. 7(B)(c) shows that when both carbon–carbon double bonds of the diene are in the cis configuration, the two hydrogen atoms lie below the newly formed six-membered ring, whereas when the two carbon–carbon double bonds are in the trans configuration, the functional groups of the diene lie outside the newly formed six-membered ring (Fig. 7(B)(d)). Therefore, the products of the generated reactions effectively reproduced the stereochemistry of the dienophiles and, as shown in Table 4, 76.6% of the reactions with products exhibiting cis/trans isomerism were stereospecific, which indicates that the generated Diels–Alder reactions are mostly stereospecific.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2ra06022a
‡ Sheng Li and Xinqiao Wang contributed equally to this work.
This journal is © The Royal Society of Chemistry 2022