Hongliang Duan*a,
Ling Wanga,
Chengyun Zhanga,
Lin Guob and
Jianjun Li*a
aArtificial Intelligent Aided Drug Discovery Lab, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, P. R. of China. E-mail: hduan@zjut.edu.cn; lijianjun@zjut.edu.cn
bDepartment of Pharmacy, The Affiliated Hospital of Xuzhou Medical University, Jiangsu Key Laboratory of New Drug Research and Clinical Pharmacy, Xuzhou Medical University, Xuzhou, Jiangsu 221000, P. R. of China
First published on 8th January 2020
We consider retrosynthesis to be a machine translation problem. Accordingly, we apply an attention-based and completely data-driven model named Tensor2Tensor to a data set comprising approximately 50000 diverse reactions extracted from the United States patent literature. The model significantly outperforms the seq2seq model (37.4%), with top-1 accuracy reaching 54.1%. We also offer a novel insight into the causes of grammatically invalid SMILES, and conduct a test in which experienced chemists select and analyze the “wrong” predictions that may be chemically plausible but differ from the ground truth. The effectiveness of our model is found to be underestimated and the “true” top-1 accuracy reaches as high as 64.6%.
Over the past few decades, various methods involving novel and emerging computing techniques have been developed to perform retrosynthetic analysis.4–6 Since the 1960s, computers have been used to store chemical structure data and to apply chemical structural information to synthesis planning and drug discovery.7 The first retrosynthesis program, created by Corey and Wipke, introduced computer assistance to chemical synthesis. This program was called Logic and Heuristics Applied to Synthetic Analysis (LHASA).8 They pioneered the use of expertly crafted rules, which are commonly referred to as reaction templates.
Another approach to retrosynthesis is the Knowledge base-Oriented system for Synthesis Planning (KOSP).9 The system was built on a knowledge base in which reactions were abstracted according to the structural characteristics of reaction sites and their environments. Generally, computer-assisted retrosynthetic analysis is performed by exploiting reaction rules, which encode a series of small transformations that characterize chemical reactions. These rules can either be laboriously encoded by chemical experts or extracted from various sources of chemical digital data.10–18 A major advantage of such rules is that they are directly interpretable. However, rule-based methods suffer from several drawbacks. First, since there is currently no comprehensive rule system covering all fields of chemistry, rule-based systems cannot propose syntheses for compounds that require methods outside the current knowledge base. In addition, the rules need to be coded and curated, which is prohibitively expensive and time consuming.
Deep learning (DL) is a class of machine learning algorithms that use artificial neural networks (ANNs) with multiple layers of nonlinear processing units to learn data representations.19 Since the earliest ANN was proposed in 1943, significant improvements have been made, particularly between the 1960s and 1980s.20 Moreover, recent advances in DL for computer games and self-driving cars have demonstrated its wide-ranging potential applications.21
Given the increased availability of a wide variety of digital data and algorithms, DL represents a valuable resource for managing reaction data for retrosynthetic analysis. Recently, data-driven approaches have been employed to circumvent the restrictions of rule-based systems. For example, molecules can be equivalently represented as text sequences, as demonstrated with the simplified molecular-input line-entry system (SMILES).22 From a linguistic perspective, this system can be regarded as a language, and a chemical reaction can be treated as a translation task. Nam and Kim were the first to apply the sequence-to-sequence (seq2seq) model, a neural machine translation (NMT) model, to reaction prediction: they mapped the SMILES representations of reactants to the SMILES representations of products.23 Subsequently, Schwaller et al. further built on the idea of relating reaction prediction to a language and explored the potential of the seq2seq NMT approach.24
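Before a SMILES string can be fed to a translation model, it must be split into tokens. As a minimal sketch, the simplified tokenizer below illustrates the atom-wise tokenization used in this line of work; the regular expression is a reduced illustration and omits stereo-bond symbols, wildcards, and two-digit ring closures that a production tokenizer would cover.

```python
import re

# Simplified atom-wise SMILES tokenizer (illustrative only; a full tokenizer
# also covers stereo bonds, wildcards, and %-prefixed ring closures).
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|\(|\)|\.|=|#|[+-]|[0-9])"
)

def tokenize(smiles):
    """Split a SMILES string into tokens for a translation model."""
    tokens = TOKEN_RE.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "tokenizer dropped characters"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

The token sequence, rather than raw characters, then serves as the "sentence" on the source or target side of the translation task.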
Given that retrosynthesis is the reverse of reaction prediction, the hypothesis to be tested is that the seq2seq model could treat retrosynthesis as reaction prediction run in reverse. Liu et al. tested this hypothesis by formulating retrosynthesis as a translation task for the seq2seq model.25 This was achieved by supplying the SMILES of a target compound as input and receiving the SMILES of the reactants as output. With this approach, a top-1 accuracy of 37.4% was achieved, with top-3 and top-10 accuracies of 52.4% and 61.7%, respectively. Thus, the approach performed comparably to or worse than a rule-based expert baseline model. Meanwhile, more than 10% of the top-1 predicted SMILES were grammatically invalid. These limitations constrain the applicability of this program to retrosynthetic reaction prediction.
Herein, we present an attention-based NMT model, the Tensor2Tensor (T2T) model, which performs strongly on machine translation tasks while being more parallelizable and requiring significantly less time to train.26 Similar approaches have recently been suggested.27,28 In this paper, we focus on a central challenge of retrosynthesis: “Given a target product, what are the most likely reactants?” Applied to retrosynthesis on a common benchmark data set, the T2T model achieves a higher top-1 accuracy (54.1%) than previous work,25 with only 3% of the top-1 predicted reactant SMILES being invalid. Diverse parameters such as batch size and training time are investigated during training. We find that the batch size should be set as high as possible while keeping a margin to avoid out-of-memory errors; meanwhile, extending the training time can yield better performance. In addition, we analyse incorrect SMILES results and find that two factors, the complexity of the chemical structure and a lack of training data, may lead to failures in the text representation. Finally, we conduct a test in which ten experienced chemists pick out and analyse “wrong” predictions that may be chemically plausible yet inconsistent with the ground truth.
The defining architectural characteristic of the T2T model is that it relies entirely on attention mechanisms. As a new generation of encoder–decoder neural network, the T2T model comprises feed-forward and multi-head self-attention layers, avoiding complicated recurrent or convolutional neural networks. The model uses queries (Q), representing what the data asks for, keys (K), indexing the stored knowledge, and values (V), holding the content associated with the queries and keys; the corresponding projection matrices are learned during training. To obtain the queries (q), keys (k), and values (v) for a current batch, the T2T model multiplies the input (X) by the Q, K, and V projection matrices. With these computed quantities, the inputs are transformed through the encoding and decoding components.26
As depicted in Fig. 2, the main components of the model are the encoder and decoder stacks. The encoder is composed of several identical layers, each containing two sub-layers: the first is a multi-head self-attention mechanism and the second is a feed-forward network layer. A residual connection34 is applied around each of the two sub-layers, followed by layer normalization.33 The decoder likewise consists of identical layers, but each layer comprises three sub-layers. In addition to the two sub-layers mentioned above, there is a third sub-layer, a masked multi-head self-attention mechanism, and residual connections are employed around each of the sub-layers, as in the encoder.
Remarkably, the multi-head attention mechanism, which consists of parallel attention layers, is an innovative part of the T2T model. After linearly projecting the queries, keys, and values, the attention function is performed in parallel to obtain several versions of the output values, which are then concatenated and projected once more to obtain the final values. Hence, a model with several sets of independent attention parameters outperforms a model with a single attention function. A multi-head attention layer is made up of several scaled dot-product attention layers. Each takes as input queries and keys of dimension dk and values of dimension dv. The dot products of the queries with all keys are computed to measure the alignment between queries and keys; the results are divided by √dk and passed through a softmax to obtain the weights of the values. In practice, the queries, keys, and values are packed together into matrices so that a matrix of outputs can be computed.
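The scaled dot-product and multi-head attention computations described above can be sketched in a few lines of NumPy. This is an illustrative single-sequence sketch, not the T2T implementation: the head splitting is done by slicing one set of projection matrices, and the random weights stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # query-key alignment
    return softmax(scores) @ V                      # weighted sum of values

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Project X into per-head q, k, v, attend in parallel, concat, re-project."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    q, k, v = X @ W_q, X @ W_k, X @ W_v             # (seq_len, d_model) each
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        heads.append(scaled_dot_product_attention(q[:, s], k[:, s], v[:, s]))
    return np.concatenate(heads, axis=-1) @ W_o     # final linear projection

rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 16, 5, 4
X = rng.normal(size=(seq_len, d_model))
W = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(X, *W, n_heads)
print(out.shape)  # (5, 16)
```

Note that the output keeps the input's sequence length and model dimension, which is what allows the sub-layers to be stacked with residual connections.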
Because the T2T model attends equally to all elements of a sequence regardless of the distance between tokens, information about the relative or absolute positions of tokens in the sequence may be lost. A positional encoding matrix is introduced to solve this problem: each element is given by the value of a trigonometric function that depends on its position in the sequence and on the embedding dimension. The positional encoding, added to the learned embedding, connects distant parts of the inputs and allows the model to make use of the order of the sequence.
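The sinusoidal positional encoding described above can be written out directly; this minimal sketch follows the standard formulation, with sines in the even embedding dimensions and cosines in the odd ones.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: PE[pos, 2i]   = sin(pos / 10000^(2i/d_model)),
                            PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions
    pe[:, 1::2] = np.cos(angle)                # odd dimensions
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # position 0: sin(0)=0, cos(0)=1 -> [0. 1. 0. 1.]
```

In practice this matrix is simply added to the token embeddings before the first encoder layer, so no extra parameters need to be learned for position information.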
The seq2seq model is based on TensorFlow (version 1.01), with all scripts written in Python 3.5, while the T2T model is built with TensorFlow (version 1.11.0),37 with all scripts written in Python 2.7. The open-source cheminformatics toolkit RDKit38 (version 2019.03.10) is used to analyse the reaction data.
The performance of our model on different reaction types is also explored, and we compare the results to the baseline model. For all of the reaction types examined, our model performs significantly better (Fig. 4). Taking reaction class 3 as an example, our model achieves an accuracy of 58.0% for C–C bond formation reactions, 11.9 percentage points higher than the 46.1% achieved by the baseline model.
It is worth noting that reaction class 4, which includes the formation of cyclic structures, usually involves large differences between reactants and products (Fig. 5). Thus, even for an experienced chemist, deciding on the proper bond disconnection for a ring is a challenging aspect of retrosynthesis. The T2T model achieves an accuracy of 54.4% for reaction class 4, which markedly exceeds the accuracy achieved by the baseline model (27.8%). However, this result is not as good as those for the other reaction types, which we attribute to the complexity of cyclic structures.
Fig. 5 A representative example of a recorded versus wrongly predicted ring-formation reaction from the T2T model.
After 10 h of training, 43.4% accuracy is achieved with a batch size of 512, and 49.2% with a batch size of 2048 (Fig. 6). Thus, larger batch sizes appear to perform better than smaller ones. However, the accuracy no longer increases significantly once the batch size exceeds 2048: after 10 h of training, a batch size of 4096 yields 50.1% accuracy compared with 50.6% for a batch size of 6144, with no substantial difference between the two.
Fig. 6 The accuracies of the T2T model with different batch sizes. Note that all models are trained and tested on a 1080Ti GPU.
A possible explanation of this phenomenon is that training throughput, i.e. the number of training examples processed per unit time, markedly affects the performance of the T2T model at high batch sizes. As the batch size grows, the computation per step slows, so the training throughput, which equals the batch size multiplied by the computation speed, improves only slightly.39 Consequently, the predictive capability does not greatly increase once the batch size exceeds a certain level, because the throughput is only mildly higher. Note that setting the batch size too high may cause out-of-memory errors, while setting it too low leads to notably low accuracies. Thus, choosing a proper batch size is greatly advantageous.
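The throughput argument above can be made concrete with a small calculation. The steps-per-second figures below are hypothetical placeholders, not measured values from this study; they merely illustrate how doubling the batch size can yield only a marginal gain in examples processed per second once per-step computation slows down.

```python
# Hypothetical steps/second at each batch size (assumed numbers for
# illustration only): throughput = batch size x steps per second.
configs = {512: 9.0, 2048: 2.6, 4096: 1.4, 8192: 0.75}

for batch, steps_per_s in configs.items():
    throughput = batch * steps_per_s
    print(f"batch {batch:>5}: {throughput:7.0f} examples/s")
```

With these assumed rates, going from a batch size of 4096 to 8192 increases throughput by only about 7%, consistent with the plateau in accuracy observed beyond a batch size of 2048.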
As illustrated in Table 1, a batch size of 8192 achieves an accuracy of 51.8% after 10 h of training, which improves to 53.0% after 5 days of training. Taken together, these results indicate that our model can achieve marginally better results with a longer training time. We conduct an additional experiment in which checkpoints are averaged to improve accuracy. By averaging the last 20 checkpoints, saved at 2000-step intervals, and training the model for 5 days, we achieve an accuracy of 54.1% with a batch size of 8192. Compared to previous work, the use of checkpoint averaging in the present study yields a higher probability of correct predictions. Thus, we advise applying checkpoint averaging to our model.
Batch size | Training time | Accuracy (%) |
---|---|---|
512 | 10 h | 43.4 |
1024 | 10 h | 45.7 |
2048 | 10 h | 49.2 |
4096 | 10 h | 50.1 |
6144 | 10 h | 50.6 |
8192 | 10 h | 51.8 |
8192 | 5 d | 53.0 |
8192 (avg.) | 5 d | 54.1 |
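The checkpoint-averaging procedure used above amounts to taking the element-wise mean of each weight tensor across the saved checkpoints. As a minimal sketch (T2T ships its own utility for this; the dictionary-of-arrays representation here is a simplification of a real checkpoint format):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Average each named weight tensor across a list of checkpoint dicts.
    A simplified stand-in for a framework's checkpoint-averaging utility."""
    names = checkpoints[0].keys()
    return {n: np.mean([ckpt[n] for ckpt in checkpoints], axis=0)
            for n in names}

# Toy example: three "checkpoints" of a single 2x2 weight matrix.
ckpts = [{"w": np.full((2, 2), float(v))} for v in (1, 2, 3)]
avg = average_checkpoints(ckpts)
print(avg["w"])  # every entry is 2.0
```

Averaging smooths out step-to-step noise in the parameters near the end of training, which is why it tends to give a small but consistent accuracy gain.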
All compounds for which the seq2seq and T2T models predicted grammatically invalid SMILES are analysed. Two key factors that may cause the models to predict invalid text representations are the complexity of the chemical structure and a lack of training data.
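In this work the grammatical validity of a predicted SMILES is decided by a cheminformatics toolkit (RDKit's parser). As an illustration of the kinds of syntactic errors involved, the crude check below tests only two surface properties, balanced parentheses and paired ring-closure digits; it is a deliberately simplified sketch, not a real SMILES parser, and would miss valence or aromaticity errors that a toolkit catches.

```python
def crude_smiles_check(smiles):
    """Crude syntactic check: balanced parentheses and paired ring-closure
    digits. NOT a full parser; in practice a toolkit such as RDKit decides
    validity. Digits inside [...] bracket atoms (e.g. [NH2+]) are skipped."""
    depth = 0
    ring_digits = {}
    in_brackets = False
    for ch in smiles:
        if ch == "[":
            in_brackets = True
        elif ch == "]":
            in_brackets = False
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing parenthesis with no opener
                return False
        elif ch.isdigit() and not in_brackets:
            ring_digits[ch] = ring_digits.get(ch, 0) + 1
    # All parentheses closed and every ring-closure digit used an even
    # number of times (opened and closed).
    return depth == 0 and all(n % 2 == 0 for n in ring_digits.values())

print(crude_smiles_check("c1ccccc1O"))   # True  (phenol)
print(crude_smiles_check("c1ccccc1O)"))  # False (unbalanced parenthesis)
```

Unpaired ring-closure digits are exactly the failure mode one expects for the complex polycyclic, spirocyclic, and bridged structures discussed below, where many ring-bond openings must be matched across long spans of the string.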
When complicated cyclic compounds such as polycyclic, spirocyclic, and bridged hydrocarbons appear in a retrosynthetic analysis, the seq2seq and T2T models often output invalid SMILES (Fig. 7). A key feature of these cyclic compounds is their intricate ring systems; correspondingly, even naming these compounds is challenging for chemists. Taking spirocyclic hydrocarbons as an example, systematic rule-based naming is rather difficult owing to their complexity. Moreover, a lack of relevant reaction examples can also lead to wrongly predicted reactant SMILES.
In addition, quaternary carbon structures strongly influence the performance of the seq2seq model. While predicting valid reactant SMILES for molecules containing Boc, CF3, and tBu groups is difficult for the seq2seq model, it is a trivial problem for the T2T model.
As shown in Table 2, there are 71 predictions containing the structures mentioned above, accounting for 11.6% of the total invalid SMILES strings produced by the seq2seq model. In contrast, the T2T model generally predicts this class of compounds correctly. Fig. 8 shows representative examples for which the T2T model correctly predicts the reactant SMILES whereas the seq2seq model fails.
Structures | Count | Rate (%) |
---|---|---|
R-Boc | 38 | 6.2 |
R-CF3 | 18 | 2.9 |
R-tBu | 15 | 2.5 |
Total | 71 | 11.6 |
Fig. 8 Characteristic examples of compounds containing Boc, CF3, and tBu groups for which the T2T model is able to predict valid reactant SMILES while the seq2seq model fails.
The simplest method for synthesizing an acid involves hydrolysis of a carboxylic ester. For example, both benzoic acid methyl ester and benzoic acid phenylmethyl ester can form benzoic acid via hydrolysis.43 As shown in Table 3 (a), the recorded outcome shows that the target compound can be synthesized by hydrolysis of the corresponding methyl ester, while the prediction, which employs hydrolysis of the corresponding benzyl ester, is consistent with the method of Leggio et al.44 (Fig. 9a).
Oxidation is a large class of chemical reactions in which the atoms of an element lose electrons. Table 3 (b) illustrates the retrosynthetic analysis of a 2-(trifluoromethyl)pyridine that is a key intermediate of a BRaf inhibitor. A number of oxidants containing electrophilic oxygen atoms can react with nucleophilic pyridines to produce the target compound. The most commonly used oxidizing agents, meta-chloroperoxybenzoic acid (mCPBA)45 and H2O2, are effectively interchangeable in this reaction. The prediction, which uses mCPBA as the oxidizing agent, is chemically plausible, although it is absent from our recorded reaction examples. Aquila et al.46 have reported an approach for obtaining the target compound with mCPBA (Fig. 9b), an important contribution to the synthesis of the BRaf inhibitor.
During the preparation of complex organic molecules, there are often stages at which one or more protecting groups are introduced to keep functional groups from reacting; these protecting groups are usually not removed until the reaction is complete. The first stage of such a sequence is protection, and the second is deprotection. Synthesizing the target molecule is a key step toward the intermediate of a TORC1/2 inhibitor. Table 3 (c) describes a reaction example in which the precursor is deprotected to generate the target compound. Besides the benzyl group in the recorded outcome, alternative protecting groups such as Boc, Cbz, and Fmoc can be applied to shield the amino group from powerful electrophiles. The prediction, which uses a Boc protecting group, is not captured in our recorded example and is therefore mistakenly counted as a wrong answer. However, this approach has been confirmed to be chemically plausible by Hicks et al.47 (Fig. 9c).
Metal-catalyzed cross-coupling reactions, including Stille and Suzuki couplings, are commonly used to form C–C bonds, and consequently play a significant role in organic synthesis. For instance, the recorded outcome shows that the target compound can be formed by Stille coupling between trimethyl(4-methylphenyl)stannane and 1-acetyl-2-bromobenzene (Table 3 (d)). Stille coupling is a versatile C–C bond-forming reaction between stannanes and halides, with very few limitations on the R-groups. Meanwhile, the predicted outcome shows that the target compound can be synthesized via a Suzuki reaction between 1-bromo-4-methylbenzene and (2-acetylphenyl)boronic acid. This is also the route chosen by Laha et al.48 (Fig. 9d). These two methods are largely interchangeable and have profoundly influenced protocols for the construction of chemical molecules.
A universal method for generating ethers is to treat alkoxide anions with halohydrocarbons, including chloroalkanes, bromoalkanes, and iodoalkanes. Remarkably, the target molecule is a pivotal intermediate of bazedoxifene acetate, a drug used for the treatment or prevention of osteoporosis in postmenopausal women. Since the two halohydrocarbons are interchangeable in an SN2 reaction, there is no intrinsic difference between the recorded reaction and the predicted reaction. The ground truth shows a simple SN2 etherification in which 1-(4-hydroxyphenyl)ethanone reacts with 2-chloroacetonitrile to form the target compound, and the prediction displays a similar reaction that matches the published route49 (Table 3 (e) and Fig. 9e).
Due to space constraints, we do not describe all of the chemically plausible reaction types in detail; moreover, statistical summaries are more intuitive than explaining the reaction examples individually. The counts of the various reaction types are given in Table 4. It is worth noting that hydrolysis of varied carboxylic esters to acids represents the largest of the seven reaction types. Two additional types, condensation between carboxylic acids and amines and reduction of carbonyl compounds to alcohols, also warrant mention, with percentages of 2.3% and 1.4%, respectively.
Chemically plausible reaction type | Count | Rate (%) |
---|---|---|
Oxidation by different oxidants | 11 | 0.2 |
Protection with diverse protecting groups | 50 | 1.0 |
Hydrolysis from varied carboxylic esters to acids | 127 | 2.5 |
C–C bond formation via cross-coupling reactions | 36 | 0.7 |
SN2 between alkoxides/amines and alkyl halides | 123 | 2.4 |
Reductions from carbonyl compounds to alcohols | 69 | 1.4 |
Condensation between carboxylic acids and amines | 120 | 2.3 |
Total | 536 | 10.5 |
In conclusion, the T2T model is a valuable resource for predicting reactions and performing retrosynthesis. However, the T2T model was originally designed for language translation and is not fully adapted to the task of retrosynthetic reaction prediction. We believe that in the future, slight adjustments to the model architecture will better address the retrosynthesis problem. While few researchers are currently engaged in this area, we anticipate a dramatic increase in the coming years as the practical challenges are addressed.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c9ra08535a |
This journal is © The Royal Society of Chemistry 2020 |