Paula
Torren-Peraire‡
*ab,
Alan Kai
Hassen‡
*cd,
Samuel
Genheden
e,
Jonas
Verhoeven
b,
Djork-Arné
Clevert
d,
Mike
Preuss
c and
Igor V.
Tetko
a
aInstitute of Structural Biology, Molecular Targets and Therapeutics Center, Helmholtz Munich – Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH), Neuherberg, Germany. E-mail: ptorrenp@its.jnj.com
bIn-Silico Discovery, Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium
cLeiden Institute of Advanced Computer Science, Leiden University, Leiden, The Netherlands. E-mail: a.k.hassen@liacs.leidenuniv.nl
dMachine Learning Research, Pfizer Research and Development, Berlin, Germany
eMolecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
First published on 14th February 2024
Retrosynthesis consists of breaking down a chemical compound recursively step-by-step into molecular precursors until a set of commercially available molecules is found with the goal to provide a synthesis route. Its two primary research directions, single-step retrosynthesis prediction, which models the chemical reaction logic, and multi-step synthesis planning, which tries to find the correct sequence of reactions, are inherently intertwined. Still, this connection is not reflected in contemporary research. In this work, we combine these two major research directions by applying multiple single-step retrosynthesis models within multi-step synthesis planning and analyzing their impact using public and proprietary reaction data. We find a disconnection between high single-step performance and potential route-finding success, suggesting that single-step models must be evaluated within synthesis planning in the future. Furthermore, we show that the commonly used single-step retrosynthesis benchmark dataset USPTO-50k is insufficient as this evaluation task does not represent model scalability or performance on larger and more diverse datasets. For multi-step synthesis planning, we show that the choice of the single-step model can improve the overall success rate of synthesis planning by up to +28% compared to the commonly used baseline model. Finally, we show that each single-step model finds unique synthesis routes, and differs in aspects such as route-finding success, the number of found synthesis routes, and chemical validity.
In recent years, retrosynthesis prediction methods have seen an increase in popularity with the alignment of retrosynthesis with modern machine learning approaches7 which allow users and developers to consider a larger set of potential synthesis routes. The machine learning field of retrosynthesis prediction is commonly separated into two research fields, referred to as single-step retrosynthesis prediction and multi-step synthesis planning. Where single-step retrosynthesis prediction refers to breaking down a product into a single set of reactants and multi-step synthesis planning refers to the search algorithms used to find synthesis routes leading to purchasable compounds (building blocks).
Specifically, single-step retrosynthesis prediction is a supervised learning task, developed to predict which reactions are relevant to a target molecule, and the corresponding reactants required to produce this reaction. There are two commonly referenced categories of single-step retrosynthesis models, template-based and template-free.7 Template-based methods use reaction templates, an abstraction of the reactions in the data, which summarize the underlying pattern of these reactions. There are different approaches to extracting templates, though in all cases these processes aim to represent the atom and bond structures required to perform a reaction,8 where a single template will represent multiple reactions. Template-based methods consider single-step prediction as a classification problem where the task is to predict the appropriate template for the target molecule/product. Examples of template-based methods include NeuralSym,9 the first approach in the field which demonstrated the usefulness of using deep neural networks for retrosynthesis prediction, MHNreact,10 which uses an information retrieval approach to associate products and templates and LocalRetro,11 which uses a graph representation to predict relevant local atom and bond templates for the product.
On the other hand, template-free approaches commonly treat retrosynthesis prediction as a sequence-to-sequence prediction problem,8 employing methods seen in natural language processing such as language translation tasks. Instead of extracting and predicting the corresponding templates, the approach aims to learn the underlying reactions to directly predict reactants. Product and reactants are typically introduced as Simplified Molecular-Input Line-Entry System (SMILES), a common text-based representation of chemical entities. Examples of template-free methods include Chemformer,12 a large pretrained transformer model fine-tuned on the retrosynthesis task, and Augmented Transformer,13 a transformer architecture which employs multiple types of augmentation. Other variations of these approaches exist, such as semi-template-based where the molecule is first broken down into subparts then completed to produce chemically viable reactants.14–16
Multi-step synthesis planning focuses on researching novel synthesis route search algorithms using a single-step model to identify retrosynthetic disconnections. Though multiple methods were developed over the decades to address computer-aided synthesis planning,17 the pioneering machine-learning-based approach in the field uses Monte Carlo Tree Search (MCTS) to plan the traversal of the search tree at runtime guided by a neural network.2 Alternative route planning algorithms use an oracle function or heuristics to guide the tree search instead of relying on compute-expensive run time planning. Prominent examples of this are Depth-First Proof-Number (DFPN),18 which combines classical DFPN with a neural heuristic, Retro*, which combines A* pathfinding with a neural heuristic,19 or RetroGraph, which applies a holistic graph-based approach.20 Other approaches incorporate reaction feasibility into the tree search21 or use synthesizability heuristics in combination with a forward synthesis model.22,23 Finally, self-play approaches, motivated by their success in Go,24 learn to guide the tree search by leveraging information gathered from prior runs of synthesis planning.25–27
Single-step retrosynthesis prediction and multi-step synthesis planning are inherently intertwined where the single-step method defines the maximum searchable reaction network, and the search algorithm tries to efficiently traverse this network by repeatedly applying the chemical information that is stored in the single-step model. However, this connection is not reflected in contemporary research, with only few novel single-step models testing their approaches within a multi-step synthesis planning framework.22,28
Currently, single-step methods are benchmarked by predicting a single retrosynthetic step from a product to reactants. The common benchmark data for these methods, USPTO-50k,29,30 consists of around 50k reactions and only has a limited diversity of 10 reaction classes. These methods are typically only tested on reactant prediction and not within multi-step search algorithms, therefore their usability for synthesis planning is not assessed. Similarly, multi-step search algorithms benchmark the route-finding capabilities of their method using a single single-step model, often based on the template-based NeuralSym model,2,18–20,27 and evaluate the success rate of finding potential synthesis routes for molecules of interest. However, multi-step approaches do not consider the impact of alternative single-step models, a vital aspect of the search, as the route planning algorithm uses the reaction information stored in the single-step model to find synthesis routes and create alternate reaction pathways within the reaction network.
The current question remains whether contemporary single-step retrosynthesis methods are transferable to the multi-step synthesis planning domain, and their impact on multi-step synthesis planning.31,32 In this work, we address the transfer between single-step and multi-step methods by incorporating different contemporary single-step models within a common multi-step search algorithm to analyze the use of these models for multi-step synthesis planning. We explore the effect on performance, analyzing the relationship between contemporary single-step and multi-step performance metrics using both public and proprietary datasets of varying size and diversity. Moreover, we also focus on vital aspects such as model suitability and chemical validity of the predicted routes.
Task | Metric | Description |
---|---|---|
Retrosynthesis | Top-N accuracy | Percentage of compounds for which the ground-truth reactants are predicted within the top-n |
Multi-step synthesis planning | Success rate | Percentage of compounds where at least one solved synthesis route is produced |
Number of solved routes | Average number of unique solved synthesis routes produced per molecule | |
Search times | Average search time per molecule | |
Single-step model calls | Average number of single-step model calls per molecule | |
Route accuracy | Percentage of compounds where the gold-standard route is predicted within the top-n synthesis routes | |
Building block accuracy | Percentage of compounds where the gold-standard building blocks are predicted within the top-n synthesis routes |
Within success rate, we measure the percentage of molecules for which the route planning algorithm can successfully return at least one solved synthesis route leading from a molecule to building blocks. This condition is required for synthesis routes since a chemist can only consider routes as a suggestion for experimental evaluation if a complete synthesis route is found. Moreover, we analyze the number of solved routes since not only is it interesting to identify if there is a possible synthesis route for a molecule, but also how many alternatives are produced, given that different synthesis routes have different route properties.
Nevertheless, algorithmic success does not measure if a found synthesis route is chemically valid, but only if a route into building blocks is found. Route accuracy is used to measure the chemical validity of synthesis routes as predicted routes can be compared to published, experimentally tested gold-standard routes.33 Naturally, a route planning algorithm should be able to recover the gold-standard routes within the set of predicted, solved synthesis routes. This task is inherently more complex than producing solved routes (success rate) since it requires a sequence of multiple reactions and their intermediates to be correctly predicted and in the correct order. Additionally, we calculate whether there is an exact match between the predicted building blocks and the gold-standard building blocks. Building block accuracy differs from route accuracy since the route reactions and intermediates are not considered. In all cases it must be noted that a gold standard route is only one possible way of synthesizing a target molecule.
Lastly, we consider search times and single-step model calls. Ideally, synthesis planning algorithms should produce routes in a timely manner to reduce allocated computational resources. However, different single-step models can have different inference speeds, and the time required for a search can massively diverge.32 Consequently, the average search time for a molecule with a fixed number of single-step model calls, is measured. Additionally, we report the number of single-step model calls since, in some cases, the method may not reach the maximum iteration limit in the maximum search time. Noteworthy, the maximum search time can be exceeded if the last search iteration is started before the time limit is reached.
Task | Dataset | Description |
---|---|---|
Single-step retrosynthesis training | USPTO-50k30 | Default single-step retrosynthesis benchmark dataset |
USPTO-PaRoutes-1M33 | Largest publicly available single-step retrosynthesis dataset | |
AZ-1M34 | 1M reaction subsample of internal AstraZeneca reactions | |
AZ-18M34 | Dataset based on internal AstraZeneca reactions | |
Multi-step synthesis planning evaluation | Caspyrus10k | 10000 clustered bio-active molecules from papyrus35 |
PaRoutes33 | Collection of 10000 gold-standard synthesis routes extracted from patents |
Additionally, we use two datasets based on the proprietary AstraZeneca dataset.34,36 The first, AZ-18M, is the complete cleaned dataset from AstraZeneca, which includes Reaxys,37 Pistachio (a superset of USPTO-PaRoutes-1M),38 and AstraZeneca Electronic Laboratory Notebooks (ELN) data. This dataset contains 18697432 single-step reactions.34 Moreover, to obtain a dataset representative of AZ-18M with a comparable size to USPTO-PaRoutes-1M, we randomly subsample 1M reactions from AZ-18M to produce AZ-1M.
To evaluate single-step models, we split all reaction datasets into random 80% training, 10% validation, and 10% test hold-out splits. In the case of USPTO-PaRoutes-1M, to replicate the original data split size,33 the hold-out split ratio is 90% training, 5% validation, and 5% test. We defer from using the original hold-out splits since they are based on template stratification. For AZ-18M, we randomly subsample 100k molecules from the complete test set of 1.8 million reactions to avoid excessive evaluation computation.
In detail, we select the high-quality papyrus35 dataset of 1238835 molecules, combining sources such as ChEMBL39 and Escape-DB,40 where each molecule has an exact bioactivity value measure and is associated with a single protein. We filter those molecules with the Guacamol cleaning strategy41 to ensure drug-like molecules, removing molecules which do not fit the criteria in the process. As we are interested in these molecules for synthesis planning, we remove the building blocks present in Zinc,42 Enamine,43 MolPort,44 and eMolecules.45 Finally, we cluster the resulting set of molecules using Butina Clustering46 using Morgan Fingerprints with a radius of 2, a fingerprint size of 1024, and a Butina cut-off threshold of 0.6. Out of the resulting 137963 centroids, we remove centroid molecules in clinical phases47 to ensure we focus on molecules most relevant to the early drug discovery stages. In detail, we skip two centroids when selecting the largest 10000 cluster centroids, which results in an evaluation dataset of 10000 molecules representing roughly 281000 molecules.
Additionally, we evaluate the synthesis planning capabilities of all single-step models on PaRoutes,33 a collection of 10000 gold-standard retrosynthesis routes. This task differs from the general synthesis planning task with Caspyrus10k in that the goal is to recover specific real-world synthesis routes conducted as part of a patent application process and therefore test the chemical validity of the predicted synthesis routes. The gold-standard routes are obtained from USPTO patent data, where we use the n−1 set, which contains a single retrosynthesis route for each patent. As stated in the PaRoutes dataset, we use a specialized set of building blocks containing the leaf nodes of all 10000 routes. Given the specifics of the PaRoutes dataset, the search algorithm has a maximum route length of 10 as this is the longest extracted route length from patents.
Task | Approach | Description |
---|---|---|
Single-step retrosynthesis | LocalRetro11 | Graph neural network predicting the application of local bond and atom templates |
Chemformer12 | Template-free sequence-to-sequence transformer | |
MHNreact10 | Template-based information retrieval method relating products and template embeddings | |
AZF9,42 (baseline) | Default template-based method | |
Multi-step synthesis planning | Retro*19 | Best-first tree search algorithm leveraging A*-like pathfinding guided by the single-step model |
LocalRetro11 is a template-based method that uses local atom and bond templates. It applies a graph neural network to create embeddings for both atoms and bonds of a product, which are used in a classification task to predict appropriate templates and reaction centers jointly. Contrary to the original implementation of the method, for AZ-1M and AZ-18M we filter for a minimum template frequency of three to avoid an infeasible number of local atom and bond templates.
Chemformer12 is a template-free method based on a transformer architecture that uses BART48 pre-training on molecular SMILES and is then fine-tuned on the retrosynthesis task. It uses product SMILES as input to predict reactant SMILES using beam-search. We set the beam size to 50.
MHNreact,10 a template-based information retrieval approach, trains separate product and template encoders and uses modern Hopfield networks49 to relate products and template embeddings to find the most applicable reaction template. The original implementation uses all template embeddings simultaneously. However, due to large RAM requirements (>300 GB) of this approach for USPTO-PaRoutes-1M, AZ-1M and AZ-18M, the templates are used in batches to train the model. Moreover, we apply a cut-off of a minimum of three template occurrences for AZ-1M and do not show results for AZ-18M as due to increased reaction diversity leading to a much larger number of templates requiring an unfeasible amount of memory.
Additionally, we include a simple template-based model as a baseline referred to as AZF, adapted from NeuralSym,9 which is the default model in the most used public route planning software implementation AiZynthFinder.42 Noteworthy, this model architecture is also commonly used to benchmark novel multi-step search algorithms. Templates are extracted using the standard implementation of RDChiral50 with a radius of two. Only templates with at least three occurrences are kept for USPTO-50k, USPTO-PaRoutes-1M, and AZ-1M, for AZ-18M templates with at least ten occurrences were kept, following.34
Fig. 2 Single-step retrosynthesis prediction Performance in terms of top-n accuracy for AZF, LocalRetro, Chemformer, and MHNreact on different datasets (USPTO-50k, USPTO-PaRoutes-1M, AZ-1M, AZ-18M) (see ESI Table S1†). |
Training dataset | Model | Overall | Average per molecule | ||
---|---|---|---|---|---|
Success rate (%) | Solved routes | Search time (s) | Model calls | ||
USPTO-50k | AZF | 41.1 | 36.1 | 159 | 199 |
LocalRetro | 74.1 | 124 | 161 | 200 | |
Chemformer | 62.4 | 7.37 | 19051 | 177 | |
MHNreact | 51.0 | 38.0 | 28958 | 99 | |
USPTO-PaRoutes-1M | AZF | 66.3 | 83.5 | 163 | 200 |
LocalRetro | 86.0 | 324 | 1218 | 200 | |
Chemformer | 94.1 | 463 | 28809 | 147 | |
MHNreact | 64.6 | 215 | 28839 | 169 | |
AZ-1M | AZF | 73.5 | 124 | 168 | 200 |
LocalRetro | 88.1 | 321 | 465 | 200 | |
Chemformer | 94.5 | 358 | 29109 | 108 | |
MHNreact | 56.0 | 77.0 | 29116 | 65 | |
AZ-18M | AZF | 76.2 | 154 | 154 | 199 |
LocalRetro | 87.3 | 350 | 2736 | 200 | |
Chemformer | 90.9 | 381 | 30212 | 75 |
Training dataset | Model | Overall | Average per molecule | ||
---|---|---|---|---|---|
Success rate (%) | Solved routes | Search time (s) | Model calls | ||
USPTO-PaRoutes-1M | AZF | 97.1 | 159 | 153 | 200 |
LocalRetro | 98.9 | 161 | 1067 | 200 | |
Chemformer | 99.7 | 524 | 28538 | 157 | |
MHNreact | 91.1 | 173 | 28802 | 156 |
The route accuracy of the single-step model in synthesis planning measures how often the gold-standard synthesis route is recovered for a target molecule, where the selected n−1 set33 features only one retrosynthetic route per target-molecule. AZF has by far the best route accuracy overall, recovering 61.8% of gold-standard routes within its top-50 predicted synthesis routes (23.7% at top-1) (Fig. 3). Noteworthy, the performance plateaus after top-10 (at 60.7%) and with little improvement at higher top-n. Both contemporary template-based methods perform similarly across all top-n, but underperform compared to AZF by around −20% (MHNreact: 39.7%, LocalRetro: 36.1%). The template-free Chemformer model is worst-performing across all top-n, reaching only 11.9% by top-50. Noteworthy, the performance for all contemporary models improves until the top-1000 (ESI Fig. S5†), but never reaches the performance of AZF.
Fig. 3 Multi-step synthesis planning accuracy on PaRoutes gold-standard synthesis routes with different single-step models trained on USPTO-PaRoutes-1M. Route accuracy measures the ability to recover the correct synthesis route within top-n, whereas building block accuracy measures the ability to recover the correct building blocks while not considering reactions and intermediates (see ESI Table S7†). |
Considering the building block accuracy, which measures if the correct building blocks of the reference route are predicted while not considering the route reactions or intermediate molecules, considerable improvements for all models are present compared to the route accuracy. Within the top-50 synthesis route predictions, AZF correctly predicts the building blocks for 76.4% of the gold-standard synthesis routes, a +14.6% increase over its route accuracy. This improvement pattern is also present for the contemporary models within the top-50 predicted synthesis routes, where all three contemporary methods see a considerable improvement with at least a +17% improvement between route and building block accuracy.
Additionally, a model performing well on the smaller 50k reaction dataset does not necessarily perform well on larger, more diverse datasets, as the ranking of the best-performing single-step model changes for every dataset. Generally, model performance increases, or stays comparable, with more data available. For instance, for USPTO-PaRoutes-1M, a superset of USPTO-50k with a larger number of reaction classes, the performance increases (AZF and Chemformer) or stays comparable (LocalRetro, MHNreact). This pattern is also present when comparing AZ-1M to its superset AZ-18M, where more data improves the performance slightly (LocalRetro) or substantially (Chemformer, AZF). For AZ-18M, the model with the highest jump in performance is the template-free Chemformer, reaching a top-50 accuracy of 83.1% and substantially outperforming all other template-based methods by +27.3%. Here it seems that the template-based nature of the other two models (AZF, LocalRetro) limits their ability to perform on the largest, most diverse dataset. This indicates that template-based methods may have reached a performance plateau due to not being to extrapolate beyond known templates, a limitation which is not present for the template-free Chemformer. Interestingly, for USPTO-50k, the template-free method is outperformed by all template-based methods at top-10 accuracy. Looking at the performance of AZF on AZ-18M, it is generally worse than shown in ref. 34. The previous work uses a template-based stratified split for the hold-out split, leading to an even distribution of templates across the different splits and ensuring that every template is present in every split, which can benefit a template-based approach. However, in this work, we address the hold-out split by a strict random split on the reaction level, given the nature of the different single-step methods used. With increased data diversity, single-step performance diminishes for all models comparing the equally sized USPTO-PaRoutes-1M and AZ-1M (Fig. 2). Data diversity is measured by the number of extracted unique reaction templates from the training splits of both datasets (USPTO-PaRoutes-1M: 314959, AZ-1M: 439618), representing different reaction ideas present in the respective datasets. This pattern is especially problematic, as USPTO-50k only includes ten reaction classes (USPTO-50k: 10196 unique reaction templates).
Noteworthy, USPTO-PaRoutes-1M,34 with its higher number of reactions and reaction diversity, is also not a perfect single-step model benchmark dataset since all single-step models perform comparably on it. Compared to the alternative public dataset USPTO-Full,52 the performance of all single-step models is much higher on USPTO-PaRoutes-1M, where LocalRetro has a more than +25% top-50 accuracy improvement.8 The difference in single-step performance between USPTO-PaRoutes-1M and USPTO-Full and the equal performance on USPTO-PaRoutes-1M might be explainable by the underlying data sources and their respective preprocessing. USPTO-PaRoutes-1M is a superset of USPTO-Full, where the first contains USPTO grants and applications (3748191 total reactions) and the latter only USPTO grants (1808938 total reactions).36 In terms of preprocessing, USPTO-Full is noisier compared to USPTO-PaRoutes-1M as the latter applies extensive data cleaning and recreates and standardizes the atom-mapping between reactions with RXNMapper.53 Naturally, given that all tested single-step models perform comparably on the most cleaned, standardized, publicly available dataset, the question remains whether a saturation point in single-step performance is reached on public data.
Directly inferring multi-step synthesis planning results from single-step retrosynthesis results is not possible since single-step model performance metrics do not directly transfer to multi-step route planning success. In fact, it is necessary to evaluate the performance of respective single-step models in a multi-step framework to evaluate their synthesis planning performance. In this study, single-step models performing equally well on the USPTO-PaRoutes-1M single-step task are performing vastly differently in multi-step synthesis planning. For example, Chemformer, compared to MHNreact, has considerable differences in multi-step performance with a nearly ±30% higher success rate and finding double the average number of solved routes per molecule (Table 4). Moreover, LocalRetro has a roughly +20% higher success rate than AZF and finds 3.9× the number of solved routes. Looking at the disparities between USPTO-50k and other datasets, LocalRetro has the highest route-finding success of single-step models trained on the USPTO-50k dataset but is not the best-performing model when trained on larger datasets. Additionally, low single-step model performance on AZ-1M still leads to high multi-step performance. Here, the high diversity of reactions in AZ-1M, compared to the equally sized USPTO-PaRoutes-1M, might be the factor for the low single-step model performance. It seems that with fewer correctly predicted reactions, it is still possible to reach high multi-step performance. This aligns with prior works showing that most molecules can be addressed with relatively few reaction templates.36
Even though single-step retrosynthesis models improve the performance of route planning, they are generally not tailored to multi-step search algorithms. Single-step models have slow inference times that can deny high multi-step success rates, as few single-step model calls are possible within a set time limit and can also impede ad hoc synthesis route generation. Attached to the inference problems of single-step models are the algorithmic properties of most multi-step algorithms. Though multi-step algorithms require single-step retrosynthesis models, they are generally developed to address a single molecule as a sequential next-disconnection prediction problem, with few exceptions.20 Single-step models, however, are not optimized for this as they predict reactants for multiple different products simultaneously, typically in a joined GPU batch. Consequently, the combination of single-step and multi-step methods, though both thought for the task of retrosynthesis prediction, are currently not developed to be complementary to each other. Moreover, novel search algorithms, such as implementing asynchronous route planning, could have a substantial impact in this area.
Generally, different models produce different route characteristics on Caspyrus10k (Fig. 4), where the template-free method has noticeably different maximum route length, number of building blocks and number of reactants compared to the template-based methods. AZF and LocalRetro generally have very similar distributions across all characteristics, particularly in maximum route length where MHNreact has markedly shorter routes. Since MHNreact carries out a low number of single-step model calls within the maximum search time, it is likely that it is only able to address and solve short routes. Yet, Chemformer generally has a higher proportion of routes with a maximum depth of one, essentially directly predicting building blocks. Additionally, Chemformer predicts a higher number of building blocks per route compared to all template-based methods, yet this effect is reduced with increased training data (ESI Fig. S2†). Within the template-based methods we observe that the majority of reactions are bimolecular, producing two reactants, this is particularly true for MHNreact. Chemformer on the other hand predicts reactions which at times lead to four or more reactants.
Apart from looking at general route statistics of Caspyrus10k route planning results, we cluster the resulting synthesis routes to understand the relationship between different solved routes produced by distinct models within a reaction dataset. In detail, the approximated pairwise edit distance between solved synthesis routes of the top-5 predictions for each molecule is used to cluster with the route-distance package.54,55 Here, different single-step models produce unique route clusters when looking at the same training data, where routes produced by each model are generally unique to that model (Fig. 5). Noteworthy, routes produced by methods that rely on reaction templates (AZF, MHNreact, LocalRetro) tend to cluster together more frequently. Furthermore, models trained on AZ-18M tend to converge more regarding shared routes between models than models trained on USPTO-PaRoutes-1M. Nevertheless, the bulk of routes remains in unique clusters. Noteworthy, we check that the clustering patterns are also present when removing MHNreact (Fig. S3†) to ensure that the missing MHNreact results for AZ-18M are not the sole reason for the difference between AZ-18M and the other datasets.
Fig. 5 Distribution and overlap of route clusters per single-step model and dataset when clustering with route-distance package.54,55 Clusters were calculated on a per molecule basis, N clusters shows the number of clusters which contained the stated combination of models. |
The availability of solved synthesis routes does not imply that those routes are also chemically valid. Validity can be assessed by comparing the produced routes of a single-step model to gold-standard routes as found in USPTO patents33 to indicate how valid the produced routes are. Generally, different single-step models are distinctive in their ability to reproduce gold-standard chemistry routes, i.e., route accuracy (Fig. 3). Surprisingly, there is no relationship between the multi-step success rate and the route accuracy of a single-step model. All models achieve at least 91% success rates on PaRoutes target molecules (Table 5) but differ considerably between route accuracies. AZF is the best-performing model regarding route accuracy, recovering 23.7% of routes as the top-1 predicted synthesis route and 61.8% within the top-50 predicted routes. In comparison, contemporary models produce lower route accuracy, even if they produce high success rates. Within those contemporary models, template-based models (LocalRetro and MHNreact) have a considerably higher route accuracy than the template-free approach Chemformer, yet still have a considerable gap in performance compared to the route accuracy of AZF.
Instead of predicting the correct gold-standard synthesis route, an easier task is to predict the right building blocks of the gold-standard route. This means that though the gold-standard route may not be entirely correctly predicted the building blocks are correctly predicted in the synthesis route, i.e., the order of the reactions may be incorrect or intermediate molecules are missing. For the easier task of predicting the correct building blocks, all models improve their performance compared to their respective route accuracy. However, the improvement between route accuracy and building block accuracy is much greater, compared to AZF, for contemporary models that operate on local reaction templates (LocalRetro) or no templates at all (Chemformer), potentially meaning that they are more likely to skip vital aspects of the gold-standard synthesis routes in their route predictions rather than producing a distinct retrosynthesis route than the gold-standard route. Overall, the template-based AZF method performs best regarding building block accuracy.
The performance difference on PaRoutes across different methods might be explainable by the allowed degree of chemical freedom of their respective model architectures. Template-based methods are more constrained by the reaction templates they apply, which are extracted from training reactions. With this constraint they are made to follow reaction pathways which are more chemically sound since their templates by definition, must be based on previous reactions. In comparison, the template-free Chemformer performs worst across both route and building block accuracy, potentially explained by the non-existent template guidance of the method allowing it to predict non-chemically sound reactions. Interestingly, this is in line with the divergence of Chemformer from general route statistics on Caspyrus10k, as the model predicts a much higher number of building blocks, multi-molecular reactions and routes that only consist of a singular reaction (Fig. 4).
Generally, contemporary approaches can provide a much larger set of route alternatives (ESI Fig. S5†). This is also reflected in the PaRoutes route and building block accuracy, where AZF plateaus by top-10 accuracy, whereas contemporary methods continue to increase their accuracy into very high top-n (ESI Fig. S4†). Given that contemporary models produce more route alternatives, a future research direction, might be the best ranking of synthesis routes, as it can be assumed that desired routes are present within a large set of found synthesis routes.
An underlying assumption for single-step and multi-step synthesis planning is that the single-step model prior indicates the predicted chemical viability of a reaction for a molecule. We assess this assumed relationship by extracting the predicted reaction probabilities and their respective rank for reactions from the top-10 solved routes of the PaRoutes benchmark dataset by analyzing a random subset of 100000 reactions for each model. Interestingly, models with a smoother progression between probabilities of higher and lower ranked disconnections (AZF, LocalRetro, MHNreact) tend to perform better at recovering gold-standard routes (Fig. 6 and 3). In contrast, a more skewed, overconfident, distribution towards top-1 predictions tends to perform worse (Chemformer). In all cases, reactions of solved synthesis routes contain both low probability reactions and low prediction rank (ESI Fig. S6†).
Fig. 6 Single-step model prior and rank distributions of reactions from the predicted and solved PaRoutes synthesis routes. A random sample of 100000 reactions is extracted from the top-10 predicted routes (see Fig. 3) for each single-step retrosynthesis model trained on USPTO-PaRoutes-1M. |
Though the routes found within the top-10 predicted routes use reactions with very low reaction probabilities, gold-standard routes are generally only found within the top-5 predicted reactions (ESI Fig. S4†). This suggests that routes with reactions ranked outside the top-5 predicted reactions, though leading to building blocks, produce non-viable route reactions (Fig. 6). The presence of these low-probability reactions can be explained by the search algorithm ranking possible synthesis routes by their ability to reach building blocks and their overall route length. In the tree search itself, the search algorithm prioritizes short and solved routes, which might also include reactions with low probabilities as the overall search goal is to find a synthesis route ending in purchasable building blocks. The effect of low-probability reactions is enforced by adding 50 reactions to the search tree at every time step, even if those disconnections have low probabilities. Noteworthy, it is likely that the tree search algorithm explores those low-probability reactions when the high-probability disconnections are already explored. However, given the overall distribution of reaction priors (Fig. 6), this approach might not be desired for future synthesis planning search algorithms. Furthermore, in future work, it could be interesting to analyze how the synthesis planning results differ when applying only the top-5 predicted reactions, consequently limiting the breadth of the search tree. Given that gold-standard routes are only found within the top-5 predicted routes (ESI Fig. S6†), it opens the question if the resulting synthesis routes are closer to human-desired routes.
When discussing gold-standard synthesis routes, it is important to point out that a gold-standard route is only one way of synthesizing a desired molecule and other valid synthesis routes might also be possible. However, a good synthesis planning application should be able to prioritize real-world routes from a set of all potential routes, even if the favored chemical reactions change over time. Not finding the real-world routes entirely, yet identifying the correct building blocks, indicates that the produced synthesis routes are invalid or potentially missing vital parts of the synthesis route to be directly useful in an experimental setting. Naturally, there is a clear connection between the ability to recover gold standard routes and the ability to predict solved routes at all. High success rates produce route candidates that might be potential real-world synthesis routes but need to consider chemical validity. Because of this lack of validity, candidates are currently treated as initial retrosynthetic ideas. For a real improvement in the field of retrosynthesis, one of the essential questions, beyond improving the generation of possible solved route candidates, is how to evaluate and improve the chemical validity of generated synthesis routes. For this, it is vital to introduce reagents, conditions and yields into synthesis planning in the future and address the chemical feasibility of the generated routes. Though there is currently a lack of in silico synthesis feasibility evaluation, as methods like round-trip accuracy22 only measure if the product is recoverable from the reactants and do not consider full chemical validity, given that retrosynthesis methods do not produce the relevant reagents and conditions required. Newer works have attempted to address this problem by predicting all required components.23 Chemical validity, however, could potentially be addressed with new advancements in the field, such as molecular dynamics or quantum chemistry prediction.
Finally, when selecting the single-step retrosynthesis model for route planning, there are trade-offs between different desired search properties, as no approach outperforms all others if one uses a large enough dataset like USPTO-PaRoutes-1M. Clearly, there is a single-step performance advantage of template-free single-step models on large, heterogenous reaction data. However, this advantage comes at the cost of inference speed at multi-step synthesis planning, where template-based models are generally preferred as they can perform over 200-fold faster than template-free. If the overall goal of synthesis planning is a high success rate with a high average number of produced solved routes while accommodating long search times and a high divergence from reference routes, then the template-free approach, Chemformer, may be relevant. With a slightly lower success rate and average number of solved routes but much shorter runtimes and medium divergence from reference routes the successful contemporary template-based model, LocalRetro, is of interest. For very short run times and low divergence from reference routes yet lower success rate and an average number of solved routes, the default single-step retrosynthesis model, AZF, will be of use. Future developed models can aim to address a combination of these goals.
One of the underlying problems in the field is that benchmarking different single-step retrosynthesis models within synthesis planning is time- and resource-intensive. In order to facilitate such benchmarking in the future, we analyze the variance of different subsample sizes of the Caspyrus10k multi-step synthesis dataset such that an approximation of the results can be carried out in lieu of running the full datasets for faster benchmarking/prototyping (see ESI Tables S2–S5†). In detail, we repeatedly randomly subsample a subset of molecules (100, 500, 1000, 5000 molecules) and measure the mean and standard deviation across 1000 subsamples (sampling without replacement). Given that the standard deviation is reasonably small for a sample size of 1000 molecules (see ESI Table S4†), we provide a selected set of 1000 molecules if a full evaluation is not feasible (see ESI Table S6†).
Noteworthy, this work only explores three contemporary and a common baseline single-step retrosynthesis models, and even though representative of the common research directions, gives us only a snapshot of possible single-step and multi-step retrosynthesis combinations. In the future, it might be interesting to increase the evaluation framework beyond a well-chosen singular search algorithm as search algorithms are an active research field and newer approaches could improve the performance further.
For multi-step synthesis planning, we show that the single-step model is an essential but thus far underexplored aspect of the search algorithm. By merely changing the single-step retrosynthesis model it is possible to improve route-finding success by up to +28%, reaching success rates above 90% compared to the commonly used baseline model, when trained on the same reaction datasets, showing the benefit of further exploring single-step models within future multi-step frameworks. Furthermore, we show that every single-step model produces unique synthesis routes when used in multi-step synthesis planning with varying route characteristics, and each single-step model also differs in important aspects such as route-finding success, the average number of found synthesis routes, search times, and chemical validity. To summarize, we show that the combination of single-step retrosynthesis prediction and multi-step synthesis planning is a crucial aspect when developing future methods.
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00252g |
‡ These authors contributed equally to this work. |
This journal is © The Royal Society of Chemistry 2024 |