Samuel Genheden* and Esben Bjerrum
Molecular AI, Discovery Sciences, R&D, AstraZeneca Gothenburg, SE-431 83 Mölndal, Sweden. E-mail: samuel.genheden@astrazeneca.com
First published on 12th July 2022
We introduce a framework for benchmarking multi-step retrosynthesis methods, i.e. route predictions, called PaRoutes. The framework consists of two sets of 10000 synthetic routes extracted from the patent literature, a list of stock compounds, and a curated set of reactions on which one-step retrosynthesis models can be trained. PaRoutes also comes with scripts to compute route quality and route diversity, quantities that are important for comparing methods. As an illustration of the framework, we compare three template-based methods implemented in the AiZynthFinder software: Monte Carlo tree search (MCTS), Retro*, and a depth-first proof-number search (DFPN) algorithm. It is found that DFPN is inferior to both MCTS and Retro* and cannot be recommended in its current implementation. MCTS and Retro* are on a par with regard to search speed and the ability to find routes in which all starting material is in stock. However, MCTS outperforms Retro* when it comes to route quality and route diversity. MCTS more easily recovers the reference routes and tends to find a diverse set of solutions for a greater portion of the targets. Having showcased the benchmark for template-based methods, we discuss potential issues and caveats when adapting the framework for other methods, e.g., template-free methods or expert systems. We will continue to update and expand the application of PaRoutes, and we also encourage practitioners and developers to adapt PaRoutes to their algorithms, as we envisage that the framework could become the community standard for comparing retrosynthesis route predictions. PaRoutes is available at https://github.com/MolecularAI/PaRoutes.
One-step retrosynthesis methods have received the greatest amount of attention: there is a plethora of methods described in the literature that use a diverse range of approaches to extract synthesis rules, cheminformatic representations, neural network architectures, sampling techniques, etc. [see e.g. ref. 5–15]. For the practitioner who wants to use a method for predictions, or for a researcher developing a novel method, there are several comparisons for a subset of the available methods. In fact, once a novel method is developed it is customary to benchmark it against other methods using a common set of known reactions. The US patent office (USPTO) extracts provided by Lowe16 are the de-facto standard for comparing single-step retrosynthesis methods, as they constitute one of the few open-source datasets of chemical reactions. However, there is more than one curated subset of the dataset in use,17,18 which is a complication. Furthermore, no survey has been published showing all one-step methods side by side, so it is still difficult to understand the range of available methods, although reading recent studies such as ref. 19 could provide a good overview.
Multi-step retrosynthesis methods, or route prediction methods, have received less attention. To the practitioner, the pool of available solutions is dominated by commercial and closed-source alternatives,2 although a few complete, open-source packages have emerged, such as the ASKCOS suite from MIT20 and AiZynthFinder from AstraZeneca.21 An interesting tool is RXN for Chemistry, which not only predicts reactants but also reagents and is available through a free application programming interface.22 In addition, there have been a few algorithms described in the literature, which we will summarize in Section 4. What is lacking with regard to route prediction methods is comparative studies. There is a lack of consensus on how to compare route predictions and on what data the comparison should be done. One reason for this could be that there is no public database of synthetic routes, as there is the USPTO data for single-step reactions. A further complication is that there is typically no unique way to synthesize a compound, and several alternative routes could be used that are optimal for different scenarios. With regard to the data problem, there have been a few attempts to collect a set of routes on which one can perform a comparison: Heifets and Jurisica23 compiled a suite of routes from organic chemistry examinations that they then used to benchmark their approach. Unfortunately, the suite is extremely small, only 20 routes, and it is therefore difficult to perform a statistical analysis of the route predictions. Chen et al.24 extracted routes from the USPTO dataset to train a neural network for computing the cost of synthesizing a molecule. They found all compounds in the USPTO dataset that had a route to compounds in the eMolecules database,‡ and then performed further selections to arrive at a test set of size 189. Interestingly, they also used this to compare four different multi-step retrosynthesis methods (and variants thereof).
They focused the comparison on computing time, the number of targets for which a route was found, and the length of the routes. Furthermore, Bradshaw et al. also extracted routes from a cleaned sub-set of USPTO to train a deep generative model.25 As in ref. 24, they utilized a specification of building blocks to define the reaction network and to find synthetic routes. Finally, Mo et al.26 also extracted routes from the USPTO dataset for training their neural network for predicting the human-likeness of routes. Instead of extracting routes from the full reaction network as in ref. 24 and 25, they extracted routes within a patent. Each patent consists of one or more reactions, and the resulting reaction network contains highly related reactions. Using a complete depth-first search of such networks for each patent, 238 K routes were extracted. This is an excellent number of routes for training a neural network model but is probably excessive for benchmarking route prediction methods. However, we will build upon this methodology in Section 2 to extract routes suitable for benchmarking multi-step retrosynthesis tools. We will then proceed to discuss and suggest metrics to compute when comparing predictions on these routes in Section 3. Finally, we will apply this framework in an illustrative comparison of three kinds of search algorithms for template-based retrosynthesis implemented in the AiZynthFinder software in Sections 4 and 5.
Fig. 2 – Procedure to extract reference routes from a curated subset of the USPTO database. Details of the procedure are outlined in Section 2.
We extracted 1046088 routes from 80639 patents using the method of Mo et al.,26 provided as a script,§ but with the timeout for finding a route increased from 6 to 10 s. The script puts all the reactions from a patent in a reaction network (see an example in Fig. 3), identifies molecules that only exist as a product as the starting points for synthetic routes, and then uses a depth-first search to extract the routes. After extracting the routes, we immediately discarded all routes with a single leaf, to avoid uninteresting transformation sequences, resulting in 158698 routes. The distributions of the number of molecules, leaves, reactions, and longest linear route (LLR) are shown in Fig. 4. The dataset is tilted towards short routes with few leaves. Only 4.7% of the routes are convergent; the remainder are linear. We believe the 150 K routes are too extensive for benchmarking route predictions, and it is likely that many routes from the same patent are similar. Therefore, we processed the dataset further, although we acknowledge that the 150 K routes could be used in the future for e.g., machine learning tasks, and thus we provide them for download.
Fig. 3 – Reaction network extracted from the US03983174 patent. Reactions are represented by solid circles and connect the reactants on the left with the product on the right. The molecules with a green frame highlight one of the possible synthetic routes for 2-chloro-N-(2,6-diethylphenyl)-N-(2-oxoethyl)acetamide and correspond to the route in Fig. 1.
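The patent-level route extraction described above can be sketched as follows. This is a simplified, hypothetical illustration of the idea (one reaction network per patent, targets are molecules that only occur as products, routes enumerated depth-first), not the actual script of Mo et al.:

```python
def extract_routes(reactions):
    """reactions: list of (product, [reactants]) pairs from one patent."""
    products = {p for p, _ in reactions}
    reactant_set = {m for _, rs in reactions for m in rs}
    targets = products - reactant_set  # molecules that only occur as products

    by_product = {}
    for p, rs in reactions:
        by_product.setdefault(p, []).append(rs)

    def expand(mol, seen):
        # Leaf: nothing in the network produces this molecule (or a cycle)
        if mol not in by_product or mol in seen:
            return [{"smiles": mol, "children": []}]
        trees = []
        for rs in by_product[mol]:
            # In general one route per combination of child subtrees is
            # needed; for brevity we keep the first subtree per reactant
            children = [expand(r, seen | {mol})[0] for r in rs]
            trees.append({"smiles": mol, "children": children})
        return trees

    return {t: expand(t, set()) for t in targets}
```

The real script additionally handles timeouts and multiple routes per target; the sketch only conveys the network-to-routes idea.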
We randomly selected n routes from each patent and then performed an overlap check: no route should have leaf molecules that exist in another route as intermediate precursors (non-leaves), and the target molecule should not exist as an intermediate in another route. The motivation behind this check will be explained below. We then discarded all routes with a depth of more than 10 reactions, to exclude a few exceptionally long routes that would require an extensive search (this excludes less than 100 of the 150 K routes, possibly reflecting USPTO's patent-oriented chemical scope without extensive coverage of complex syntheses). For the non-overlapping routes with a depth of at most 10 reactions, we selected the 10000 most diverse routes: the pair-wise distance matrix was computed using the machine learning approach previously described28 and then a greedy search with a maxmin criterion was used to select the routes. We created two such sets of 10000 routes, where either one or five routes were extracted from each patent before checking for overlap (n = 1 or n = 5). We will refer to these two sets as set-n1 and set-n5, respectively. The distributions of the number of leaves, molecules, reactions, and longest linear route (LLR) are shown in Fig. 4. For set-n1, the distributions are close to those of the full set of routes: the number of molecules in the routes is typically small and the number of longer routes is quite small. On the contrary, in set-n5, the distributions are shifted to the right and there are more of the longer routes. Convergent routes make up 3.0% and 6.9% of set-n1 and set-n5, respectively. We classified the reactions in the reference routes using the NextMove software29 and analyzed the occurrences of the different classes. This is shown in Fig. S1,† and we can see that the most common reaction classes are deprotections, N-acylation to amide, O-substitution and other functional group interconversions.
It is clear that the reference routes contain predominantly common chemistry, a reflection of the USPTO database. Furthermore, the reference routes represent chemistry similar to that of the reactions in the popular USPTO-50k subset,17 although we have not excluded reactions that cannot be classified by NextMove.
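The diverse-subset selection can be sketched with a greedy maxmin search over the precomputed pairwise distance matrix. The following minimal version is illustrative only (the function name and starting-point choice are assumptions):

```python
def maxmin_select(dist, n_pick, start=0):
    """Greedy maxmin: dist is a symmetric matrix (list of lists);
    returns indices of n_pick diverse items."""
    selected = [start]
    # distance from every candidate to its closest already-selected item
    min_dist = list(dist[start])
    while len(selected) < n_pick:
        # pick the candidate farthest from the selected set
        best = max(range(len(dist)), key=lambda i: min_dist[i])
        selected.append(best)
        min_dist = [min(min_dist[i], dist[best][i]) for i in range(len(dist))]
    return selected
```

In practice a library implementation (e.g. a maxmin picker) would be used on the 150 K-route distance matrix; the sketch shows only the selection criterion.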
Another factor affecting the search algorithm is of course the capability of the one-step retrosynthesis model. We have previously released a model trained on the entirety of the USPTO dataset,21,27 but this model was trained on reactions found in the reference routes, making it biased. Using this model in the search, or indeed any one-step model trained on the reactions in the reference routes, results in a mix of neural network recommendations and what amounts to literature lookup. However, we can train a new one-step model on the data that is not found in the reference routes, and because we left out three reactions per reaction template before extracting the routes, we have sufficient data to train a template-based one-step model. In Section 4, we will detail the training of such a model. If someone wants to extract another set of templates or train a template-free one-step model, this can likewise be done on the provided subset of USPTO. What could be interesting to consider in the future is a time-split of the original data, e.g., extracting reference routes based on reactions dating before a cut-off year and then training one-step models on reactions after the cut-off year.
•A subset of the USPTO database with reactions that can be used to train a one-step model.
•∼150 K routes extracted from the USPTO database, which can be used for machine learning tasks.
•Set-n1 consisting of a diverse set of 10000 routes which show a similar distribution in the number of molecules and reactions as the 150 K routes.
•Set-n5 consisting of a diverse set of 10000 routes that are longer and enriched in convergent routes.
•Stock-n1 consisting of the 13633 leaf molecules in set-n1, which should be used as a stock together with set-n1.
•Stock-n5 consisting of the 13783 leaf molecules in set-n5, which should be used as a stock together with set-n5.
The USPTO dataset, reference routes, and stocks are available as open-source together with the scripts used to create the reference routes.
We call the benchmarking set the PaRoutes (patent routes) framework. It is a framework for benchmarking route predictions because we acknowledge that not all methods are built the same. For instance, we have herein benchmarked algorithms using template-based one-step retrosynthesis methods, but there exists a plethora of alternatives. Therefore, researchers should be able to pick the parts of the framework that are applicable to their methods. For example:
•If your one-step retrosynthesis model is based on templates, you can re-train it using the provided USPTO subset, which includes RDChiral-derived templates.30
•If you want to derive templates yourself, or if you are using a template-free one-step model, you can use the atom-mapped reaction SMILES from the provided training set.
•If you have your own set of curated reactions that you want to train your one-step model on, you have to make sure to first exclude reactions that are also in the reference routes.
•If your one-step model is an expert system not derived from big-data extraction, you should be able to use the provided targets and stock, and not consider the provided reaction datasets.
Because many of the existing one-step models were trained on reactions included in the reference routes, we acknowledge that many of those models need to be re-trained to fully exploit PaRoutes. However, some of the adaptations needed with the current version of PaRoutes would complicate the comparison, an issue we will discuss more extensively in Section 5.5.
Although speed and number of solved targets are interesting metrics, they tell nothing about the quality of the predictions. Sometimes the quality of the routes has been quantified as the length of the routes,24,31 the mean chemical complexity,32 or a metric based on the priors from the one-step model.24 Quantifying the quality of a route simply by the route length is particularly misleading, as it is very easy to envisage a shorter alternative to a route, for instance by removing all protection chemistry, thus rendering the route chemically infeasible. An interesting approach taken by Shibukawa et al.32 was to sort the molecules in the routes by molecular weight and then compute the pairwise Tanimoto distances of a molecular fingerprint. Naturally, the sorting of the molecules destroys the order of the reactions in the route. We propose to use a tree edit distance (TED) method,33 a graph-theoretical method that recursively applies cheminformatic similarity calculations on a pair of routes to determine their similarity. By sorting the predicted routes and then computing the TED between the predictions and a reference route, one can find at what position the ground truth, i.e. the reference route, is found. By doing this over all the 10000 targets in the reference set, one can compute top-n accuracies, just as is standard when comparing one-step retrosynthesis methods. The top-n accuracy is a metric showing how well a search method recovers the reference routes. Caveats with this definition of accuracy are discussed in Section 5.5.
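To make the accuracy calculation concrete, the following is a minimal sketch with hypothetical names; `distance` stands in for the TED calculation, and the predictions are assumed to be pre-sorted by route score:

```python
def topn_accuracy(all_predictions, references, distance, n):
    """Fraction of targets whose reference route (distance 0) appears
    among the first n score-sorted predictions."""
    hits = 0
    for preds, ref in zip(all_predictions, references):
        # ranks of predictions identical to the reference route
        ranks = [i for i, p in enumerate(preds) if distance(p, ref) == 0.0]
        if ranks and ranks[0] < n:
            hits += 1
    return hits / len(references)
```

With an exact-match placeholder distance this reduces to a simple lookup; the PaRoutes scripts use the TED between route trees instead.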
In order to compute accuracies, we need to rank the predicted routes, and we propose using the recursive route score by Badowski et al.34 with artificial costs for the leaf molecules. For an intermediate molecule m in a route, the cost is defined recursively as

cost(m) = min_{r ∈ pred(m)} cost(r)    (1)

where pred(m) is the set of reactions that produce m. The cost of a reaction r is in turn the sum of a fixed reaction cost c_rxn and the costs of its reactants,

cost(r) = c_rxn + Σ_{m′ ∈ react(r)} cost(m′)    (2)

where react(r) is the set of reactants of r, and leaf molecules are assigned fixed artificial costs depending on whether they are in stock.
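A hedged Python sketch of this recursive cost follows; the specific constants for the reaction cost and the artificial leaf costs are illustrative placeholders, not the values used in PaRoutes:

```python
# Assumed, illustrative constants (not the PaRoutes values)
REACTION_COST = 1.0
LEAF_IN_STOCK, LEAF_NOT_IN_STOCK = 0.0, 10.0

def route_cost(node, stock):
    """node: {'smiles': str, 'children': [reaction, ...]} where each
    reaction is a list of child molecule nodes."""
    if not node["children"]:  # a leaf molecule: artificial cost
        return LEAF_IN_STOCK if node["smiles"] in stock else LEAF_NOT_IN_STOCK
    # cost(m): minimum over the reactions producing m of the reaction cost
    # plus the (recursive) costs of that reaction's reactants
    return min(
        REACTION_COST + sum(route_cost(child, stock) for child in reaction)
        for reaction in node["children"]
    )
```

The score naturally penalizes long routes and routes ending in out-of-stock leaves, which is what the ranking needs.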
Finally, we propose a metric to quantify the diversity of the predictions. We argue that an algorithm producing more routes is not necessarily better than an algorithm producing fewer routes, because the routes can be highly similar. To compute the diversity, we compute the pairwise distance matrix of all the predicted routes using the fast machine learning method previously described28 that replaces the expensive TED calculations with a fast LSTM-based (long short-term memory) model, and then we use hierarchical clustering to group the routes. We optimize the number of clusters using the Silhouette method35 and the optimal number of clusters is viewed as a metric of diversity.
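As an illustration of the diversity metric, the silhouette score for a candidate clustering can be computed from the precomputed route distance matrix alone. The sketch below is a simplified, pure-Python version with hypothetical names; PaRoutes itself relies on hierarchical clustering and silhouette routines from standard libraries, and this version assumes at least two clusters:

```python
def silhouette(dist, labels):
    """Mean silhouette over all items, given a distance matrix and labels."""
    n, total = len(labels), 0.0
    for i in range(n):
        same = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        a = sum(same) / len(same) if same else 0.0  # mean intra-cluster distance
        other = {}
        for j in range(n):
            if labels[j] != labels[i]:
                other.setdefault(labels[j], []).append(dist[i][j])
        b = min(sum(v) / len(v) for v in other.values())  # nearest other cluster
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / n

def optimal_n_clusters(dist, cluster_fn, max_k):
    """cluster_fn(dist, k) -> labels; pick k in [2, max_k] with best silhouette."""
    return max(range(2, max_k + 1),
               key=lambda k: silhouette(dist, cluster_fn(dist, k)))
```

The optimal k returned by `optimal_n_clusters` is what we report as the diversity metric.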
To summarize, we suggest applying the following metrics when comparing route prediction methods on the 10000 reference targets:
•Average search time to reach convergence in the number of solved targets.
•Number of solved targets.
•Top-1, top-5 and top-10 accuracies computed with the TED method.
•Number of route clusters.
Scripts to compute these route quality and diversity metrics are available open-source on GitHub and are considered part of the PaRoutes framework. We consider the calculation of timings and of whether targets are solved to be the responsibility of the software producing the routes. In order to compute the quality and diversity metrics, the routes need to be exported in a tree-like structure (in JSON format) with minimal features. Basically, the relationships between molecules and reactions need to be defined; molecules are defined by their SMILES strings and reactions are featureless.
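To illustrate, a minimal route export could look as follows. The field names ("type", "smiles", "children") follow the convention used by AiZynthFinder's JSON output, but should be verified against the PaRoutes scripts before use:

```python
import json

# A one-step route: molecule nodes carry a SMILES string, reaction nodes
# are featureless, and "children" encode the molecule/reaction relationships.
route = {
    "type": "mol",
    "smiles": "CC(=O)Nc1ccccc1",
    "children": [
        {
            "type": "reaction",
            "children": [
                {"type": "mol", "smiles": "Nc1ccccc1", "children": []},
                {"type": "mol", "smiles": "CC(=O)Cl", "children": []},
            ],
        }
    ],
}
print(json.dumps(route, indent=2))
```

Deeper routes are expressed by nesting further reaction nodes under the precursor molecules.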
Although the algorithms that we will use to predict routes are representative of different classes of search algorithms, we acknowledge that our implementations are variants of the original implementations described in the literature. For instance, there are many hyperparameter settings that could be explored. As such, the benchmarks presented herein should serve as an illustration of the PaRoutes framework and provide insight into the capabilities of the AiZynthFinder software.
Search method | Route set | Solved targets | Search time^a | First solution time^a | One-step model calls^a | Template applications^a
---|---|---|---|---|---|---
MCTS | Set-n1 | 9714 | 303.3 | 8.6 | 3355.6 | 8658.2
MCTS | Set-n5 | 9676 | 365.7 | 11.7 | 3615.3 | 8953.0
Retro* | Set-n1 | 9726 | 300.7 | 7.0 | 497.4 | 24281.1
Retro* | Set-n5 | 9703 | 349.2 | 10.5 | 498.0 | 24322.5
DFPN | Set-n1 | 8475 | 347.3 | 43.0 | 404.5 | 19503.2
DFPN | Set-n5 | 7382 | 297.9 | 53.2 | 414.5 | 19957.6

^a Averages over all targets. Each route set consists of 10000 targets. The times are wall-times on a single CPU.
Because of the different amounts of one-step calls and template applications performed at each iteration, it is hard to benchmark the search methods based on the search time alone. However, when focusing on the average time to find the first solution (Table 1), DFPN is clearly outperformed by MCTS and Retro*: it is approximately six times slower than either in finding the first solution. Analyzing the predictions on the set-n5 routes, we find that for these targets all algorithms are generally slower in finding the first solution than on set-n1, indicating that these compounds are more complex.
With respect to the comparison of MCTS and Retro*, there are some implementation details to consider. In the paper introducing Retro*, it was argued that Retro* was advantageous over MCTS because it made fewer calls to the one-step model.24 However, in our implementation of MCTS, the template application is deferred until it is necessary, heavily reducing the number of template applications. Thus, for the template-based one-step models used herein, we observe that MCTS and Retro* are approximately equally fast to reach convergence in the number of solved routes. However, template-free one-step retrosynthesis models, e.g. transformer-like architectures, are becoming increasingly popular; these perform a translation task that effectively combines the template recommendation and template application tasks of a template-based one-step model. Such machine learning (ML) models are typically much slower than the simple feed-forward architecture used in the template-based model. Therefore, using a more expensive ML model or template-free methods that do not need to apply a template would tip the speed advantage in favor of Retro*.
As discussed in Section 2, the distributions of the route shapes for set-n1 closely follow the distributions for the full set of routes extracted from the patent data. However, we also extracted a set of routes that is enriched in longer and convergent routes, which presumably should pose a greater challenge to the search algorithms. For all three search algorithms, we indeed observe a decrease in the number of solved targets when comparing the predictions made on the set-n1 routes and the set-n5 routes (see Table 1). Encouragingly, the decrease is typically rather small for MCTS and Retro*, between a fraction of a percent and a few percent. This shows that even though these algorithms require somewhat more time to find a solution, they are capable of breaking down the compounds to starting material in stock. However, for DFPN, the decrease in the number of solved routes is on the order of 10%, a significant amount. This shows that DFPN is not only inferior to MCTS and Retro* in finding solutions but also that its capabilities worsen with compound complexity.
In the ESI, Table S1,† we make a cross-comparison of the different search methods on the set-n1 routes to determine whether they are complementary. We find that MCTS and Retro* find solutions to largely the same targets, with a practically negligible fraction of targets for which only Retro* finds a solution. Furthermore, DFPN solves no unique targets, whereas both MCTS and Retro* find solutions to targets for which DFPN finds no solution.
Search method | Route set | Top-1 | Top-5 | Top-10 | Shorter route^a | Leaves overlap^b | Routes extracted^c | Number of clusters^c
---|---|---|---|---|---|---|---|---
MCTS | Set-n1 | 0.20 | 0.55 | 0.61 | 0.44 | 0.68 | 273 | 68
MCTS | Set-n5 | 0.09 | 0.34 | 0.42 | 0.59 | 0.62 | 272 | 77
Retro* | Set-n1 | 0.17 | 0.48 | 0.54 | 0.44 | 0.68 | 264 | 68
Retro* | Set-n5 | 0.08 | 0.30 | 0.38 | 0.61 | 0.63 | 149 | 39
DFPN | Set-n1 | 0.19 | 0.33 | 0.33 | 0.45 | 0.63 | 6 | 2
DFPN | Set-n5 | 0.08 | 0.14 | 0.14 | 0.65 | 0.55 | 6 | 2

^a The fraction of targets for which a shorter route than the reference route is found in top-1. ^b The average leaves overlap in top-1; the leaves overlap is the maximum overlap between the leaves in a predicted route and the leaves in the reference route. ^c Medians over all targets.
To compare the different search algorithms further, we performed a cross-comparison: given a pair of experiments, we calculated the percentage of targets for which the reference route was identified by both methods, by just one method, or by neither of the methods (see Table 3). Comparing MCTS and Retro*, we see that both find the reference route for 48% of the targets, and for 29% of the targets, neither method is able to find it. Interestingly, the predictions are partly complementary: only MCTS finds the reference routes for 15% of the targets, whereas only Retro* finds them for 9% of the targets. Similar observations can be made when comparing the other pairs of search methods: both MCTS and Retro* outperform DFPN in uniquely finding reference routes to varying degrees. In Fig. 7, we present the routes for a target for which the reference route is found by MCTS but not by Retro*, and in the ESI† we present a few similar analyses. In the routes presented in Fig. 7, we can see that the top-1 routes for both MCTS and Retro* start similarly and share the first three steps. However, in the reference route, the final steps are a sulfonylation followed by an SNAr reaction. This route is also among the extracted predictions from MCTS but not from Retro*. Instead, Retro* predicts an alternative route with the SNAr first and the sulfonylation second, which is its only top-1 route. This route is also produced by MCTS, showing that this algorithm is able to extract both solutions. It is not clear why Retro* did not recover the reference route, because the template corresponding to the SNAr was suggested by the one-step model, as evident from the MCTS results. However, something in the prioritization of nodes to expand deemed that path unpromising, and the search continued along the path following from the sulfonylation template.
Method 1 | Method 2 | Found by both | Found by method 1 only | Found by method 2 only | Found by neither
---|---|---|---|---|---
MCTS | Retro* | 0.48 | 0.15 | 0.09 | 0.29
MCTS | DFPN | 0.33 | 0.30 | 0.01 | 0.37
Retro* | DFPN | 0.27 | 0.30 | 0.07 | 0.37
To further investigate the rather low accuracies, we did two additional analyses that are included in Table 2. First, we calculated the fraction of targets for which a shorter route than the reference route is found in top-1. For all three methods this fraction is around 0.45, showing that all methods were able to find alternative routes that are shorter for close to half of the targets. Second, we calculated the maximum leaves overlap between the leaves of the top-1 ranked routes and the leaves of the reference route. An overlap of 1.0 would imply that the predicted routes have exactly the same leaves as the reference route but the reactions are not necessarily in the same order. We observe that the average leaves overlap is between 0.63 and 0.68 for the different experiments, indicating a high overlap. In fact, the leaves overlap is 1.0 for approximately one third of the targets in all experiments, which is significantly higher than the top-1 accuracies. This shows that the different search methods were able to identify routes similar to the reference routes, but with some of the steps interchanged.
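The leaves-overlap analysis can be sketched as a simple set comparison; the function below is illustrative (a Jaccard-style normalization is assumed here, and the exact normalization used in PaRoutes may differ):

```python
def leaves_overlap(pred_leaves, ref_leaves):
    """Overlap between the leaf molecules of a predicted route and a
    reference route: shared leaves divided by the union of all leaves."""
    pred, ref = set(pred_leaves), set(ref_leaves)
    return len(pred & ref) / len(pred | ref)
```

For each target, the maximum of this value over the top-1 routes is what enters Table 2.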
The accuracies for the predictions on the set-n5 routes are significantly lower, with the best-observed top-1 around 0.10 and the best top-10 around 0.42 (see Table 2). This shows, more clearly than the fraction of solved targets, that set-n5 is a challenging set for any route prediction method. MCTS slightly outperforms Retro*, but it is unclear if the differences are practically relevant. There is a trend that the fraction of targets for which a shorter route than the reference route was found in top-1 is greater than for the set-n1 routes, and it is also clear that the leaves overlap in top-1 is decreased compared to the set-n1 routes. What this analysis shows is that all of the methods are in need of improvement when it comes to finding long and convergent routes. All three search methods can predict shorter routes, which in Table 1 is reflected as a high fraction of solved targets, but they struggle to recover the human-like reference routes.
In the future it would be interesting to study the top-n accuracies as a function of search time, which would answer how long a search one has to perform to find a good route. However, such a calculation would be technically challenging to implement, and therefore we did not include it in the first version of PaRoutes.
As argued in Section 3, it is not only the number of found routes that is important but rather the diversity of the routes. To measure this, we find the optimal number of clusters formed by the predicted routes. MCTS fares well in this comparison, with a median of 68 clusters for set-n1 and 77 for set-n5. Retro* is as successful in finding a diverse set of routes for set-n1, with a median of 68. However, for set-n5 the diversity is lower, with a median number of clusters of 39. Because DFPN produces few routes, it is not surprising that the median number of clusters is two for both sets of routes. As seen in Fig. 8, the distribution of the optimal number of clusters is also skewed. First, there is a sizable portion of the targets for which a relatively small number of clusters (<10) is found, and this is true for all three search methods. Then there is a roughly equal portion of targets for which the number of clusters is more than 10 for MCTS and Retro*. Finally, for Retro* we find a few targets with more than 500 clusters.
With any benchmark using quantitative metrics, it is easy to construct adversarial strategies that artificially score well. There is also the risk that retrosynthesis methods start to follow Goodhart's law, which states that when a measure becomes a target, it ceases to be a good measure. For instance, one could train a one-step retrosynthesis model on a set of reactions artificially made very similar to the reactions in the reference routes, and a multi-step retrosynthesis algorithm using such a one-step model would be artificially biased. We provide a defined subset of USPTO that lacks the reactions in the reference routes. Of course there are similar reactions in the provided dataset, but it would be difficult to recover the reference routes with a one-step model trained on a completely different part of reaction space, so some similarity is warranted. We therefore suggest investigating reaction-space overlap and developing a robust metric for quantifying reaction dataset similarity. Furthermore, reinforcement learning could be used to overfit to the provided stock molecules and thus provide an advantage for the sake of the benchmark. In this scenario, PaRoutes could be used to measure how well the reinforcement learning would work, before retraining on the specific stock database of interest.
We focus PaRoutes on a narrow search problem: finding a single reference route for each target using a narrow set of stock molecules. Naturally, other routes could be as effective or feasible as the reference routes, but the only way to determine that is to perform the synthesis in the lab, which is not practically feasible for large numbers of routes. Because the routes are extracted from patents, they incorporate the human selection of feasible disconnections and the order of those steps, and thereby represent one possible route. We argue that an effective search algorithm should be able to find this one possible route, and we therefore limit the search problem to recovering the one route for each target that we have knowledge about. Similarly, the definition of stock naturally narrows the search scope and potentially prevents the finding of alternative, more creative routes to alternative stock molecules. But because we cannot evaluate alternative routes, the narrow search is still meaningful. In the future, we could potentially extract more than one route for each target from USPTO, and extend the accuracy calculations to encompass these routes. However, such a set of routes would still not be a complete set of feasible routes for each target.
We consider a limited reaction space: the routes were filtered to consist only of reactions with an extracted template seen more than three times. This step was necessary as a noise filter, to avoid having too many erroneous reactions in the training and test reactions, such as reactions with missing or wrongly extracted molecules or wrong atom-mappings. More robust template extraction or atom-mapping could be developed in the future, which would change the number of reactions that could be included. Nevertheless, PaRoutes would be biased towards this selection of reactions and therefore could favor retrosynthesis methods that are trained on these types of reactions and disfavor other methods. For instance, a template-free one-step retrosynthesis model could in principle be trained on the full reaction space and therefore could potentially predict a more diverse set of routes. However, as discussed above, we have no way to assess the feasibility of alternative routes and we have narrowed the search problem to recovering one known feasible route. Therefore, a potential benchmarking scenario is to run two sets of retrosynthesis experiments, one where a template-free model trained on the full reaction space is used and one where a template-free model trained on the limited subset of USPTO is used. In both sets of experiments, the multi-step method would be the same and thus the benchmarking would reduce to a study of the effect of the one-step model. On the other end of the scale of reaction space coverage, one could imagine an expertly curated one-step model with limited coverage. In this scenario, the retrosynthesis method could fail to recover the reference route because it does not have the necessary reactions. We envisage that a metric showing how well the one-step model recovers the reactions in the reference routes can be added to PaRoutes in the future.
To illustrate the use of PaRoutes, we have carried out route prediction experiments in order to benchmark three different search algorithms implemented in the AiZynthFinder software. We can draw some conclusions from these experiments: our implementation of DFPN is clearly inferior to MCTS and Retro*. DFPN finds fewer routes, solves fewer targets, and recovers the reference routes to a lesser degree. Further investigations are necessary to find the root cause of this, but it is clear that, in its current implementation, DFPN cannot be recommended. It should be mentioned that several implementations of proof-number search have been suggested in the literature, but all except one were designed to find a solution and then stop. The comparison of MCTS and Retro* is less clear-cut: both search algorithms solve roughly the same number of targets in roughly the same time. With respect to route quality, MCTS slightly outperforms Retro*, especially for the set-n1 routes. Because both algorithms solve approximately the same number of targets, this shows that Retro* finds different solutions compared to MCTS. Which route would be the most chemically feasible is impossible to determine with certainty without doing experiments. Therefore, until we have a method that can practically evaluate multiple routes for a target, one can do what we suggest here, which is to measure the accuracy of the search algorithms in finding the route about which there is knowledge. It is worth emphasizing that all methods struggled with the set-n5 routes. Therefore, identifying and exploring human-like routes for complex targets that require longer synthetic routes seems to be an outstanding problem to solve.
PaRoutes was developed and released open-source with the vision of creating a community standard for benchmarking route predictions. One-step retrosynthesis is chiefly benchmarked using patent data, and we envisage that PaRoutes will develop into the analog for multi-step methods. We believe that the lack of comparative studies of route predictions hampers development and is detrimental to the transparency and reproducibility of published research. We acknowledge that PaRoutes in its current version does not solve all problems associated with comparing route predictions, such as how to deal with multiple alternative routes or with algorithms that have different prerequisites for training data, but we argue that the framework targets an unmet need. Furthermore, we hope that PaRoutes can evolve with community contributions; for instance, we envisage that other quality metrics will be included in the framework when such methods are developed. By benchmarking our search algorithms with diverse metrics, we can as a community truly understand the subtle differences and limits of the current state-of-the-art in computer-aided retrosynthesis planning and pave the way for novel developments that have the potential to impact molecular design campaigns.
Footnotes
† Electronic supplementary information (ESI) available. See https://doi.org/10.1039/d2dd00015f
‡ https://downloads.emolecules.com/free/.
§ https://github.com/moyiming1/Retrosynthesis-pathway-ranking.
¶ https://github.com/binghong-ml/retro_star.
This journal is © The Royal Society of Chemistry 2022