Yash Chainani‡ ab, Zhuofu Ni‡ ab, Kevin M. Shebek ab, Linda J. Broadbelt ab and Keith E. J. Tyo *ab
a Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, USA. E-mail: k-tyo@northwestern.edu
b Center for Synthetic Biology, Northwestern University, Evanston, IL, USA
First published on 2nd November 2024
Retrobiosynthesis tools harness the inherent promiscuities of enzymes for the de novo design of novel biosynthetic pathways to key small molecules. Many existing pathway search algorithms rely on exhaustively enumerating the space of all possible enzymatic reactions using generalized rules, followed by an extensive analysis of the ensuing reaction network to extract candidate pathways for experimental validation. While this approach is comprehensive, many false positive reactions are often generated given the permissiveness of such reaction rules. Here, we have developed DORA-XGB, an enzymatic reaction feasibility classifier. DORA-XGB can be used within our DORAnet framework to assess whether newly enumerated enzymatic reactions and pathways would be feasible. To curate a training dataset for our model, we extracted enzymatic reactions from public databases and screened them for their general thermodynamic feasibility. We then considered alternate reaction centers on known substrates to strategically generate infeasible reactions with high confidence, thereby circumventing the lack of negative data in the literature. In training our model, we also experimented with various molecular fingerprinting techniques and configurations for assembling reaction fingerprints, taking into account not just primary substrate and primary product structures, but cofactor structures as well. Our model's utility is demonstrated through favorable benchmarking against a previously published classifier, the successful recovery of newly published reactions, and the ranking of previously predicted pathways for the biosynthesis of propionic acid from pyruvate.
Design, System, Application
Retrobiosynthesis tools aid in elucidating novel pathways for the sustainable biomanufacturing of small molecules. Such tools, however, may suggest many false positive reactions that are far too dissimilar from the canonical reaction(s) that a given enzyme is known to catalyze, thereby demanding unrealistic extents of enzyme promiscuity. Here, we aim to reduce false positive predictions and enhance the accuracy of retrobiosynthesis tools by developing a machine learning model to reliably predict the feasibility of proposed enzymatic reactions. In designing this model, we innovated around the lack of infeasible reactions in the literature by introducing the concept of “alternate reaction centers”. These are functional groups that, despite being identical to the catalyzed moiety on a substrate, remain untransformed in a reported reaction. Our novel hypothesis enables us to strategically infer infeasible reactions from known positive reactions with higher confidence than previous approaches, which assume any unseen reaction to be infeasible. After synthetically generating negative data from known reactions, we trained a supervised learning classifier and optimized it via a Bayesian hyperparameter optimization approach. Our model can be instantly dropped into pathway discovery workflows and even further improved upon in the future by incorporating additional features, such as enzyme sequence data.
To this end, retrobiosynthesis tools can accelerate the de novo design of biosynthetic pathways to key products in an automated manner that circumvents costly trial-and-error based experiments.9–14 Such tools, including our in-house platform, DORAnet (formerly Pickaxe v2.0 (ref. 15)), typically use reaction rules or templates to recursively transform simple precursors, such as glucose or glycerol, into downstream metabolites of interest. These reaction rules, in turn, digitally encode for the potential promiscuities of enzymes by searching for substructure matches between native and non-native, predicted substrates.16,17 Despite their comprehensiveness, rule-based algorithms can generate many false positive reactions demanding unrealistic extents of enzyme promiscuity.
Throughout this work, we define positive reactions as those in which a moiety or reaction center, e.g., a carboxylic acid group, on a substrate successfully undergoes an enzyme-catalyzed reaction, such as decarboxylation. Consequently, false positives refer to predicted reactions within which a moiety that is known not to be transformed, thus representing a negative reaction, is incorrectly transformed by a reaction rule. Such false positives can occur if the substructure match required by a given rule only spans a small chemical neighborhood around a substrate's reaction center. For instance, our previously published JN1224MIN generalized rules16 predict enzyme promiscuity by considering only the reactive moieties present on a substrate and not its surrounding chemical groups, which may still influence catalysis due to their steric or electron donating/withdrawing effects. Our subsequently upgraded intermediate rules (available at https://github.com/tyo-nu/MINE-Database) incorporate some chemical context around reaction centers but still result in large metabolic in silico network expansions (MINEs).
Although such large networks ensure that the space of all possible reactions has been adequately explored, the high false positive rate arising from the permissive nature of reaction rules often results in far more pathways than can be thoroughly analyzed. This impedes the selection of promising pathways for experimental validation by users. DORAnet and other retrobiosynthesis tools would therefore benefit from the development of an automated reaction feasibility filter to elucidate only the most feasible and realistic of reactions suggested within a network expansion. While a variety of chemical similarity and molecular weight filters already exist within DORAnet to prune MINEs on-the-fly, these do not evaluate the feasibility of reactions generated and cannot be used to rank pathways once a network has been created.
To develop a filter or a model that can classify predicted reactions as feasible or infeasible, both positive and negative examples are needed. While thousands of observed reactions have already been recorded in BRENDA,5 KEGG,4 and MetaCyc,6 data for infeasible reactions are rare. A common approach in overcoming this lack of negative data is to assume unreported reactions as negative. This assumption can certainly aid in synthetically generating negative examples but has two key drawbacks. First, assuming that any unreported reaction must necessarily be infeasible can lead to generating negative reactions that are so strongly dissimilar to known, positive reactions (Fig. 1(a)) that the resulting classifier will suffer from a high degree of uncertainty in trying to predict the feasibilities of reactions with more intermediate degrees of similarity. The ideal distribution of positive and negative reactions should instead meaningfully delineate the boundary of reaction feasibility, allowing the sampling of negative data from more confident infeasible reactions only while positive data is sampled from known reactions. This would enable a classifier trained on such examples to also generalize well to real-world reactions wherein positive and negative examples are not so dissimilar. Moreover, the ‘unreported is negative’ assumption may lead to mislabelling understudied reactions that could very well be feasible, but because they have yet to be studied and/or published, are labeled as infeasible (Fig. 1(a)). Such mislabelling would introduce false negatives in a training set and also defeats the purpose of retrobiosynthesis tools, which will inevitably suggest new reactions. The explicit recovery of unreported profiles using methods such as collaborative filtering has been discussed in the literature,18,19 but the current space of substrates for any given enzyme is so limited that these methods may not apply to most classes of enzymatic reactions. 
Still, if negative reaction data can be reliably obtained, many artificial intelligence algorithms have been shown to be effective at demarcating complex decision boundaries in binary classification tasks across a diversity of domains, from predicting antimalarial bioactivity20 to the segmentation of coal mining faces.21
Here, we address this lack of negative data in the literature by proposing the stricter “alternate reaction center” assumption, which enabled us to strategically and more confidently infer negative reactions that more closely resemble positive reactions (Fig. 1(b)). We then used the resulting dataset to train a supervised reaction feasibility classifier. We define a reaction center as the group of atoms in a substrate that directly participates in a reaction. Rather than treating all unreported reactions as negative datapoints, we posit that if an enzyme is observed to catalyze the transformation of a particular chemical moiety on a substrate but not that of other, identical moieties (alternate reaction centers) on the same substrate, then the transformation of those other, identical moieties represents products that could have also been formed in the same reaction but, since they were not observed, are infeasible products (Fig. 1(c) and (d)). In curating positive reactions from metabolic databases, we also screened such reactions for their thermodynamic feasibility across a range of metabolite concentrations.22 Since reported reactions are often a part of broader pathways that may only be specific to certain organisms, our thermodynamic screen elucidated which reactions would truly be most feasible in a diversity of contexts. The combination of this thermodynamic screen and our proposed “alternate reaction center” assumption created a dataset to train our classifier to evaluate reaction feasibility as a function of both reaction thermodynamics and enzyme specificity. To assess the applicability of our classifier, we tested DORA-XGB across various use-cases. Our model was found to achieve a high recall on newly discovered MetaCyc6 and EcoCyc23 reactions and also outperformed another published, deep-learning based feasibility classifier24 when benchmarked against these new reactions.
Moreover, DORA-XGB was able to distinguish between feasible and infeasible reactions when tested on a high-throughput metabolomics dataset.25 Finally, we implemented our feasibility classifier as a filter in the design of propionic acid biosynthesis pathways26 and were able to achieve greater than 95% reduction of infeasible compounds and reactions while still preserving meaningful predictions of the most promising pathways towards propionic acid. Altogether, we demonstrate that our enzymatic reaction feasibility classifier is generalizable across different classes of reactions and enhances the computational prediction of enzyme promiscuity towards various applications.
Here, we utilized DORAnet to synthetically generate a total of 116412 unique infeasible reactions by considering alternate reaction centers (Fig. 2(b)). These synthetically generated infeasible reactions were then pooled together with reactions previously found to be thermodynamically infeasible to give a total of 122573 negative reactions and 16642 positive reactions (ESI† Section 3.3 and Fig. S8(a)–(c)). For curated and thermodynamically feasible monosubstrate reactions of the form A + cofactors → B + cofactors, all possible products beyond B are enumerated by expanding on A using only the general rule/s onto which this monosubstrate reaction had been mapped. For curated and thermodynamically feasible multisubstrate reactions of the form A + B + cofactors → C + D + cofactors, all possible products beyond (C, D) product pairs are iteratively enumerated first from A and then from B using only the general rule/s onto which this multisubstrate reaction has been mapped. Expanding on substrates only with mapped JN1224MIN rule/s follows from our “alternate reaction center” assumption.
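The labelling step implied by the “alternate reaction center” assumption can be sketched as follows. This is a minimal illustration and not the DORAnet implementation: the function name `label_enumerated_products` is hypothetical, products are compared as plain strings (which assumes canonicalized SMILES), and the diol example is purely illustrative rather than a claim about any specific enzyme.

```python
from typing import Dict, List


def label_enumerated_products(
    reported_product: str,
    enumerated_products: List[str],
) -> Dict[str, int]:
    """Label products enumerated from one substrate with its mapped rule(s).

    The product reported in the database is labelled feasible (1). Every
    other product, formed by the same rule acting on an alternate reaction
    center, is labelled infeasible (0).
    """
    labels: Dict[str, int] = {}
    for product in enumerated_products:
        labels[product] = 1 if product == reported_product else 0
    # Keep the reported product even if the enumeration did not return it.
    labels.setdefault(reported_product, 1)
    return labels


# Hypothetical example: propane-1,2-diol carries two alcohol groups.
# Suppose only oxidation of the secondary alcohol is reported.
reported = "CC(=O)CO"                    # hydroxyacetone
enumerated = ["CC(=O)CO", "CC(O)C=O"]    # both products of one oxidation rule
labels = label_enumerated_products(reported, enumerated)
```

For multisubstrate reactions, the same labelling would be applied to product pairs enumerated first from A and then from B, as described above.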
By contrast, to generate the “unreported is negative” dataset, we utilized DORAnet and all 1224 reaction rules to expand upon each curated substrate/s and consequently enumerate the entire space of products. This resulted in a huge space of more than 11 million negative reactions. To directly compare models trained on datasets generated under each assumption, we randomly down-sampled 116412 negative reactions from these 11 million negative reactions. All tautomeric forms of compounds involved in reactions were comprehensively enumerated as part of our dataset.
Alongside attempting various molecular fingerprinting techniques, we experimented with four different methods to arrange these fingerprints for the assembly of reaction feature vectors. Briefly, we explored arranging reactant and product fingerprints in the order of ascending as well as descending molecular weights and through simple operations, such as the element-wise addition and concatenation of reactant and product fingerprints (“add then concatenate”) or the element-wise subtraction of the sum of product fingerprints from that of reactant fingerprints (“add then subtract”). We tested these configurations to determine the pattern of fingerprints in our reaction vectors that would yield the highest predictive performance. Since different reactions involve different numbers of species, to ensure uniformity in reaction feature vector length, we zero-padded shorter vectors to the length of the longest reaction vector in our dataset for each fingerprinting technique and configuration.
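The two arithmetic configurations and the zero-padding step can be illustrated with toy fingerprints. The sketch below uses 4-bit lists in place of real molecular fingerprints, and all function names are hypothetical; it is meant only to make the vector shapes concrete.

```python
from typing import List

Fingerprint = List[int]


def add_fingerprints(fps: List[Fingerprint]) -> Fingerprint:
    """Element-wise sum of equal-length fingerprints."""
    return [sum(bits) for bits in zip(*fps)]


def add_then_concatenate(reactants: List[Fingerprint],
                         products: List[Fingerprint]) -> Fingerprint:
    """Sum each side of the reaction, then join the two sums end to end."""
    return add_fingerprints(reactants) + add_fingerprints(products)


def add_then_subtract(reactants: List[Fingerprint],
                      products: List[Fingerprint]) -> Fingerprint:
    """Subtract the summed product fingerprint from the summed reactant one."""
    r_sum = add_fingerprints(reactants)
    p_sum = add_fingerprints(products)
    return [r - p for r, p in zip(r_sum, p_sum)]


def concatenate_and_pad(fps: List[Fingerprint], max_species: int) -> Fingerprint:
    """Concatenate per-species fingerprints in the given order, then zero-pad
    to the length of a reaction with `max_species` participants."""
    n_bits = len(fps[0])
    flat = [bit for fp in fps for bit in fp]
    return flat + [0] * (max_species * n_bits - len(flat))


# Toy 4-bit fingerprints: two reactants, one product.
reactants = [[1, 0, 1, 0], [0, 1, 1, 0]]
products = [[1, 1, 0, 0]]

concat_vec = add_then_concatenate(reactants, products)
diff_vec = add_then_subtract(reactants, products)
padded_vec = concatenate_and_pad(reactants + products, max_species=4)
```

Note that the two arithmetic configurations give fixed-length vectors regardless of the number of species, whereas the ordered concatenations are the ones that need padding.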
The hyperparameters of this alcohol dehydrogenase classifier were optimized on its validation set via a Bayesian optimization procedure38 with the objective of maximizing the classifier's area under its precision-recall curve (AUPRC). We opted for a Bayesian approach to tuning model hyperparameters because the large size of our final dataset necessitated an efficient and targeted search of hyperparameter space. Other approaches to hyperparameter tuning such as a grid-search or random-search would be far too exhaustive and also less effective given that in these searches, information from the model's previous performance is not used to inform the choice of hyperparameters in the next iteration.39 A Bayesian approach therefore allowed us to most efficiently balance exploration of new hyperparameter combinations with the exploitation of successful ones so as to reach optimal model hyperparameters in fewer iterations. We downloaded and used the algorithm that was freely available at https://www.github.com/bayesian-optimization/BayesianOptimization in order to perform our optimization. All regularization terms, where applicable, were included in our tuning procedure for each model so as to mitigate over-fitting. In training this classifier, reaction fingerprints are created by arranging molecular fingerprints in the order [substrate, NAD, product, NADH] for alcohol dehydrogenase reactions in the oxidation direction and [substrate, NADH, product, NAD] for reactions in the reduction direction. After this initial prototyping phase, ECFP4 fingerprints and an XGBoost model were chosen as the preferred fingerprint-architecture combination for training future models (Fig. 3).
For all classifiers, stratified train/validation/test sets were created from corresponding reaction data using an 80/10/10 split ratio. The hyperparameters of all classifiers were also optimized using the Bayesian approach described above for consistency. Moreover, for the final dataset of 139215 reactions, stratified train/validation/test splits were performed iteratively on a rule-by-rule basis for each family of reactions that had been mapped to at least one generalizable rule under JN1224MIN. This guarantees the presence of each generalized transformation across all three sets. In creating these splits, unit tests were performed to ensure that any duplicate reaction fingerprints are removed and consequently, that there is no leakage of reaction fingerprints between all three sets.
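A rule-by-rule stratified split with duplicate removal and a leakage check might look like the sketch below. It is an assumption-laden illustration, not the authors' code: reactions are represented as hypothetical `(fingerprint_key, rule_id, label)` tuples, and stratification is done on each (rule, label) group.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Reaction = Tuple[str, str, int]  # (fingerprint key, rule id, label)


def stratified_split(reactions: List[Reaction], seed: int = 0,
                     ratios: Tuple[float, float, float] = (0.8, 0.1, 0.1)
                     ) -> Dict[str, List[Reaction]]:
    """Split reactions 80/10/10 within every (rule, label) group."""
    rng = random.Random(seed)
    # Drop duplicate fingerprints so one vector cannot land in two sets.
    unique: Dict[str, Reaction] = {}
    for fp, rule, label in reactions:
        unique.setdefault(fp, (fp, rule, label))
    # Group by rule and label so every rule appears in all three sets.
    groups: Dict[Tuple[str, int], List[Reaction]] = defaultdict(list)
    for rxn in unique.values():
        groups[(rxn[1], rxn[2])].append(rxn)
    splits: Dict[str, List[Reaction]] = {"train": [], "val": [], "test": []}
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        n_train = int(round(ratios[0] * len(members)))
        n_val = int(round(ratios[1] * len(members)))
        splits["train"] += members[:n_train]
        splits["val"] += members[n_train:n_train + n_val]
        splits["test"] += members[n_train + n_val:]
    return splits


# Hypothetical data: two rules, two labels, ten reactions each, one duplicate.
reactions = [(f"fp_{rule}_{label}_{i}", rule, label)
             for rule in ("rule0001", "rule0002")
             for label in (0, 1)
             for i in range(10)]
reactions.append(reactions[0])

splits = stratified_split(reactions)
train_keys = {r[0] for r in splits["train"]}
val_keys = {r[0] for r in splits["val"]}
test_keys = {r[0] for r in splits["test"]}
```

Groups with very few members would need special handling (for example, a minimum of one reaction per set), which this sketch does not attempt.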
Since DeepRFC was trained on monosubstrate reactions only, even though our classifier is able to make predictions on multisubstrate reactions, only monosubstrate reactions were retained in our benchmarking set to enable a fair comparison between models. Overall, 1281 newly reported monosubstrate reactions could be confidently labelled for their thermodynamic feasibility and were used for benchmarking. For feasibility classification of a given reaction by DeepRFC, we used their stipulated threshold of 0.32 (calculated by subtracting half the predicted standard deviation from the predicted mean).
Finally, we mined an E. coli metabolomics dataset, which utilized a nontargeted approach to enable high-throughput identification of novel, underground metabolic reactions. 2799 accurate masses were identified in the metabolite cocktail of their experimental setup, and we were able to assign structures for 2578 masses that correspond to 737 unique metabolites by matching the reported metabolite names with those listed in the ModelSEED Biochemistry database (for which compounds are available at https://modelseed.org/biochem/compounds). In their work, 30 novel, unique enzymatic reactions had been experimentally discovered by 12 novel enzymes whose new functions were experimentally validated. Expanding from the cocktail of metabolites, we enumerated 16796 monosubstrate reactions that could have been catalyzed but were never observed or were not present in known EcoCyc reactions and thus could be plausibly treated as infeasible test reactions.
Here, we instead propose a novel and more strategic way of inferring negative reactions from known, positive reactions. We denote our approach to synthetically generating such negative data as the “alternate reaction center” assumption. While the “unreported is negative” assumption samples negative examples from the space of all reactions outside the corpus of reported reactions, our “alternate reaction center” assumption only samples negative examples from a smaller space of reactions for which infeasibility can be more confidently established (Fig. 1(b)). Our assumption is put into practice by considering metabolites with two or more identical molecular moieties, of which an enzyme is known to transform only one and not the other(s) (Fig. 1(c) and (d)). This assumption is informed by the fact that many enzymes known to catalyze a given functional group transformation have already been validated on such metabolites. These enzymes, whether native or engineered, could have possessed a certain degree of promiscuity to catalyze the same transformation on other identical reaction centers within the same substrate. Given that no instances of reactions were observed on these alternate reaction centers, however, we can assume the conditions of such enzymatic transformations to be less favorable and confidently categorize such reactions as infeasible. Further, examining alcohol dehydrogenase reactions through dimensionality reduction techniques reveals that putative infeasible products are evenly distributed amongst feasible products, thereby indicating a uniform sampling of chemical space, free of any biases (ESI† Fig. S8(a)).
First, our curated reaction set comprises reactions in both directions. This follows from our previously published JN1224MIN rule set also comprising bidirectional operators, enabling retrobiosynthesis tools, such as DORAnet, the flexibility of being used in both the forward and reverse synthesis directions. The presence of such bidirectional rules is true for other rule sets published in the literature as well.17,42 While every enzymatic reaction is microscopically reversible, not all reactions are macroscopically reversible under typical cellular conditions. A thermodynamic screen can therefore quantitatively determine which set of functional group transformations and their associated reactions are most biochemically “realistic” and energetically favorable within general cellular contexts. For instance, alcohol dehydrogenase reactions that fall under the 1.1.1.x enzyme classification (EC) number are often favorable in both the oxidation and reduction directions (ESI† Fig. S3). Monooxygenation reactions (EC 1.14.13.x), by contrast, are only favorable in the direction of oxygen consumption (ESI† Fig. S2), given the extremely high energetic barrier that needs to be overcome to form NADH and oxygen as products in the reverse monooxygenation direction. Further, instead of the threshold that is commonly invoked, our less permissive thermodynamic feasibility threshold allows us to truly treat each reaction as independent of its reverse. This is because, at our threshold, 96.5% of the flux in a given reaction is carried in the forward direction only, owing to the flux–force efficacy relationship (see Materials and methods 2.2–2.3).
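The link between a free-energy threshold and the share of flux carried forward can be made concrete with a short calculation. The sketch below assumes the standard flux–force relationship, ΔG′ = −RT ln(J+/J−), and interprets the forward share as J+/(J+ + J−); both the interpretation and the temperature of 298.15 K are assumptions, and the threshold values used in the paper are not reproduced here.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol K)


def forward_flux_fraction(delta_g_kj_per_mol: float,
                          temperature_k: float = 298.15) -> float:
    """Return J+ / (J+ + J-) assuming dG' = -RT ln(J+ / J-)."""
    ratio = math.exp(-delta_g_kj_per_mol / (R_KJ_PER_MOL_K * temperature_k))
    return ratio / (1.0 + ratio)


def delta_g_for_fraction(fraction: float,
                         temperature_k: float = 298.15) -> float:
    """Invert the relationship: the dG' (kJ/mol) giving a forward fraction."""
    return -R_KJ_PER_MOL_K * temperature_k * math.log(fraction / (1.0 - fraction))


# Free energy at which 96.5% of the flux would run forward, under the
# assumptions stated above.
dg_at_965 = delta_g_for_fraction(0.965)
print(f"dG' for 96.5% forward flux: {dg_at_965:.2f} kJ/mol")
```

At ΔG′ = 0 the same function returns 0.5, i.e. forward and reverse fluxes are equal, which is why a threshold of zero cannot separate a reaction from its reverse.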
Finally, our thermodynamic screen is necessary since the presence of a reported reaction in a metabolic database does not simply guarantee its general feasibility. Often, databases report enzymatic reactions within the context of a broader pathway in a specific organism, wherein multiple factors, such as enzyme concentrations,22 cellular compartmentalization,43 energy coupling to other exothermic reactions,44,45 and metabolic channeling46–48 may help to drive a reaction forward. Given the difficulty of simultaneously accounting for all of these factors, our stricter bound of thermodynamic feasibility can help determine if a reported reaction would truly be feasible outside of the context in which it was reported. Consequently, our DORA-XGB classifier is able to evaluate the feasibility of novel reactions as a function of both enzyme specificity and reaction thermodynamics (Fig. 2).
The AUPRC score is our metric of choice in this work given the considerable imbalance between positive and negative reactions in our dataset. With imbalanced data, accuracy is a misleading metric for model performance since any classifier that predicts negative by default would be largely correct anyway.49 Meanwhile, precision and recall are better at identifying the minority class but need to be evaluated at specified thresholds. AUPRC, by contrast, calculates the trade-off between precision and recall at all possible thresholds and does not overvalue negative datapoints, unlike the area under the receiver operating characteristic curve. AUPRC scores on both the validation data and the test data for alcohol dehydrogenase reactions showed similar trends (ESI† Section 4.1 and Fig. S9(a) and (b)) – hashed fingerprints (ECFP4, atom pair, MAP4) perform better than descriptor-based ones (MACCS, Mordred) for all machine learning architectures. This could be attributed to the automated and unbiased nature of hashed fingerprints accounting for diverse structural information. Considering hashed fingerprints only, all architectures are found to perform comparably well, but tree-based ensemble architectures were the fastest to train given their parallelizability. XGBoost in particular comes built in with both L1 and L2 regularization, enabling it to better mitigate overfitting and generalize well beyond the training data.37
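To illustrate why AUPRC is preferred over accuracy on imbalanced data, the sketch below computes the area under the precision-recall curve as a step-wise average-precision sum. This is one common estimator, chosen here as an assumption; the exact estimator used in the study is not specified in this section, and the labels and scores are invented toy values.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve.

    Sums precision at each distinct score threshold, weighted by the
    increase in recall gained at that threshold.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    prev_recall = 0.0
    area = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before updating the curve.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Toy imbalanced data: 2 positives, 6 negatives.
labels = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

auprc = average_precision(labels, scores)
# A classifier that always predicts "negative" is 75% accurate here
# while recovering none of the positives.
always_negative_accuracy = labels.count(0) / len(labels)
```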
As a result, we chose XGBoost and ECFP4 fingerprints as our final machine learning architecture-fingerprint combination. ECFP4 fingerprints are valuable because they highlight local molecular substructures within a set diameter. Atom pair fingerprints, meanwhile, are able to highlight more distant pairs of atoms. Both fingerprints were found to perform well when we prototyped the alcohol dehydrogenase classifier (ESI† Fig. S9(a) and (b)) and could have been used to provide equally important insights into the feasibility of novel reactions. Other studies have also used a combination of both fingerprints to represent chemical space in an unbiased manner.34 To create reaction feature vectors for this initial study, all participating molecular structures were first converted to ECFP4 fingerprints with 2048 bits. Then, molecular fingerprints are arranged in the order [substrate, NAD, product, NADH] for alcohol dehydrogenase reactions in the oxidation direction and [substrate, NADH, product, NAD] for reactions in the reduction direction.
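The direction-dependent ordering of substrate, product, and cofactor fingerprints can be sketched as follows. Toy 4-bit lists stand in for the 2048-bit ECFP4 fingerprints, and the function name and the bit patterns are hypothetical.

```python
from typing import Dict, List

Fingerprint = List[int]


def adh_reaction_fingerprint(substrate_fp: Fingerprint,
                             product_fp: Fingerprint,
                             cofactor_fps: Dict[str, Fingerprint],
                             direction: str) -> Fingerprint:
    """Concatenate fingerprints as [substrate, NAD, product, NADH] for
    oxidation and [substrate, NADH, product, NAD] for reduction."""
    if direction == "oxidation":
        order = [substrate_fp, cofactor_fps["NAD"], product_fp, cofactor_fps["NADH"]]
    elif direction == "reduction":
        order = [substrate_fp, cofactor_fps["NADH"], product_fp, cofactor_fps["NAD"]]
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return [bit for fp in order for bit in fp]


# Toy fingerprints (illustrative bit patterns only).
cofactors = {"NAD": [1, 1, 0, 0], "NADH": [1, 1, 1, 0]}
alcohol_fp = [0, 0, 0, 1]
ketone_fp = [0, 0, 1, 1]

oxidation_vec = adh_reaction_fingerprint(alcohol_fp, ketone_fp, cofactors, "oxidation")
reduction_vec = adh_reaction_fingerprint(ketone_fp, alcohol_fp, cofactors, "reduction")
```

With real 2048-bit fingerprints, each such reaction vector would have 4 × 2048 = 8192 entries.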
In order to analyze reactions with multiple substrates or products, decisions have to be made on how to combine the feature vectors for each molecule on the left and right hand side of the reaction before they are fed to XGBoost. Four configurations in which to arrange molecular fingerprints along reaction feature vectors were used in this study (ESI† Section 4.2 and Fig. S10). Briefly, these involved arranging fingerprints in ascending and descending molecular weights of the corresponding species as well as through simple element-wise addition or subtraction of molecular fingerprints. The “add then concatenate” fingerprint approach described earlier (see Materials and methods 2.6) yielded nominally better results; therefore, we used the “add then concatenate” approach going forward.
With this reaction fingerprint configuration, we considered whether to build multiple individual classifiers specific to each class of enzymatic transformation or to build a single, consolidated model that would be generalizable to all reaction classes. Since 64.3% of our curated reactions are covered by 33 distinct types of enzymatic transformations, we built an individual XGBoost classifier for each of these 33 transformation types and then compared the performance of these models against a single, consolidated XGBoost model trained simultaneously on the data from all 33 transformation types. The overall AUPRC score of our consolidated model of 0.92 was found to exceed the average AUPRC of 0.86 across 33 individual models (ESI† Fig. S12). This indicates that the diversity in training data and its quantity are instrumental to the performance of our consolidated classifier. Further, given the overhead required in separately training and applying 33 different models based on an input reaction class, we chose to build a single, comprehensive classifier applicable to all reaction types.
Our consolidated model performs well with a high AUPRC score of 0.92, precision of 0.81, recall of 0.87, and F1 of 0.84 (Fig. 4) when these metrics were computed on our test set. The recall, precision, and F1 scores reported were calculated at an optimum threshold of 0.593. This threshold was in turn determined by considering 100 linearly spaced thresholds between 0 and 1 and then selecting the threshold that yielded the highest F1 score.
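The threshold selection procedure described above can be sketched in a few lines. The labels and scores are invented toy values, the function names are hypothetical, and treating a score equal to the threshold as a positive prediction is an assumption.

```python
from typing import List, Sequence


def f1_at_threshold(labels: Sequence[int], scores: Sequence[float],
                    threshold: float) -> float:
    """F1 score when scores >= threshold are predicted feasible."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(labels: Sequence[int], scores: Sequence[float],
                      n_thresholds: int = 100) -> float:
    """Scan linearly spaced thresholds in [0, 1]; return the best by F1.

    Ties are resolved in favor of the lowest threshold.
    """
    grid: List[float] = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    return max(grid, key=lambda t: f1_at_threshold(labels, scores, t))


# Toy validation data (hypothetical).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.60, 0.30, 0.10]

best_threshold = best_f1_threshold(labels, scores)
best_f1 = f1_at_threshold(labels, scores, best_threshold)
```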
Similar to the processing of our training set, these newly reported reactions were considered in both directions and screened for thermodynamic feasibility. On all 1281 newly-reported monosubstrate reactions for which thermodynamic feasibility could be confidently established, DORA-XGB outperforms another published deep learning method, DeepRFC (Fig. 5). Negative reactions for DeepRFC's training were generated with the “unreported is negative” assumption. The higher performance of DORA-XGB against this other classifier as well as our in-house “unreported is negative” dataset (ESI† Section 4.6 and Fig. S14) thus underscores the utility of our “alternate reaction center” hypothesis.
Interestingly, the performance of DORA-XGB across all metrics drops between our test set and this external benchmarking set. This decline has two related causes. First, the distribution of enzymatic transformations in our benchmarking set is inherently different from that in the training, validation, and testing sets. This is expected, given that a benchmarking set typically represents an out-of-distribution sample. Second, and crucially, since reactions in the benchmarking set were published later, there is a bias towards transformations that are rarely seen in the training, validation, and testing sets. Given that DORA-XGB had fewer opportunities to confront such rarer transformations during training, its performance understandably declines in benchmarking (ESI† Section 4.9, Fig. S16 and S17).
We then also used DORA-XGB to predict underground metabolism in an E. coli nontargeted metabolomics dataset.25 Promiscuous enzymatic activities often lead to underground metabolism, i.e., undocumented reactions or those without enzyme annotations, which exists even in well-studied organisms. While current genome scale models fail to capture the entirety of such metabolism, computational resources such as our MINE database (publicly available at https://minedatabase.ci.northwestern.edu) have been developed to exhaustively enumerate possible reactions in E. coli. Distinguishing true positive novel reactions from the enormous space of computationally generated reactions, however, remains a challenge. In this E. coli metabolomics experimental setup, 30 newly discovered monosubstrate reactions were observed and validated as enzyme concentrations were tuned.25 Since these enzymes were extensively tested on native metabolites in E. coli, any undiscovered reactions that could have occurred on these metabolites that follow from the above transformations were labelled as plausibly infeasible reactions. Thus, we generated a dataset consisting of reactions that are 1) feasible and known, 2) feasible and novel, and 3) plausibly infeasible. We applied our feasibility classifier to this dataset: the model recovered 28 out of 30 newly discovered feasible reactions (ESI† Table S1A) and predicted 12372 out of 16796 plausibly infeasible reactions as infeasible. Our recovery of novel feasible reactions indicates that the model can successfully retrieve experimentally validated underground reactions. For the prediction of infeasible reactions, despite lower performance than on the test data, the classifier could still assist efforts to filter out numerous implausible reactions and prioritize the most feasible underground reactions to be discovered with more extensive experiments.
Our updated DORAnet platform has implemented several on-the-fly filters to remove reactions in each generation that may lead farther away from the intended target molecule. One of these is a Tanimoto similarity filter with manually defined similarity cutoffs. With this, molecules that are too structurally dissimilar from the target are discarded after each generation. Such filters can improve pathway search efficiencies but do not speak to the feasibility of reactions that constitute a found pathway. Our reaction feasibility classifier could therefore be used to predict the feasibility of novel reactions and quickly filter out false positive reactions that demand unrealistic extents of enzyme promiscuity while still preserving molecules that can be feasibly reached after each generation. As such, our classifier enhances the confidence of newly predicted reactions while also improving the computational efficiency.
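For contrast with a feasibility classifier, a Tanimoto similarity filter of the kind described above can be sketched as below. Fingerprints are represented as sets of on-bit indices, the molecule names and bit sets are hypothetical, and returning 0.0 for two empty fingerprints is a convention chosen for this sketch; none of this reflects DORAnet's actual filter code.

```python
from typing import Dict, List, Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


def filter_by_similarity(candidates: Dict[str, Set[int]],
                         target_bits: Set[int],
                         cutoff: float) -> List[str]:
    """Keep molecules whose similarity to the target meets the cutoff."""
    return [name for name, bits in candidates.items()
            if tanimoto(bits, target_bits) >= cutoff]


# Hypothetical on-bit sets for a target and two intermediates.
target = {1, 2, 3, 4}
generation = {
    "close_intermediate": {1, 2, 3, 5},   # similarity 3/5 = 0.6
    "distant_intermediate": {7, 8, 9},    # similarity 0
}
kept = filter_by_similarity(generation, target, cutoff=0.5)
```

Such a filter says nothing about whether the reaction producing a retained molecule is enzymatically plausible, which is the gap the feasibility classifier is meant to fill.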
To test the utility of DORA-XGB in designing novel pathways, we deployed our classifier on our previously predicted biosynthetic pathways from pyruvate to propionic acid.26 We first reproduced the set of pathways that lead from pyruvate to acrylic acid in exactly three steps, since acrylic acid is an important precursor for propionic acid pathways and just one step away from the target (Fig. 6(a)). Starting from pyruvate, a network expansion using all 1224 generalized rules resulted in an enormous space of more than 750000 compounds and more than 1300000 reactions over three generations.
Using our classifier, pathways with any infeasible reactions below our set threshold were filtered out. As a result, 13 out of the 15 reactions in the original publication were predicted as feasible. Reassuringly, predicted reactions that had been reported in the literature and predicted reactions that were known from the literature to be catalyzed by promiscuous enzymes both received high feasibility scores. This resulted in three out of the five candidate pathways being predicted as feasible (Fig. 6(a)). In addition to discarding infeasible pathways, all pathways were also ranked with a net pathway feasibility score. This was computed by taking the product of all constituent reactions' feasibility scores (Fig. 6(b)) along a pathway. Users can also explore other ways to aggregate reaction feasibility scores, such as taking the average of reaction feasibility scores within a pathway.
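The filtering and ranking logic described above can be sketched as follows. The pathway names and reaction scores are invented for illustration, the threshold of 0.5 is arbitrary, and treating a score equal to the threshold as feasible is an assumption.

```python
import math
from typing import Dict, List, Sequence


def pathway_is_feasible(scores: Sequence[float], threshold: float) -> bool:
    """A pathway survives only if every reaction meets the threshold."""
    return all(score >= threshold for score in scores)


def pathway_score_product(scores: Sequence[float]) -> float:
    """Net pathway feasibility as the product of reaction scores."""
    return math.prod(scores)


def pathway_score_mean(scores: Sequence[float]) -> float:
    """Alternative aggregation: average of reaction scores."""
    return sum(scores) / len(scores)


def rank_pathways(pathways: Dict[str, List[float]], threshold: float) -> List[str]:
    """Discard infeasible pathways, then rank the rest by product score."""
    feasible = {name: scores for name, scores in pathways.items()
                if pathway_is_feasible(scores, threshold)}
    return sorted(feasible,
                  key=lambda name: pathway_score_product(feasible[name]),
                  reverse=True)


# Hypothetical three-step pathways with per-reaction feasibility scores.
pathways = {
    "pathway_1": [0.90, 0.80, 0.95],
    "pathway_2": [0.70, 0.90, 0.20],   # one reaction falls below 0.5
    "pathway_3": [0.99, 0.60, 0.90],
}
ranking = rank_pathways(pathways, threshold=0.5)
```

The product penalizes any single weak step more heavily than the mean does, so the two aggregations can order pathways differently.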
As our classifier threshold was varied, an overall 96.3% reduction of new compounds and 96.9% reduction of new reactions was achieved after three generations (ESI† Fig. S15(a) and (b)), which in turn led to significant gains in computational efficiency. This threshold can be customized by users to achieve their desired balance between precision and recall on novel reactions. A high classifier threshold results in more reactions being filtered out, thus improving precision at the expense of recall, and vice versa. We also tested the effect of varying the threshold by sweeping classifier thresholds from 0 to 1 in increments of 0.01 and recording the number of feasible reactions and compounds remaining, as well as the number of pathways still predicted as feasible. At thresholds as low as 0.06, DORA-XGB began discarding infeasible pathways, while at thresholds up to 0.84, DORA-XGB retained the most feasible pathways within the network before all pathways were classified as infeasible. Ultimately, researchers could take advantage of the tunable nature of DORA-XGB to strike a balance between comprehensiveness and runtime, and apply the feasibility classifier as an on-the-fly reaction filter and/or pathway ranking metric depending on the application.
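A threshold sweep of this kind can be sketched in a few lines. The reaction identifiers, scores, and pathways below are toy values chosen for illustration and do not reproduce the numbers reported above; the function name `sweep_thresholds` is likewise hypothetical.

```python
def sweep_thresholds(reaction_scores, pathways, n_steps=100):
    """Count surviving reactions and pathways at evenly spaced thresholds in [0, 1]."""
    rows = []
    for i in range(n_steps + 1):
        threshold = i / n_steps  # 0.00, 0.01, ..., 1.00 when n_steps is 100
        kept_reactions = {
            rxn for rxn, score in reaction_scores.items() if score >= threshold
        }
        kept_pathways = [
            path for path in pathways if all(rxn in kept_reactions for rxn in path)
        ]
        rows.append((threshold, len(kept_reactions), len(kept_pathways)))
    return rows


# Hypothetical feasibility scores and pathways built from those reactions.
reaction_scores = {"r1": 0.95, "r2": 0.90, "r3": 0.10, "r4": 0.85, "r5": 0.40}
pathways = [["r1", "r2"], ["r1", "r3"], ["r4", "r5"]]

rows = sweep_thresholds(reaction_scores, pathways)

# Lowest threshold at which at least one pathway is discarded.
first_drop = next(t for t, _, n_paths in rows if n_paths < len(pathways))
```

Because raising the threshold can only remove reactions, both counts are non-increasing across the sweep, which makes it straightforward to read off the range of thresholds over which a given number of pathways is retained.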
To train DORA-XGB, we required both positive and negative data. Although infeasible reactions are rarely published, we overcame this lack of negative data through our proposed “alternate reaction center” assumption, which allows infeasible reactions to be confidently inferred from known, positive ones. Our assumption involves examining reported substrates that bear multiple identical chemical moieties, of which only one is known to undergo enzymatic catalysis while the others are not. With this insight, we first screened reported reactions for their thermodynamic feasibility and then synthetically generated negative reactions from the thermodynamically feasible ones. Our publicly available DORAnet platform and JN1224MIN rule set were used to generate such negative reactions, which were then pooled together with known ones to create a comprehensive, high-quality training dataset.
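The labeling logic implied by the alternate reaction center assumption can be sketched as below. This is an illustration only: the function name and the placeholder product identifiers are hypothetical, the products are assumed to have been enumerated upstream by applying one rule at every matching site of a thermodynamically screened reaction, and the sketch does not reproduce the DORAnet API.

```python
def label_alternate_centers(substrate, enumerated_products, reported_products):
    """Label each enumerated product as positive (1) if reported, else negative (0).

    Identifiers are compared as plain strings, so real structures would first
    need to be canonicalized (an assumption of this sketch).
    """
    reported = set(reported_products)
    # Without a reported outcome, the true reaction center is unknown and
    # no negative can be inferred with confidence.
    if not reported & set(enumerated_products):
        return []
    return [
        (substrate, product, 1 if product in reported else 0)
        for product in enumerated_products
    ]


# Placeholder identifiers: one rule matched three equivalent sites on a
# substrate, but only the product formed at site 2 is reported.
examples = label_alternate_centers(
    "substrate_X",
    ["product_site_1", "product_site_2", "product_site_3"],
    ["product_site_2"],
)
```

Here one known reaction yields one positive and two inferred negative examples, which is how a dataset of negatives can be built from positive data alone.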
We subsequently demonstrated the effectiveness of our dataset and, consequently, the ability of DORA-XGB to predict the feasibility of reactions across various scenarios. Our model was able to identify novel feasible reactions when tested against a time-split reaction dataset and a nontargeted metabolomics dataset. It could also be integrated within DORAnet as a custom reaction filter, where infeasible reactions would be filtered out after each generation to accelerate pathway design and preserve only the most promising pathways for experimental validation.
To train DORA-XGB, we also explored a range of molecular fingerprinting methods to construct reaction fingerprints from molecular fingerprints. In doing so, we recognized that cofactors also mechanistically play a role in enzymatic reactions and therefore included their fingerprints alongside those of primary substrates and primary products. This enabled us to fully capture the various chemistries involved in biochemical transformations. We note, however, that while DORA-XGB can predict the generalized feasibility of biochemical reactions, it currently cannot predict how feasible it would be for a given enzyme to catalyze a query reaction. This is because DORA-XGB does not yet take any enzyme information, such as sequence or structure, into account. Future work may involve incorporating such information through sequence embeddings50,51 as well as featurizing molecules with message-passing graph neural networks52–54 rather than with molecular fingerprints so as to better capture the long-range interactions between various functional groups within molecules.
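One simple way to assemble a reaction fingerprint that includes cofactors is to concatenate the molecular fingerprints of each species. The sketch below shows that single configuration under stated assumptions: the ordering of the four blocks, the function name, and the 4-bit toy vectors are hypothetical, and this is not necessarily the configuration selected for DORA-XGB.

```python
def reaction_fingerprint(substrate_fp, lhs_cofactor_fp, product_fp, rhs_cofactor_fp):
    """Concatenate equal-length molecular fingerprints into one reaction fingerprint."""
    parts = [substrate_fp, lhs_cofactor_fp, product_fp, rhs_cofactor_fp]
    if len({len(part) for part in parts}) != 1:
        raise ValueError("all molecular fingerprints must share one length")
    return [bit for part in parts for bit in part]


# Toy 4-bit vectors standing in for real molecular fingerprints (hypothetical).
rxn_fp = reaction_fingerprint(
    substrate_fp=[1, 0, 1, 0],
    lhs_cofactor_fp=[0, 1, 1, 0],
    product_fp=[1, 1, 0, 0],
    rhs_cofactor_fp=[0, 1, 0, 1],
)
```

Concatenation keeps the contribution of each species in a fixed block of the feature vector, so a tree-based model can in principle split on substrate, product, and cofactor bits separately.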
Ultimately, our aim in this work was to provide a rigorous workflow for synthetically generating negative reaction data. Using the assumptions introduced in this work, we demonstrate that with reliable data, even relatively simple machine learning models can achieve good predictive performance. We have provided our classifier as an open-source tool on our lab GitHub page: https://github.com/tyo-nu/DORA_XGB.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4me00118d
‡ These authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2025