Chonghuan Zhang‡ ab, Qianghua Lin‡ b, Chenxi Yang c, Yaxian Kong c, Zhunzhun Yu* b and Kuangbiao Liao* ab
a Guangzhou Municipal and Guangdong Provincial Key Laboratory of Molecular Target & Clinical Pharmacology, The NMPA and State Key Laboratory of Respiratory Disease, School of Pharmaceutical Sciences, Guangzhou Laboratory, Guangzhou Medical University, Guangzhou, Guangdong, PR China 511436. E-mail: yu_zhunzhun@gzlab.ac.cn; liao_kuangbiao@gzlab.ac.cn
b Guangzhou National Laboratory, No. 9 Xingdaohuanbei Road, Guangzhou International Bio Island, Guangzhou, Guangdong, PR China 510005
c AIChemEco Inc., Guangzhou, Guangdong, PR China 510005
First published on 5th June 2025
Amide coupling is an important reaction widely applied in medicinal chemistry. However, condition recommendation remains a challenging issue due to the broad condition space. Recently, accurate condition recommendation via machine learning has emerged as a novel and efficient method to find suitable conditions to achieve the desired transformations. Nonetheless, accurately predicting yields is challenging due to the complex relationships involved. Herein, we present our strategy to address this problem. Two steps were taken to ensure the quality of the dataset. First, we selected a diverse and representative set of substrates to capture a broad spectrum of substrate structures and reaction conditions using an unbiased machine-based sampling approach. Second, experiments were conducted using our in-house high-throughput experimentation (HTE) platform to minimize the influence of human factors. Additionally, we proposed an intermediate knowledge-embedded strategy to enhance the model's robustness. The performance of the model was first evaluated at three different levels—random split, partial substrate novelty, and full substrate novelty. All model metrics in these cases improved dramatically, achieving an R2 of 0.89, MAE of 6.1%, and RMSE of 8.0% in the full substrate novelty test dataset. Moreover, the generalization of our strategy was assessed using external datasets from reported literature, delivering an R2 of 0.71, MAE of 7%, and RMSE of 10%. Meanwhile, the model could recommend suitable conditions for some reactions to elevate the reaction yields. Besides, the model was able to identify which reaction in a reaction pair with a reactivity cliff had a higher yield. In summary, our research demonstrated the feasibility of achieving accurate yield predictions through the combination of HTE and embedding intermediate knowledge into the model. This approach also has the potential to facilitate other related machine learning tasks.
To address these challenges, researchers have increasingly turned to machine learning (ML) models to enhance reaction yield predictions and streamline the selection of optimal reaction conditions. Recent advancements highlight the transformative impact of ML in this domain.6–9 Notably, Doyle et al. pioneered the use of Random Forest algorithms in predicting yields for C–N cross-coupling reactions, showcasing ML's capability to significantly reduce experimental workloads and expedite the discovery process.10 Building on this foundation, the Sigman and Denmark groups have extended these predictive models to encompass broader reaction scopes, thereby increasing their robustness and applicability across diverse chemical landscapes.11,12 These efforts represent a substantial leap forward in the practical deployment of ML, moving beyond proof-of-concept to tools that offer real-world utility. Schwaller et al. introduced an approach using Bidirectional Encoder Representations from Transformers (BERT) neural networks13 to predict yield based on textual descriptions of chemical reactions, which leverages natural language processing (NLP) algorithms to interpret reaction transformation outcomes.14
The first amide coupling reaction yield prediction model by Isayev et al. utilized literature-based reactions curated from Reaxys15 to build predictive models, but highlighted the inherent difficulties of using such data. Literature reactions often suffer from inconsistencies in reporting, variability in experimental conditions, and a lack of comprehensive datasets, making it challenging to build robust and generalizable models.16 Additionally, literature sources typically report only successful reactions with high yields, neglecting low-yield and negative data that are crucial for creating well-distributed and accurate predictive models.17 It is essential to curate relevant datasets for model development and to identify and control factors that complicate yield prediction. The variability in data sources, reaction scales, and structural diversity reported in the literature further complicates the development of reliable models.18 Sigman et al. explored a related challenge in predicting reaction rates for amide coupling using linear free-energy relationships, emphasizing the importance of understanding underlying reactivity trends. Although their work focused on rates rather than yields, the overarching goal of providing actionable insights for the synthetic community is similar to our presented work.19 Very recently, Doyle et al. further explored the optimization of reaction conditions through the bandit optimization technique to efficiently navigate the vast chemical space, balancing the exploration of new conditions with the exploitation of known successful ones.20 This approach is particularly powerful for the case study amide coupling reaction, where multiple variables—such as reagent, solvent, and temperature—must be optimized simultaneously. 
By using bandit optimization, the researchers were able to significantly reduce the number of experimental trials needed to identify optimal conditions, showcasing yet another example of how ML can accelerate and enhance the process of chemical discovery. However, for new substrate pairs, this approach does not directly provide yields under different conditions; the reactions must still be performed following the corresponding workflow, which differs from the role of a yield prediction model in recommending conditions.
High-throughput experimentation (HTE)21 has emerged as a powerful alternative to traditional literature-based approaches for building reaction yield prediction models.22–24 HTE techniques generate large datasets through automated, parallelized experiments, offering a more consistent, comprehensive, and controlled data source with a broader range of reaction conditions, including low-yield and negative outcomes. This systematic approach helps in developing more robust and generalizable models. With our in-house automated HTE platform, we have successfully optimized reaction conditions, explored the reaction space, and collected standardized experimental data for machine learning studies, resulting in a series of related publications.25–31
Despite these advantages, many HTE-based models achieve high accuracy but are limited to a narrow range of substrate and reaction condition spaces. This limitation has been well documented by several research groups, including those led by Doyle, Sigman, and Denmark.6,32 Additionally, a common issue with these models is the evaluation methodology. Often, models are tested using data splits that include test substrates seen by the model in the training set, resulting in overly optimistic performance metrics. However, when evaluated using a strict test set—where the model must predict yields for entirely new combinations of substrates—the performance typically drops. This strict testing better reflects real-world applications where chemists need to predict reaction yields for novel substrate pairs. Therefore, creating a reaction dataset with diversified substrates and conditions, implementing rigorous testing protocols and curating relevant datasets are crucial for developing reliable and accurate predictive models. This issue has been widely recognized in the literature, with prior studies highlighting the pitfalls of inadequate dataset partitioning.6,32 Recent discussions in the chemical engineering field further emphasize the necessity of rigorous testing protocols to avoid overfitting and ensure model generalizability.33 The flaws in evaluation methodologies have been acknowledged in various domains, including C–N coupling yield predictions, where flaws in dataset partitioning have been publicly debated and addressed.10,34,35
In this context, we aim to build a high-quality dataset on amide coupling and develop a high-performance yield prediction model that can accurately recommend optimal conditions for novel substrate pairs that are absent from the training dataset. In this work, we first demonstrate our efforts to prepare the dataset. We selected substrate pairs according to structures reported in the USPTO reaction dataset36 and a virtual space of commercially available compounds to ensure potential application and structural diversity. Our method employs a machine-based sampling approach to systematically explore the chemical space of 70000 virtual compounds, complemented by a small number of manually selected substrates to ensure diversity and practicality. Second, our in-house HTE platform was utilized to collect data, incorporating control strategies to improve reproducibility: duplicate conditions to detect variability, repeated plates for consistency checks, and internal standards for accurate yield measurement. With the dataset in hand, we then focused on developing a robust prediction model. Given the challenge of developing a robust model under 95 conditions, we transformed our goal into an iterative prediction task across the list of 95 conditions. Meanwhile, intermediate knowledge was embedded into the model to enhance its performance. The distinguishing feature of this strategy is that the model does not need to learn the relationships among different conditions while still retaining condition information, which is likely to give better performance. The results of a series of studies revealed that the generalization ability of the model could be significantly improved after applying this strategy (Fig. 1). The model performed well on a fully unseen test set from the literature, achieving an R2 of 0.71, MAE of 7%, and RMSE of 10%. 
Meanwhile, the model could recommend more suitable conditions for some reactions with low yield, indicating the potential application of our work. Additionally, our strategy achieved satisfactory prediction results for reaction pairs with reactivity cliffs, delivering an accuracy of 0.73 in binary classification.
We first used the USPTO 50k reaction dataset,36 encompassing 50000 synthetic reactions derived from published US patents, to compile a comprehensive dataset focusing on amide coupling reactions. To achieve this, as shown in Fig. 2a, we first composed a reaction template in SMiles ARbitrary Target Specification (SMARTS) syntax,37 following the general equation of the amide coupling reaction. We used RDKit38 to filter amide coupling reactions from the USPTO dataset, identifying 11 663 entries of amide coupling reactions.
The product SMILES strings were then converted into extended connectivity fingerprints (1024-bit ECFP)39 with a radius of 2, serving as numerical representations of the molecular structures. To manage the high-dimensional nature of the Morgan fingerprints and facilitate analysis, we employed Principal Component Analysis (PCA),40 an unsupervised learning technique, to reduce the dimensionality of the data while preserving the variance inherent in the molecular descriptors. The data were reduced to a two-dimensional space primarily for visualization purposes and to observe the overall distribution of amide coupling reactions within the chemical space. It is important to note that this dimensionality reduction was not used for clustering but rather for visualization. Although PCA was chosen for this purpose, other dimensionality reduction techniques, such as t-distributed stochastic neighbor embedding (t-SNE)41 and uniform manifold approximation and projection (UMAP),42 which are often better at preserving distances in high-dimensional spaces,20,43,44 were also tested as alternatives to PCA. However, the t-SNE and UMAP visualizations both exhibit a globular structure, indicating that neither provides as uniform a distribution as PCA does, as shown in Fig. S19 and S20 of the ESI.†
The derived USPTO amide coupling dataset is notable for its open-source availability, enabling reproducibility by others. We focus on the biological activity and practical applications of the product space, as shown by the grey scatters in Fig. 2b, underscoring their potential significance in various fields. However, given that many USPTO substrates are derived from patented molecules, synthesizing them can be difficult. We could only use the USPTO space to calibrate the size of the virtual product space. To develop a virtual product space for the purpose of strategic HTE substrate selection, we focused on buyable substrates. We identified amines and carboxylic acids from our in-house commercial molecule database, curated from various chemical providers. Due to the reliance on DMF as the solvent in our HTE workflow, substrates with poor solubility in DMF were systematically excluded from the dataset. While this exclusion ensures experimental feasibility, it may limit the chemical diversity of certain subclasses of acids or amines. A virtual product space of 71000 products was developed. The yellow, red and green scatters in Fig. 2b cover most of the virtual product space, with products outside the edge of the USPTO space excluded, as described in the “Virtual compound filtering” section.
The use of 10 strata provided sufficient granularity to ensure a comprehensive representation of chemical diversity while avoiding selection bias. This stratified approach also helped ensure that compounds selected for the final HTE library were diverse in their chemical features. Some of the sampled products were discarded because their high price or complex structures made them unsuitable for quantitative analysis via NMR. The distribution of these discarded virtual products is illustrated by the yellow scatters in Fig. 2b.
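The stratified selection described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that strata are equal-width bins along the first coordinate of a 2-D projection and that a fixed number of products is drawn from each bin; the function name `stratified_sample`, the binning rule, and the toy points are hypothetical and are not taken from the authors' code.

```python
import random
from collections import defaultdict

def stratified_sample(points, n_strata=10, per_stratum=2, seed=0):
    """Bin 2-D points into equal-width strata along the first axis
    (an assumed stratification rule) and sample from every stratum."""
    rng = random.Random(seed)
    xs = [p[0] for p in points]
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_strata or 1.0  # guard against identical points
    strata = defaultdict(list)
    for idx, (x, _y) in enumerate(points):
        # Clamp so that the maximum value falls into the last stratum.
        k = min(int((x - lo) / width), n_strata - 1)
        strata[k].append(idx)
    chosen = []
    for k in sorted(strata):
        members = strata[k]
        chosen.extend(rng.sample(members, min(per_stratum, len(members))))
    return chosen, strata

# Toy stand-in for a projected virtual product space.
rng = random.Random(42)
points = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(500)]
selected, strata = stratified_sample(points)
```

Drawing from every stratum, rather than from the pooled set, keeps sparsely populated regions of the space represented in the selection.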
Using this approach, we selected 447 HTE products from the virtual product space, as shown in the “sampling-selected HTE products” of Fig. 2b. Glorius et al.17 emphasized that robust models generally require a dataset comprising at least 500 substrate combinations and sufficient diversity. We also manually supplemented 186 products with additional compounds that were cost-effective and structurally appropriate, as shown in the “expert-selected HTE products” of Fig. 2b. The expert-selected HTE products cover some extent of the space that was not addressed by the machine-selected region. Eventually, this yields a final dataset of 632 products corresponding to 632 unique substrate pairs, which include a total of 70 amines and 66 acids. This final selection of a subset of the original virtual compound space was designed to capture the chemical diversity from the entire virtual library. Our approach aimed to enhance representativeness while ensuring practicality for high-throughput experimentation.
The comparison of these chemical spaces indicates that our self-developed HTE reactions encompass a breadth of chemical coverage comparable to that of the USPTO space, which is recognized for its extensive coverage in reaction modeling. Although the USPTO space does not capture the entire chemical space historically explored by chemists,46 it represents a robust and comprehensive starting point due to its open-access nature. This facilitates the replication of our study and underscores that our substrate selection strategies are aligned with practical applications, which would benefit the development of a robust model.
Fig. 3 Schematics of the multi-conditional model workflow, using methylamine reacting with acetic acid as an example.
First, we generated 1024-bit ECFP descriptors from the SMILES strings of reactants and products. These ECFP descriptors, with 1024 bits and a radius of 2, capture the structural features of the molecules involved in the reactions. The fingerprints of the two substrates and the product of a reaction were either concatenated to create a 3072-bit vector or added point-wise to keep a size of 1024, giving the reaction fingerprint. Each of the different reaction conditions was encoded as a unique integer (1–95). We then used these descriptors to train the XGBoost,47 SVM,48 Random Forest,49 and AutoGluon50 models. Each model was fine-tuned to optimize hyperparameters, ensuring the best performance for yield prediction. AutoGluon, a robust ensemble model, combines the strengths of various machine learning algorithms to improve predictive performance and model robustness.52 Fig. 3 uses the modelling workflow of AutoGluon to exemplify the machine learning approaches.
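The two ways of building the reaction fingerprint, and the integer condition label, can be sketched as follows. Because this sketch avoids external dependencies, `toy_fingerprint` is a hypothetical stand-in that hashes SMILES substrings; it is not ECFP, which the authors generate from molecular structures. Appending the condition integer to the feature vector is likewise an assumption about the layout.

```python
import zlib

N_BITS = 1024

def toy_fingerprint(smiles, n_bits=N_BITS, max_len=3):
    """Stand-in for a 1024-bit ECFP: hash SMILES substrings into bits.
    (Illustrative only; a real ECFP hashes atom environments.)"""
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            bits[zlib.crc32(fragment.encode()) % n_bits] = 1
    return bits

def reaction_vector(amine, acid, product, condition_id, mode="concat"):
    """Build reaction features from the three fingerprints plus a condition label."""
    fps = [toy_fingerprint(s) for s in (amine, acid, product)]
    if mode == "concat":
        features = fps[0] + fps[1] + fps[2]              # 3 x 1024 = 3072
    elif mode == "add":
        features = [a + b + c for a, b, c in zip(*fps)]  # stays at 1024
    else:
        raise ValueError("mode must be 'concat' or 'add'")
    if not 1 <= condition_id <= 95:
        raise ValueError("condition_id must be in 1..95")
    return features + [condition_id]                     # condition as an integer

# Methylamine + acetic acid -> N-methylacetamide, under condition 1.
concat_vec = reaction_vector("CN", "CC(=O)O", "CNC(C)=O", 1, mode="concat")
added_vec = reaction_vector("CN", "CC(=O)O", "CNC(C)=O", 1, mode="add")
```

Concatenation keeps track of which molecule each bit came from at the cost of a longer vector, whereas point-wise addition keeps the size at 1024 but merges the three sources.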
In parallel, we implemented the deep learning Yield-BERT14 and T5-Chem51 models. Yield-BERT, based on the BERT architecture,13 was trained on reaction SMILES to predict yields by learning the sequence relationships within the reaction data. Similarly, T5-Chem, a variant of the T5 transformer model,53 was also trained to capture contextual information from reaction SMILES strings, enabling it to predict yields by considering the entire reaction context. To do this, we first tokenized the reaction SMILES strings, converting them into a sequence of tokens that represent the individual components of the molecules. These tokens were then fed through the transformer, which includes multiple layers of attention mechanisms and feed-forward neural networks. Next, we incorporated the categorical features of reaction conditions into the model. These categorical features were combined with the text features output by the transformer through a combining module, which integrated the encoded textual information from the SMILES strings with the reaction condition data. Finally, the combined features were passed through fully connected layers to predict the reaction yield. These layers consisted of several dense neural network layers that progressively refined the combined features into a single output value representing the predicted yield. In the above methods, we also tried using fingerprints or reaction SMILES to represent the reaction conditions, but these did not represent the conditions well. Since we do not intend to predict outside these conditions, we opted for a categorical encoding approach to maintain clarity and consistency.
For the assessment of HTE-based reaction models, it is important to evaluate the model's performance in a way that reflects real-world applications. While the conventional approach involves using a random split to build a test dataset, recent studies have highlighted that this method can result in overly optimistic performance metrics and may not accurately reflect the challenges faced by chemists in practice. For instance, Doyle and co-workers have argued that random splits allow models to benefit from familiar substrate combinations that might appear in both training and test sets, thereby inflating performance metrics.6 Similarly, Denmark et al. emphasized that random splitting does not represent real-world scenarios where chemists often encounter novel substrate pairs, leading to a significant drop in model performance when faced with external validation.12,32 These findings underscore the importance of more stringent evaluation methods, such as using partially and fully external test sets.
In line with these recommendations, our study adopted a more rigorous assessment strategy by developing three levels of test sets, as shown in Fig. 4. The “random split” involves randomly dividing the dataset into training and test sets, and while it ensures exposure to a broad range of substrates and conditions, it may still present overly optimistic results. To address this, we also created a “partial substrate novelty” test set, which excludes any test cases where both substrates were seen during training. This approach ensures that at least one novel substrate is present, offering a more challenging and realistic evaluation of the model's predictive capabilities. Finally, the “full substrate novelty” test set consists entirely of new substrate combinations that the model has not encountered during training, providing the most rigorous assessment of its generalizability. These three levels of testing—random split, partial substrate novelty, and full substrate novelty—offer a comprehensive framework to evaluate the model's robustness and applicability in real-world chemical spaces, aligning with the best practices recommended in recent literature.
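A minimal sketch of the three test-set levels is given below. It assumes that a set of amines and acids is held out up front, that "partial substrate novelty" means exactly one substrate of the pair is held out, and that "full substrate novelty" means both are held out; the function name, the 20% random fraction, and the toy substrate labels are hypothetical.

```python
import random

def three_level_split(reactions, held_out_amines, held_out_acids,
                      random_frac=0.2, seed=0):
    """Partition reactions into train / random / partial / full novelty sets."""
    seen, partial, full = [], [], []
    for rxn in reactions:
        new_amine = rxn["amine"] in held_out_amines
        new_acid = rxn["acid"] in held_out_acids
        if new_amine and new_acid:
            full.append(rxn)        # neither substrate appears in training
        elif new_amine or new_acid:
            partial.append(rxn)     # exactly one substrate is novel
        else:
            seen.append(rxn)        # both substrates are available for training
    rng = random.Random(seed)
    rng.shuffle(seen)
    n_test = int(len(seen) * random_frac)
    return {
        "train": seen[n_test:],
        "random": seen[:n_test],    # random split drawn from familiar substrates
        "partial": partial,
        "full": full,
    }

amines = [f"amine_{i}" for i in range(6)]
acids = [f"acid_{j}" for j in range(6)]
reactions = [{"amine": a, "acid": c} for a in amines for c in acids]
splits = three_level_split(reactions, {"amine_4", "amine_5"}, {"acid_5"})
```

Because the held-out substrates are removed before the random split is drawn, no training reaction can leak a substrate into the partial or full novelty sets.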
The results indicate that models performed better on the random split and partial substrate novelty test sets compared to the full substrate novelty test set. This could be inferred from the lower MAE and RMSE values and higher R2 values for the first two splits (shown in Table 1). These findings align with our expectations that models trained on datasets where they are exposed to a broad range of substrates and reaction conditions perform better on familiar substrates, but their performance drops when predicting yields for entirely new substrate pairs.
Test data split | Metrics | XGBoost | SVM | RF | AutoGluon | Yield-BERT | T5-Chem |
---|---|---|---|---|---|---|---|
Random split | R2 | 0.32 | 0.25 | 0.35 | 0.55 | 0.66 | 0.53 |
Random split | MAE | 18% | 19% | 17% | 15% | 10% | 16% |
Random split | RMSE | 22% | 23% | 21% | 20% | 15% | 22% |
Partial substrate novelty | R2 | 0.26 | 0.23 | 0.26 | 0.66 | 0.68 | 0.58 |
Partial substrate novelty | MAE | 4% | 16% | 14% | 13% | 14% | 20% |
Partial substrate novelty | RMSE | 20% | 21% | 19% | 18% | 10% | 15% |
Full substrate novelty | R2 | 0.25 | 0.22 | 0.26 | 0.42 | 0.63 | 0.58 |
Full substrate novelty | MAE | 20% | 22% | 19% | 17% | 15% | 22% |
Full substrate novelty | RMSE | 24% | 27% | 23% | 22% | 11% | 17% |
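The three metrics reported in Table 1 have standard definitions, which the following dependency-free sketch computes. The yields in the example are made-up numbers, and the function name is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R2, MAE, RMSE) for paired observed and predicted yields."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                    # coefficient of determination
    return r2, mae, rmse

# Toy yields in percent (illustrative values only).
observed = [10.0, 40.0, 60.0, 90.0]
predicted = [20.0, 35.0, 70.0, 85.0]
r2, mae, rmse = regression_metrics(observed, predicted)
```

For any set of predictions, MAE cannot exceed RMSE, which gives a quick sanity check when reading metric tables.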
In the random split test dataset, AutoGluon and Yield-BERT achieved the best results, with R2 values of 0.55 and 0.66, respectively. These models outperformed SVM, Random Forest, T5-Chem, and XGBoost in terms of predictive accuracy. Notably, Yield-BERT consistently demonstrated strong performance, retaining a relatively high R2 of 0.63 even on the full substrate novelty dataset, followed by T5-Chem with an R2 of 0.58 and AutoGluon at 0.42. This suggests that transformer-based models such as Yield-BERT and T5-Chem, along with ensemble methods such as AutoGluon, exhibit greater robustness and generalizability across varying substrate combinations. To further assess the novelty of the test sets, we quantified molecular similarity using the Tanimoto coefficient. A pairwise comparison of product molecules in the training and random split test datasets yielded an average similarity of 0.20, indicating considerable structural diversity between the two sets. Full details of this calculation are provided in Section S3.7 of the ESI.†
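The Tanimoto coefficient used above is the ratio of shared on-bits to the total on-bits of two fingerprints. The sketch below assumes that the reported average is taken over all train/test product pairs; the exact protocol is in the ESI, and the bit sets here are toy values rather than real ECFP output.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as on-bit index sets."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def mean_cross_similarity(train_fps, test_fps):
    """Average Tanimoto similarity over every (train, test) fingerprint pair."""
    pairs = [(tr, te) for tr in train_fps for te in test_fps]
    return sum(tanimoto(tr, te) for tr, te in pairs) / len(pairs)

# Toy on-bit sets standing in for product fingerprints.
train_fps = [{1, 2, 3, 4}, {2, 3, 5}]
test_fps = [{3, 4, 6}]
average_similarity = mean_cross_similarity(train_fps, test_fps)
```

A low average indicates that test products share few substructure bits with training products.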
We could transform the yield prediction under multiple conditions into an iterative prediction within a condition list, a method we termed selected condition model prediction. In this approach, all reaction data within a single model were generated under the same set of conditions, thus eliminating reaction contexts such as condensation reagents, catalysts, bases, and solvents. This allowed the model to focus solely on the relationship between substrates and products, leading to improved learning and predictive accuracy. However, a significant challenge with this method is the potential loss of critical reaction condition information. Since reaction conditions play a crucial role in determining the outcome of chemical reactions, ignoring them can lead to incomplete models that do not accurately reflect real-world scenarios. To address this issue, we incorporated intermediate information based on reaction mechanisms into our model.
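The iterative prediction over a condition list can be sketched as a loop over one model per condition, followed by picking the condition with the highest predicted yield. The per-condition "models" below are hypothetical constant functions used only to make the sketch runnable; in practice each would be a trained single-condition predictor.

```python
from typing import Callable, Dict, Tuple

# A single-condition model maps (amine SMILES, acid SMILES) to a predicted yield.
Model = Callable[[str, str], float]

def recommend_condition(models: Dict[int, Model], amine: str,
                        acid: str) -> Tuple[int, Dict[int, float]]:
    """Query every single-condition model and return the best condition id."""
    predictions = {cond: model(amine, acid) for cond, model in models.items()}
    best = max(predictions, key=predictions.get)
    return best, predictions

# Hypothetical stand-ins for trained models, keyed by condition number.
toy_models: Dict[int, Model] = {
    1: lambda amine, acid: 72.0,
    6: lambda amine, acid: 65.0,
    13: lambda amine, acid: 81.0,
}
best_condition, predicted_yields = recommend_condition(toy_models, "CN", "CC(=O)O")
```

Since each model sees data from one condition only, none of them has to learn how conditions relate to one another; the comparison happens after prediction.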
To evaluate our concept, we chose six different conditions with various coupling reagents that are frequently used in the literature and have well-defined intermediates. For the condition selection, we performed a statistical analysis of the literature-reported amide coupling reactions curated from Reaxys.15 We identified the 25 most frequently used conditions, as shown in Table S11 of the ESI,† and selected six for model development and assessment, as shown in Table 2. These conditions were chosen based on their prevalence in the dataset, ensuring that our HTE conditions were both representative and relevant to a wide range of amide coupling reactions. For the selected condition models, we prepared the data with the same rigor as for the multi-condition models, using three distinct test sets: random split, partial substrate novelty, and full substrate novelty.
Condition # | Activation reagent | Additive | Base | Solvent |
---|---|---|---|---|
1 | HATU | | DIPEA | DMF |
6 | TBTU | | DIPEA | DMF |
13 | EDC·HCl | HOBt | DIPEA | DMF |
21 | HBTU | | DIPEA | DMF |
34 | PyBOP | | DIPEA | DMF |
79 | DCC | HOBt | | DMF |
To simplify the reaction system in multi-condition amide coupling reaction modeling, we developed multiple single-condition models for the selected conditions by removing the condition variables, as mentioned above. Meanwhile, we incorporated intermediate information based on reaction mechanisms by using reaction SMARTS templates to represent the formation of activated acid intermediates. For example, with HATU as the condensation reagent, the transformation of an acid to its activated intermediate was represented using the template shown in Fig. 5(a).
Fig. 5 (a) Transformation of an acid into an intermediate SMARTS pattern using HATU as the activation reagent, and (b) schematics of the selected single-conditional model workflow.
This template converts the acid into the activated acid. We applied specific SMARTS templates for all six conditions, which are detailed in the code repository. Next, we added the intermediate information into the reaction contexts, allowing the model to learn the effect of intermediates on the reaction outcome. To generate descriptors, we experimented with three approaches for generating the reaction context for the selected condition model, using the following patterns:
(1) No intermediate.
(2) Amine + acid + intermediate → amide.
(3) Amine + intermediate → amide.
As shown in Fig. 5(b), the reaction contexts were vectorized into ECFP fingerprints and also converted into reaction SMILES, effectively capturing the structural features of the reactants and products, along with crucial intermediate information. This approach ensured that the model considered the essential reaction conditions indirectly through the intermediate representation. The ECFP fingerprints and reaction SMILES were then used to train the selected condition models using similar machine learning and deep learning algorithms, respectively, as those employed for multi-condition models (Fig. 3). However, in this case, the reaction conditions were no longer concatenated with the reaction vector. We employed the same rigorous testing protocols as used for the multi-condition models, evaluating performance across random split, partial substrate novelty, and full substrate novelty datasets.
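The three reaction-context patterns can be written out as reaction SMILES with a small helper. This sketch assumes that pattern (1) is simply amine + acid → amide, and it takes the intermediate SMILES as an input rather than deriving it from a SMARTS template. The OAt active ester used for the HATU example is hand-written for illustration and may differ from the intermediate encoded in the authors' templates.

```python
def build_reaction_smiles(amine, acid, intermediate, amide, pattern):
    """Assemble a reaction SMILES string for one of the three context patterns."""
    if pattern == "no_intermediate":
        reactants = [amine, acid]                 # (1) assumed: amine + acid
    elif pattern == "amine_acid_intermediate":
        reactants = [amine, acid, intermediate]   # (2)
    elif pattern == "amine_intermediate":
        reactants = [amine, intermediate]         # (3)
    else:
        raise ValueError(f"unknown pattern: {pattern}")
    return ".".join(reactants) + ">>" + amide

# Methylamine + acetic acid, as in the Fig. 3 example.
amine, acid, amide = "CN", "CC(=O)O", "CNC(C)=O"
# Hypothetical activated acid for HATU (OAt ester), written by hand.
intermediate = "CC(=O)On1nnc2cccnc21"

contexts = {
    name: build_reaction_smiles(amine, acid, intermediate, amide, name)
    for name in ("no_intermediate", "amine_acid_intermediate", "amine_intermediate")
}
```

In pattern (3) the acid appears only through its activated form, so the condition-dependent part of the input is carried entirely by the intermediate.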
Our results reveal that the BERT model trained on the random split dataset usually delivered superior performance, characterized by lower mean absolute error (MAE) and root mean squared error (RMSE), alongside higher R2 values. This trend indicates that fewer variables enhance model accuracy. Moreover, descriptors incorporating intermediate information indeed enhanced performance. Specifically, under HATU and TBTU conditions, R2 values rose from 0.69 and 0.71 to 0.86 and 0.84, respectively, with corresponding decreases in errors, underscoring the efficacy of our intermediate strategy. This robust performance of intermediate-inclusive descriptors persisted even in the full substrate novelty dataset, where the BERT model retained R2 values between 0.67 and 0.89 across the six reaction conditions, albeit with reductions for some conditions (Table 3). Among the intermediate-inclusive descriptors, the amine + intermediate approach usually outperformed the amine + acid + intermediate strategy across all reaction conditions when using the BERT algorithm (more metric details can be found in Table S13 of the ESI†). This observation aligns with the reaction mechanism, where amines and acids form intermediates before converting to products. Since our intermediate is represented as an activated acid, it already encapsulates the acid information, making the amine + acid + intermediate descriptors redundant. Consequently, the more precise amine + intermediate descriptors yield better results by avoiding redundant information and focusing on the critical reaction components. To ensure data consistency, we performed 5-fold cross-validation on the randomly split datasets, as detailed in Section 3.9 of the ESI.† However, 5-fold cross-validation was not applicable to test sets with partial or full substrate novelty due to dataset constraints; these were evaluated using a single data split. The 5-fold cross-validation results, which are illustrated through scatter plots in Section 3.9 of the ESI† comparing model predictions to actual yields, validate the reliability of our random split data and further demonstrate the performance of our model.
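For reference, a 5-fold cross-validation over a randomly shuffled dataset can be sketched as below. The function name, the seed, and the sample count are hypothetical; the authors' exact procedure is described in the ESI.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

cv_splits = list(k_fold_indices(23))
```

Each sample is used for testing exactly once. Substrate-novelty test sets do not fit this scheme directly, because rotating the folds would return held-out substrates to the training data.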
Test data split | Metrics | TBTU | HATU | PyBOP | DCC | HBTU | EDC |
---|---|---|---|---|---|---|---|
Random split | R2 | 0.84 (0.71) | 0.86 (0.69) | 0.90 (0.80) | 0.86 (0.80) | 0.89 (0.83) | 0.89 (0.82) |
Random split | RMSE | 10% (13%) | 9% (14%) | 8% (11%) | 9% (11%) | 9% (11%) | 8% (11%) |
Random split | MAE | 7% (10%) | 6% (10%) | 5% (8%) | 7% (8%) | 6% (8%) | 6.1% (7%) |
Partial substrate novelty | R2 | 0.77 (0.57) | 0.78 (0.53) | 0.82 (0.63) | 0.81 (0.74) | 0.86 (0.72) | 0.88 (0.79) |
Partial substrate novelty | RMSE | 12% (16%) | 12% (17%) | 10% (14%) | 11% (13%) | 9% (13%) | 9% (12%) |
Partial substrate novelty | MAE | 8% (12%) | 8% (13%) | 7% (11%) | 8% (9%) | 7% (10%) | 6% (8%) |
Full substrate novelty | R2 | 0.85 (0.66) | 0.84 (0.39) | 0.89 (0.40) | 0.67 (0.1) | 0.83 (0.68) | 0.75 (0.46) |
Full substrate novelty | RMSE | 9% (13%) | 7% (14%) | 8% (18%) | 7% (12%) | 10% (14%) | 14% (18%) |
Full substrate novelty | MAE | 7% (11%) | 6% (11%) | 6% (12%) | 5% (10%) | 7% (8%) | 11% (13%) |
In contrast, the XGBoost algorithm's performance lagged behind the multi-condition model, and the inclusion of intermediate descriptors did not enhance the results but instead led to marginal declines (details shown in Table S13 of the ESI†). This discrepancy between BERT and XGBoost is likely attributable to algorithmic differences. XGBoost, a classical machine learning algorithm, excels in learning simple reactions but struggles with the complexity added by intermediate descriptors. In contrast, the deep learning-based BERT model thrives on this additional complexity, leveraging it to improve predictive accuracy. Besides XGBoost, we also investigated the performance of other algorithms after intermediate knowledge was embedded into the model. However, no improvement was obtained in any of these cases (details shown in Table S13 of the ESI†).
By incorporating intermediate information, our selected condition models demonstrated significantly improved performance. The intermediate-powered model achieved an R2 of 0.86, compared to an R2 of 0.69 for the model without intermediate incorporation. This innovative strategy not only enhanced model accuracy but also provided a balanced approach that integrates condition-specific data with broader chemical knowledge, ultimately improving the robustness and generalizability of yield predictions for amide coupling reactions. This comprehensive approach ensures that our selected condition models are capable of accurately predicting reaction yields while considering the crucial role of reaction conditions through intermediate representations, thereby providing reliable and practical tools for chemists in optimizing amide coupling reactions. Additionally, when we encoded the reaction conditions using a one-hot approach, the model's performance significantly decreased, underscoring the conclusion that this method lacks meaningful chemical information (details as shown in Section S3.10 of the ESI†), and indicating the importance of intermediate knowledge (details as shown in Section S3.8 of the ESI†).
Having identified the power of intermediate knowledge embedded in a model, we next examined whether our strategy would also work when data from all six conditions were combined into one dataset. Indeed, the BERT model improved with intermediate knowledge, but the gain was smaller than that observed in selected condition predictions, especially for the full novelty split, as shown in Table 4. This may be because the model must learn the relationships among the six conditions, and the dataset is insufficient for it to do so effectively.
| Intermediate information | Test data split | R² | RMSE | MAE |
|---|---|---|---|---|
| Without intermediate knowledge | Random split | 0.77 | 12% | 9% |
| | Partial novelty | 0.71 | 14% | 10% |
| | Full novelty | 0.62 | 10% | 8% |
| With embedded intermediate knowledge | Random split | 0.85 | 10% | 7% |
| | Partial novelty | 0.80 | 11% | 8% |
| | Full novelty | 0.65 | 9% | 8% |
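For reference, the three metrics reported in Table 4 can be computed as in the sketch below. It assumes the standard definitions of R², RMSE, and MAE (this section does not state the formulas), and the yields are illustrative numbers, not data from the study.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, and MAE for paired observed and predicted yields (in %)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
    }


# Illustrative yields only (percent); not taken from the paper.
observed = [10.0, 40.0, 70.0, 90.0]
predicted = [20.0, 40.0, 60.0, 90.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Because R² is normalized by the variance of the observed yields in each test set, it can fall between splits even when RMSE and MAE do not rise, so the three columns are best read together.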
In summary, intermediate-inclusive descriptors yield better results not only in selected condition predictions but also in multi-condition predictions. Without intermediate descriptors, the model erroneously assumes that reactions depend solely on substrates, ignoring the significant impact of reaction conditions. This mistaken assumption explains why models without intermediate descriptors perform well in the random split and known single-substrate datasets but experience sharp performance drops in the full novelty substrate dataset.
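The evaluation levels referred to above can be illustrated with a short sketch. It assumes that partial novelty means exactly one of the two substrates appears in the training set and that full novelty means neither does; the function name and substrate labels are hypothetical.

```python
def novelty_level(acid, amine, train_acids, train_amines):
    """Classify a test reaction by how many of its substrates were seen in training.

    Assumed definitions (not confirmed by this section):
      both substrates seen -> "known substrates"
      exactly one seen     -> "partial novelty"
      neither seen         -> "full novelty"
    """
    seen = int(acid in train_acids) + int(amine in train_amines)
    return {2: "known substrates", 1: "partial novelty", 0: "full novelty"}[seen]


# Hypothetical substrate identifiers for illustration.
train_acids = {"acid_1", "acid_2"}
train_amines = {"amine_1", "amine_2"}

levels = [
    novelty_level("acid_1", "amine_2", train_acids, train_amines),
    novelty_level("acid_1", "amine_9", train_acids, train_amines),
    novelty_level("acid_9", "amine_9", train_acids, train_amines),
]
print(levels)
```

Under this reading, a model that keys only on substrate identity has nothing to fall back on in the full novelty case, which is consistent with the performance drop described above.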
A series of functional groups, such as alkyne (5, 11, 12), azide (2, 14), hydroxyl (11, 16), halide (4, 6, 8, 9, 18), carboxyl (7), phosphonate (16), aldehyde (6), and others, were tolerated by our prediction model, indicating its broad applicability. For some substrates, the amidation transformations were challenging, with complex relationships due to chemoselectivity arising from special functional groups, such as amine (1, 7), hydroxyl (11, 16), and carboxyl groups (7). Thus, achieving accurate yield predictions for these transformations was difficult. Nevertheless, the model provided rather accurate predictions, suggesting that it effectively learned the complex structure–yield relationships. Our model also performed well with heterocycles (4, 7, 9, 15, 18) and the sulfonamide group (12), which are commonly found in drugs. Moreover, the model appeared to have no bias toward yield distribution during prediction. Given the structural diversity and highly accurate predictions for the aforementioned reactions, the model appears to have achieved a considerable balance between sensitivity and robustness (Fig. 7).
The studies above showed that embedding intermediate knowledge can greatly improve the generalization ability of the yield prediction model. We were therefore particularly interested in whether the model could recommend suitable conditions for low-yielding reactions to improve their yields. With this question in mind, we selected 5 reactions with yields below 40% from the above 257 reactions, all of whose substrates were commercially available. We then used the prediction model to predict the yields of these 5 reactions under the selected 6 conditions and repeated the experiments under the condition with the highest predicted yield. As shown in Fig. 8, the yields of compounds 20, 21, and 22 all improved dramatically. Although the yield of compound 23 did not increase under the top one condition, it improved significantly under the top two conditions. More results can be found in Tables S18 and S38 of the ESI.† These results indicate that our model can recommend appropriate reaction conditions for some reactions, helping chemists to synthesize the corresponding amide compounds in higher yields.
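The recommendation step described above amounts to scoring each candidate condition with the yield model and ranking the results. The sketch below assumes a generic `predict_yield(reaction, condition)` callable standing in for the trained model; the reaction string, condition labels, and predicted values are invented for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def rank_conditions(
    reaction: str,
    conditions: Sequence[str],
    predict_yield: Callable[[str, str], float],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (condition, predicted yield) pairs, best first."""
    scored = [(c, predict_yield(reaction, c)) for c in conditions]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical stand-in for the trained model: made-up predicted yields (%).
_TOY_PREDICTIONS = {
    "cond_A": 35.0, "cond_B": 72.0, "cond_C": 58.0,
    "cond_D": 12.0, "cond_E": 64.0, "cond_F": 41.0,
}


def toy_predict_yield(reaction: str, condition: str) -> float:
    """Ignore the reaction and look up a fixed toy prediction."""
    return _TOY_PREDICTIONS[condition]


# Placeholder reaction string; a real input would encode the actual substrates.
recommended = rank_conditions(
    "acid.amine>>amide", list(_TOY_PREDICTIONS), toy_predict_yield
)
print(recommended)
```

Keeping the top two candidates rather than only the best one mirrors the case of compound 23, where the improvement was found among the top two conditions.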
Fig. 8 The literature-reported yields of five selected external literature reactions vs. experimental yields from model-recommended conditions.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5sc03364k
‡ These authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2025