Bojana Ranković *a, Ryan-Rhys Griffiths b, Henry B. Moss c and Philippe Schwaller *a
aLaboratory of Artificial Chemical Intelligence (LIAC), National Centre of Competence in Research (NCCR) Catalysis, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland. E-mail: bojana.rankovic@epfl.ch; philippe.schwaller@epfl.ch
bDepartment of Physics, University of Cambridge, UK
cDepartment of Applied Mathematics and Theoretical Physics, University of Cambridge, UK
First published on 2nd November 2023
Reaction additives are critical in dictating the outcomes of chemical processes, making their effective screening vital for research. Conventional high-throughput experimentation tools can screen multiple reaction components rapidly. However, they are prohibitively expensive, which puts them out of reach for many research groups. This work introduces a cost-effective alternative using Bayesian optimisation. We consider a unique reaction screening scenario evaluating a set of 720 additives across four different reactions, aiming to maximise UV210 product area absorption. The complexity of this setup challenges conventional methods for representing reactions, such as one-hot encoding, rendering them inadequate. This constraint forces us to move towards more suitable reaction representations. We leverage a variety of molecular and reaction descriptors, initialisation strategies and Bayesian optimisation surrogate models and demonstrate convincing improvements over random search-inspired baselines. Importantly, our approach is generalisable and not limited to chemical additives, but can be applied to achieve yield improvements in diverse cross-couplings or other reactions, potentially unlocking access to new chemical spaces that are of interest to the chemical and pharmaceutical industries. The code is available at: https://github.com/schwallergroup/chaos.
Additives are critical for altering the reactivity and outcome of chemical reactions.40,41 According to the IUPAC Gold Book definition, additives are “substances that can be added to a sample for any of a variety of purposes”.42 They are crucial in a range of chemical processes, including polymer synthesis, pharmaceutical development, and materials science.43–45 Identifying optimal additives can significantly enhance reaction efficiency, selectivity and yield, leading to cost-effective and sustainable chemical processes.46,47 In this study, we introduce a BO-based approach for efficient exploration of the additive§ search space. Subsequently, we explore a range of representation methods to determine the most appropriate ones for uncovering additive-induced yield improvements. This approach not only streamlines experimental design and optimisation but also holds immense promise for various applications within the field of chemistry. While Prieto Kullmer et al.39 screened these compounds using high-throughput experimentation (HTE), not all laboratories can access robotic platforms. Synthetic chemists, however, could greatly benefit from using BO to discover the optimal additives, allowing them to improve a reaction without the need for exhaustive (and expensive) testing of all possible combinations (Fig. 1). Compared to existing applications of BO to chemical reactions (e.g., Buchwald–Hartwig reactions48), the additive dataset is substantially more challenging. Firstly, OHE is ill-suited for this task as it results in high-dimensional vectors with only one active dimension per additive. The resulting extreme sparsity and lack of shared information make it difficult to address the complexities of the dataset. This kind of representation limits the use of machine learning models, which can struggle to extract valuable insights. While OHE may seem intuitive, we empirically confirm these shortcomings, with details in the results section. As we demonstrate, applying BO in this setting fails to improve over random search. Secondly, the additives in this paper exhibit greater structural diversity than the components screened in other HTE studies. This distinctiveness significantly increases the computational demands for generating human-labelled atomic or local QM descriptors. We overcome these limitations by using computationally efficient reaction and molecular representations with a maximal diversity initialisation scheme and flexible surrogate models. Finally, the inherent complexity of the dataset coupled with the limited predictive signal¶ between the representations and the output (yield) poses a significant challenge for optimisation. Existing research, however, suggests that the application of BO can still help reach promising results even in those scenarios.49 Despite these challenges, we demonstrate that augmenting BO with adequate reaction representations, initialisation schemes and appropriate surrogate models results in an efficient search towards the best-performing additives in less than 100 evaluations while using as few as ten initialisation reactions.
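To make the sparsity argument concrete, the following minimal NumPy sketch (with arbitrary illustrative values, not data from this study) contrasts a one-hot encoding of the 720 additives with a hypothetical shared-bit fingerprint representation:

```python
import numpy as np

n_additives = 720

# One-hot encoding: each additive is a 720-dimensional vector with a single
# active bit, so no two additives share any active dimension.
ohe = np.eye(n_additives, dtype=int)
print(ohe[0] @ ohe[1])           # 0 -> zero overlap between any pair of additives

# A hypothetical 512-bit fingerprint where structurally related additives can
# share active bits, giving a surrogate model something to generalise from.
rng = np.random.default_rng(0)
fp = (rng.random((n_additives, 512)) < 0.05).astype(int)  # ~5% of bits set
print((fp[0] & fp[1]).sum())     # usually non-zero -> shared substructure signal
```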
Fig. 1 Visualisation of the Bayesian optimisation pipeline for additive screening. Starting from the HTE dataset,39 we extract either additive smiles or reaction components to generate reaction smiles. We propagate these smiles through a molecular encoder (i.e., fingerprints, fragprints, xtb, cddd, mqn, chemberta) or a reaction encoder (rxnfp, drfp) to obtain features. The built features allow us to select initial points, leveraging methods like clustering, to set up the Gaussian process surrogate model. The BO loop then runs for a predetermined number of iterations with the objective of reaching the global optimum, which corresponds to the highest UV210 product area absorption.
The structure of this paper is as follows: Section 2 details the data and the representations, Section 3 covers methodology, followed by a presentation and discussion of results in Section 4. We conclude and offer future directions in Sections 5 and 6.
Reaction plate 1: investigates the impact of additives on the decarboxylative C–C coupling of Informer X2 (a highly reactive aryl halide substrate)50 and cyclohexanoic acid.
Reaction plate 2: explores the influence of additives on the coupling between 3-bromo-5-phenylpyridine and cyclohexanoic acid.
Reaction plate 3: examines the role of additives in the coupling of 7-bromo-3,4-dihydroquinolin-2(1H)-one with cyclohexanoic acid.
Reaction plate 4: assesses the effect of additives on the coupling between Informer X2 and hexanoic acid.
This property offers two primary ways to encode these reactions: one by isolating the additive and the other by considering the holistic reaction. The former pertains to molecular representations of the additive, while the latter could involve reaction fingerprints or global QM descriptors.
Together with mqn descriptors,56–58 Morgan fingerprints offer representations summarising molecular structure. In our experiments, we leveraged both mqn descriptors and Morgan fingerprints (referred to as fingerprints henceforth). Additionally, we explored combining fingerprints with encoded fragments of a molecule (computed using RDKit59), essentially forming a more comprehensive representation (aptly coined fragprints60–63). The enriched fragprints provide insights into the overall structure and the specific constituents of the molecule. Though computationally efficient compared to descriptors involving intensive human labour or simulations, traditional cheminformatics descriptors might not capture the complexity of chemical interactions.
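As an illustration of how such cheminformatics representations can be generated, the sketch below uses RDKit to build a Morgan fingerprint, mqn descriptors and a fragprint-style concatenation for a single additive smiles. The 512-bit length, radius 2 and the use of RDKit's fr_* fragment counts are illustrative assumptions, not necessarily the exact settings of this work:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors, Fragments

smiles = "O=C1NC(=O)c2ccccc21"   # phthalimide, used here purely as an example
mol = Chem.MolFromSmiles(smiles)

# 512-bit Morgan (circular) fingerprint with radius 2
fp = np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=512))

# 42 molecular quantum number (mqn) descriptors
mqn = np.array(rdMolDescriptors.MQNs_(mol))

# Fragment counts from RDKit's predefined fragment functions (fr_*)
frag_fns = [getattr(Fragments, name) for name in dir(Fragments) if name.startswith("fr_")]
frags = np.array([fn(mol) for fn in frag_fns])

# A fragprint-style representation: fingerprint bits concatenated with fragment counts
fragprint = np.concatenate([fp, frags])
print(fp.shape, mqn.shape, fragprint.shape)
```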
We focus on data-driven methods that utilise simplified molecular-input line-entry system (SMILES) representations.67 smiles codes are textual representations of molecules that encode the molecular graph structure in a simple string format. Their textual nature allows employing advanced machine learning models originally designed for natural language processing tasks.68 These models can interpret the ‘syntax’ and ‘grammar’ of smiles to extract chemically meaningful features leading to a multitude of data-driven representation methods.69–73 In this study, we specifically employ two data-driven molecular descriptors. First, CDDD (Continuous and Data-Driven molecular Descriptors), which translates between semantically equivalent but syntactically different molecular structures like smiles and InChI representations.70 Second, ChemBERTa, a BERT-based model pre-trained on a large corpus of chemical smiles strings using an optimised pretraining procedure.71,74,75
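For readers who wish to try such data-driven descriptors, the sketch below extracts embeddings from a publicly available ChemBERTa checkpoint via the Hugging Face transformers library. The model name "seyonec/ChemBERTa-zinc-base-v1" and the mean-pooling step are assumptions for illustration; the checkpoint and pooling used in this study may differ:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed public ChemBERTa checkpoint; not necessarily the one used in this work.
name = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

smiles = ["O=C1NC(=O)c2ccccc21", "Oc1ccccc1"]   # example additive smiles
batch = tokenizer(smiles, padding=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# Mean-pool the token embeddings (ignoring padding) into one vector per molecule
mask = batch["attention_mask"].unsqueeze(-1)
emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
print(emb.shape)   # (2, hidden_size), e.g. 768-dimensional
```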
Recent approaches have looked to map reactions directly to a fingerprint, independent of the number of reaction components and the underlying representation. Schwaller et al.69 derived data-driven reaction fingerprints (RXNFP) directly from the reaction smiles by employing transformer models78 trained for reaction type classification tasks. Reaction smiles is an extension of the regular smiles notation that represents not just single molecules, but entire chemical reactions. It includes the smiles strings of reactants and reagents on one side (separated by dots) and the product on the other side, separated by the special character “>>”. The benefit of this approach is its ability to map reactions to highly versatile continuous representations regardless of the number of reaction components. However, using rxnfp in this project's scope might not be adequate since additives play a relatively minor role in reaction type classification. On the other hand, Probst et al.79 introduced the differential reaction fingerprint (DRFP). This representation is based on the symmetric difference|| of two sets generated from the molecules listed left (reactants and reagents) and right (products) of the reaction arrow, using a method that captures the environments around atoms in a radial manner, termed ‘circular molecular n-grams’. Their design makes them extremely flexible, effectively encoding the interplay of diverse reaction elements and maintaining robust performance in scenarios where either a single reaction component or multiple components may vary.
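A minimal example of computing a drfp vector with the open-source drfp package is shown below. The reaction smiles is a simplified sketch loosely based on reaction plate 3 (additive and other reagents omitted for brevity), and the folded length and radius shown are illustrative rather than the exact settings of this study (which reports 512-dimensional drfp vectors):

```python
from drfp import DrfpEncoder   # pip install drfp

# Simplified, hypothetical reaction smiles: reactants/reagents left of ">>", product right.
rxn = "O=C(O)C1CCCCC1.O=C1CCc2ccc(Br)cc2N1>>O=C1CCc2ccc(C3CCCCC3)cc2N1"

# Encode into a folded binary differential reaction fingerprint.
fps = DrfpEncoder.encode([rxn], n_folded_length=512, radius=3)
print(len(fps[0]), fps[0].sum())   # 512 bits, a handful of them set
```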
Several components play crucial roles in determining the outcome of a BO-based search strategy. Firstly, the representation of chemical reactions dictates how the model interprets the data. Secondly, the kernel choice in the surrogate model shapes the learned relationships between data points. Thirdly, the initialisation strategy influences the starting point and path of the optimisation process. Lastly, the acquisition function guides the decision on where to sample next.
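The two acquisition functions compared later (ucb and ei) have simple closed forms given the surrogate's posterior mean and standard deviation. The generic NumPy/SciPy sketch below uses an illustrative beta value and is not tied to the settings of this study:

```python
import numpy as np
from scipy.stats import norm

def ucb(mu, sigma, beta=2.0):
    """Upper confidence bound: favour points with a high mean and/or high uncertainty."""
    return mu + np.sqrt(beta) * sigma

def expected_improvement(mu, sigma, best_y):
    """Expected improvement over the best objective value observed so far (maximisation)."""
    sigma = np.maximum(sigma, 1e-9)        # guard against zero predictive variance
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

# The next additive to evaluate is the unmeasured candidate maximising the acquisition, e.g.:
# next_idx = np.argmax(ucb(mu_candidates, sigma_candidates))
```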
We applied BO to a dataset of 720 screened additives across four unique reactions, aiming to maximise the UV210 product area absorption. To evaluate the BO approach, we initiate the search with a set of 10 starting points. The optimisation process runs for up to 100 iterations, during which we monitor the performance against the remaining dataset, comprising over 600 data points. We measure the success of the optimisation by assessing how many of the top-performing reactions we identify during these iterations. For this reason, we define a top-n neighbourhood metric as the set of n reactions with the highest yield for each reaction plate. The motivation behind the top-n neighbourhood search is to provide a diverse set of high-performing additives, giving researchers more flexibility in their choice based on factors such as availability, complexity, and price. This approach allows for a more flexible and pragmatic selection of additives and reflects the practical constraints and requirements of real-world applications. To find the optimal configuration, we carry out a grid search over combinations of parameters, namely data representation, kernel, initialisation strategy and acquisition function, repeating the runs across 20 different seed values to ensure robust findings. The limitations of one-hot encoding (OHE) on this dataset directed us towards the exploration of other molecular and reaction representations, both computationally and chemically reasonable, while steering away from the intensive demands of quantum molecular descriptors. For data representation, we extensively evaluated fingerprints, fragprints, mqn and xtb features, data-driven cddd and chemberta descriptors and holistic reaction representations such as rxnfp, drfp and OHE. We used a Gaussian process surrogate model and assessed the influence of different kernels (Matern, Tanimoto, Linear). To select the 10 starting points we used an initialisation strategy (random, clustering and maximising the minimum distance between the selected points), and to guide the search towards promising regions we compared two acquisition functions: upper confidence bound (UCB) and expected improvement (EI). The core objective of this study was to identify whether BO can emulate or even surpass the outcomes of HTE and, if so, under which configuration. We used the first of the four available reactions to evaluate the combinations of parameters over 20 different seeded runs and finally carried out the optimisation loop for the remaining reactions using the best-performing setup.
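The top-n neighbourhood metric described above can be written in a few lines. The sketch below is a minimal illustration, assuming yields is an array over all 720 additives and evaluated_idx holds the indices queried during initialisation and the BO loop (the data here are random placeholders):

```python
import numpy as np

def top_n_found(yields, evaluated_idx, n=5):
    """Fraction of the n highest-yielding additives recovered during the BO run."""
    top_n_idx = set(np.argsort(yields)[-n:])        # indices of the n best additives
    return len(top_n_idx & set(evaluated_idx)) / n

# Example with hypothetical data: 720 additives, 10 initial + 100 BO evaluations
rng = np.random.default_rng(0)
yields = rng.random(720)
evaluated_idx = rng.choice(720, size=110, replace=False)
print(top_n_found(yields, evaluated_idx, n=5))
```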
Below, we delve deeper into each of the necessary BO elements (Table 1), explaining our choices and their implications.
Variables | Values
---|---
Kernel | Matern, Tanimoto, Linear
Initialisation | Clustering, Maxmin, Random
Acquisition | ei, ucb
Representation | drfp, rxnfp, OHE, cddd, xtb, fingerprints, fragprints, mqn, chemberta
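The configuration grid of Table 1 can be enumerated directly. The sketch below shows the structure of such a sweep over 20 seeds; run_bo is a hypothetical placeholder for a single BO run (10 initial points, up to 100 iterations) and is not a function of this codebase:

```python
from itertools import product

kernels = ["matern", "tanimoto", "linear"]
inits = ["clustering", "maxmin", "random"]
acquisitions = ["ei", "ucb"]
representations = ["drfp", "rxnfp", "ohe", "cddd", "xtb",
                   "fingerprints", "fragprints", "mqn", "chemberta"]
seeds = range(20)

results = []
for kernel, init, acq, repr_, seed in product(kernels, inits, acquisitions,
                                              representations, seeds):
    # Placeholder for one BO run on reaction plate 1, returning the top-n hit rate:
    # score = run_bo(representation=repr_, kernel=kernel, init=init,
    #                acquisition=acq, seed=seed)
    # results.append((repr_, kernel, init, acq, seed, score))
    pass
```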
(1) |
This definition allows a variety of models to act as surrogate components in the Bayesian optimisation setup. Any model that can output predictions over the input space and confidence over those predictions is a potential choice for a surrogate model. A favoured selection is often a Gaussian process because of its flexibility, simplicity, and ability to capture complex functions with relatively few hyperparameters to tune (admitting second-order optimisers such as L-BFGS-B90 for the marginal likelihood loss function).
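As an illustration (not necessarily the implementation used in this work), scikit-learn's GaussianProcessRegressor fits kernel hyperparameters by maximising the log marginal likelihood with L-BFGS-B by default and returns both a predictive mean and a standard deviation, which is all an acquisition function needs. The shapes and data below are placeholders:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X_train = rng.random((10, 512))    # e.g. representations of 10 initial additives
y_train = rng.random(10)           # corresponding UV210 product area values

gp = GaussianProcessRegressor(
    kernel=Matern(nu=2.5),         # hyperparameters tuned by maximising the
    normalize_y=True,              # log marginal likelihood (L-BFGS-B by default)
)
gp.fit(X_train, y_train)

X_cand = rng.random((710, 512))    # remaining candidate additives
mu, sigma = gp.predict(X_cand, return_std=True)
```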
Gaussian processes easily adapt to different problem domains by changing the kernel function, which defines the covariance structure between input points. In Gaussian processes, kernel functions measure the similarity between data points in the input space. This similarity is then used to predict the function value for a new input by considering its proximity to previously evaluated data points. Different kernel functions can capture different types of relationships between data,91,92 and their choice plays a significant role in determining the properties of the surrogate model, such as smoothness, periodicity, and stationarity. Selecting kernel functions that are appropriate for the chosen reaction representations is essential in the context of reaction optimisation.
Among the kernels developed for chemical reactions, we find the Tanimoto kernel62,93,94 effective for binary representations due to its ability to quantify structural overlap. The Linear kernel is often sufficient if the descriptors are informative enough or the problem has a linear nature. Additionally, we consider the Matern kernel for its flexibility in capturing varying degrees of smoothness in the data, making it a suitable choice for more complex reaction spaces.
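The Tanimoto kernel on binary vectors has a simple closed form; the NumPy sketch below is generic and not tied to any particular GP library:

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Tanimoto similarity between rows of binary matrices A (n x d) and B (m x d):
    k(a, b) = <a, b> / (||a||^2 + ||b||^2 - <a, b>).
    Assumes no all-zero rows (which would give 0/0)."""
    dot = A @ B.T
    norm_a = (A * A).sum(axis=1)[:, None]
    norm_b = (B * B).sum(axis=1)[None, :]
    return dot / (norm_a + norm_b - dot)

# Example on two random 512-bit fingerprints
rng = np.random.default_rng(0)
fps = (rng.random((2, 512)) < 0.1).astype(float)
print(tanimoto_kernel(fps, fps))   # diagonal entries are exactly 1
```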
The reaction design space contains potential combinations of additives, reactants, catalysts, solvents, and reaction conditions such as temperature, pressure and concentration. In this study, the focus narrows down to a set of possible additives.
On the other hand, the Bayesian optimisation configuration design space includes model parameters and optimisation frameworks that enable effective exploration of the reaction design space. Here we explore parameters such as the choice of reaction representation, kernel function and data initialisation method. Understanding the interplay between these factors is key to achieving efficient search and optimisation. For example, the kernel choice may depend on the reaction representation, which, as a consequence, dictates the optimisation success.
In the domain of chemistry, operating within a low data regime is often the norm rather than the exception. Furthermore, chemists might face a dual incentive when empowered with BO solutions: starting the optimisation process early to save time and resources, while also needing a diverse set of data to initialise the optimisation models effectively.
A Gaussian process surrogate model, although well-suited to limited data settings, achieves a more comprehensive understanding of the underlying data function when initialised with a diverse set of data points incorporating prior knowledge of the design space.99–101 This selection leads to increased precision in uncertainty measurements and, subsequently, more accurate model predictions. To achieve this, we employ maximum diversity initialisation schemes that enable us to explore the structured search space of reactions and select a representative sample of points to accelerate the optimisation process. These schemes include clustering, maximal coverage, and a random sampling baseline.
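One of these diversity schemes, maximising the minimum distance between selected points, can be written as a simple greedy loop. The NumPy sketch below is illustrative rather than the exact implementation used here; the clustering alternative would instead pick, for example, the medoid of each cluster (see footnote **):

```python
import numpy as np

def maxmin_init(X, n_init=10, seed=0):
    """Greedily pick points that maximise the minimum distance to those already chosen."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]              # random first point
    dist = np.linalg.norm(X - X[selected[0]], axis=1)   # distance to the selected set
    while len(selected) < n_init:
        nxt = int(np.argmax(dist))                      # farthest from current selection
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# Example: pick 10 diverse starting additives from 720 random feature vectors
X = np.random.default_rng(1).random((720, 512))
print(maxmin_init(X, n_init=10))
```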
Fig. 2 t-SNE visualisation102 of the fragprints representation of Reaction 1 in the latent space. The colours describe the clusters, highlighting the central additives with their corresponding molecular structures. We discover the phthalimide additive, identified as the best overall additive in the original study, within the initial clustering. This compelling side effect of the clustering demonstrates its ability to effectively describe the latent space and identify appropriate initial additives.
To reiterate, we focus on identifying the top-performing reactions within the evaluated BO iterations. This is measured using the top-n neighbourhood metric, aiming for a selective and diverse array of high-yield reactions. As a compromise between the top one and top 10 discovered additives, and aiming for a clear visualisation, we show the percentage of the top five performing additives discovered during the optimisation process across different representations in Fig. 3. The same plot shows the significant importance of the reaction representation choice for the success of the BO strategy. Given the unique nature of additive screening, we can encode reactions using either reaction or molecular descriptors. As the additive is the only variable component in additive screening, it uniquely describes each data point per reaction in the dataset. However, reaction representations, like drfp, inherently capture more comprehensive information by considering the interplay of all reaction components. This representation emerged as particularly effective, especially when combined with the Matern kernel, contrasting our expectations about the binary-tailored Tanimoto kernel. Moreover, the Matern kernel dominates the other alternatives across the majority of representations, highlighting its adaptability and robustness.
Focusing on the internal structure of additives only, both fingerprints and fragprints emerge as strong contenders. The slight advantage of fragprints suggests the potential relevance of molecular fragments in the context of the evaluated additives. Among the continuous representations, data-driven, feature-rich representations such as cddd underperform in BO tasks despite having higher validation scores in model-fit-related metrics (see Fig. 1 in the ESI).† This outcome may be due to the complexity of this representation (continuous 512-dimensional vectors) combined with the constraints of a low data regime. While cddd can capture intricate chemical features and relationships, it also introduces a high degree of complexity into the model, which can be challenging to decipher with only a small number of points in the initialisation.
Importantly, we are often inclined to associate the complexity of a feature with its dimensionality. While this connection can be made for continuous representations, in binary representations high dimensionality often takes on a different meaning due to the nature of the input space. As a consequence, binary data translates to a “practical” dimensionality that is generally lower than what one might encounter in a Euclidean space. For example, binary representations in our experiments, such as fingerprints and drfp, form 512-dimensional design spaces (Table 2), but the complexity they introduce to the model is significantly lower compared to the 512-dimensional continuous cddd representations, enhancing BO performance as a result.
Representation | Dimension | Type
---|---|---
drfp | 512 | Binary
fragprints | 597 | Binary
fingerprints | 512 | Mixed
cddd | 512 | Continuous
rxnfp | 256 | Continuous
xtb | 11 | Continuous
mqn | 42 | Continuous
ohe | 722 | Binary
chemberta | 768 | Continuous
Another data-driven reaction representation, rxnfp, follows a similar path to cddd, as shown in Fig. 4. The same argument about coupling feature-rich representations with a low-data setup applies as for cddd, yet there is a considerable difference in the information encoded by the two representations. rxnfp allows us to encode the whole reaction, with the interrelations between additives and other reaction components. However, the design of rxnfp may not be well-suited for the task at hand. Out of the box, the rxnfp representation aims to capture the global information of a reaction, including all reactants, reagents, and the transformation itself. It encodes information valuable for distinguishing reaction types. In the unique setup of additive screening, where the only variable component is the additive, this global reaction information may dilute the effect of the additive, considering the limited role additives play in reaction type classification, and therefore undermine the performance of BO.
Xtb features, on the other hand, include properties related to the additive's electronic structure and molecular properties and result in low-dimensional continuous representations. However, similar to drfp, they show an increased sensitivity to the choice of kernel. The discovery of the phthalimide ligand additive in the original study, and the consequent mechanistic understanding it provided,39 served as the initial reasoning why xtb features might be an effective representation for the BO search in this paper. The specific electronic properties of phthalimide, such as its electron-withdrawing capacity, significantly influence the oxidative addition step. These properties play a crucial role in facilitating the reaction by stabilising the transition state or the reactive intermediates. The xtb features, capturing these electronic properties, should provide a detailed and nuanced representation of the additive. In scenarios where the additive's electronic structure is the primary determinant of its performance, xtb features might offer a significant advantage, but they omit other crucial information. Moreover, xtb features demand custom calculations and might not be ideal in cases where other factors, such as steric effects, define the reaction outcome.
The remaining representations (mqn, chemberta and, as anticipated, OHE) show below-par performance, indicating their limited utility in BO search. Given its inherent design, we expected OHE to provide little exploitable structure, limiting what the model can learn from this representation. As a consequence, the outcome is often worse than random search, as shown in Fig. 4. Similarly, mqn and chemberta perform on par with random search.
Following on from the influence of the various reaction representations, we evaluated the remaining parameters and summarised the results in Table 3. Alongside the data representation, the choice of kernel, initialisation strategy and acquisition function further dictates the success of the Bayesian optimisation process. The table provides an aggregate overview of the performance of each of these parameters, measured in terms of the percentage rate of identifying the top one and top five additives and the validation R2 score evaluated on the remaining 610 additives after 100 BO iterations starting from the 10 initial compounds. The Matern kernel stands out, achieving the highest success rate in identifying valuable additives, albeit with noticeable variance. The Tanimoto and Linear kernels display lower success rates and an inability to adapt to the differing underlying data distributions coming from the different reaction representation alternatives. Moreover, the Linear kernel, while having the highest R2 score, performs worst in terms of identifying top additives. As mentioned, this result confirms the premise that the best-performing combination in terms of Gaussian process regression does not necessarily yield the best results in a Bayesian optimisation setting. This observation underscores the importance of considering the interplay between representations, initialisation strategies, and the broader optimisation context when evaluating performance. The choice of initialisation, which determines the starting points for the BO process, impacts the trajectory towards the optimal values. Cluster-based initialisation, possibly due to its capability to capture diverse regions of the search space, achieves better BO performance scores. The ucb acquisition function slightly outperforms EI on the BO metrics. However, the R2 score is noticeably higher for EI, signalling that this acquisition function tends to uncover points that improve the surrogate model fit but fails to lead towards optimal values in the search space. For a more comprehensive evaluation of the different parameters and their influence on the BO search, refer to Table 1 in the ESI.†
Param. | Type | Top 1 [%] ↑ | Top 5 [%] ↑ | Valid. R2 ↑
---|---|---|---|---
Kernel | Matern | 0.20 ± 0.40 | 0.19 ± 0.31 | −0.02 ± 0.18
 | Tanimoto | 0.08 ± 0.28 | 0.13 ± 0.24 | −0.02 ± 0.18
 | Linear | 0.02 ± 0.14 | 0.09 ± 0.15 | 0.03 ± 0.14
Init. | Clusters | 0.14 ± 0.34 | 0.18 ± 0.28 | 0.01 ± 0.17
 | Random | 0.10 ± 0.30 | 0.13 ± 0.23 | 0.01 ± 0.15
 | MaxMin | 0.07 ± 0.26 | 0.11 ± 0.21 | −0.02 ± 0.19
Acq. | UCB | 0.11 ± 0.31 | 0.15 ± 0.25 | −0.04 ± 0.19
 | EI | 0.09 ± 0.29 | 0.12 ± 0.23 | 0.04 ± 0.14
In summary, the combined influence of reaction representation, kernel choice, initialisation strategy and acquisition function shapes the BO's ability to efficiently navigate the search space and identify high-yield additives. The results emphasise the importance of rational parameter selection in achieving the full potential of BO for chemical optimisation.
Building on our analysis, we proceeded to fix the optimal choices for kernel, acquisition function, and initialisation strategy. Specifically, we employed the Matern kernel, ucb acquisition function, and clustering initialisation method. With these choices set, we observed the Bayesian optimisation paths over 100 iterations, averaged across 20 seeds, for each of the representations and reaction plates. Fig. 4 reveals the resulting patterns and illustrates the strengths and limitations of each representation in the given setup. fragprints, combined with clustering initialisation, begin the optimisation at a substantially higher level but tend to plateau more quickly. Similarly, clustering initialisation works well for xtb features, but they have limited success in reaching optimal additives. Impressively, however, both of these representations tend to uncover additives from the higher end of the complex long-tailed target distribution early in the BO loop, facilitated by the clustering of the design space. On the other hand, drfp, even though starting from a set of additives with lower objective values, exhibits consistent growth, eventually steering towards the optimum. The cddd representation fails to reach the high-yielding regions of the search space, underscoring the idea that it is not ideally suited for the optimisation task at hand. The fingerprints representation, despite its third-place position in the previous analysis (see Fig. 3), shows mixed results across reaction plates in this specific setup, often performing similarly to random search. This result highlights the sensitivity of BO to the alignment of representation and chosen parameters, as the best configuration for this representation included EI as the acquisition function and the Tanimoto kernel for the surrogate model. Meanwhile, rxnfp, in combination with the Matern kernel, lags behind, reinforcing the notion of its optimal pairing with simpler kernels. As expected, OHE consistently performs among the worst, underperforming even against a random search. Its inherent sparsity and lack of inter-data point information render it ill-suited for the task. As a comparison, we also evaluated the reaction-level representations OHE, rxnfp and drfp on the Buchwald–Hartwig dataset. Interestingly, OHE has been reported to perform particularly well on this data. Notably, in line with the findings from our primary study, drfp exhibited consistent and robust performance, showcasing its universal applicability in Bayesian optimisation scenarios across datasets with differing requirements and constraints. For more details on the results on this dataset, refer to Section A.5 in the ESI.†
Future research should focus on determining the optimal reaction representation, or possibly a dynamic combination of representations, for employing BO on different reaction types while incorporating domain knowledge. For example, switching from one reaction representation to another during the BO search would allow the search to be initialised at higher objective values while still reaching the optimum, or data-driven descriptors could be incorporated only once enough data has been collected for them to perform at their best.
In this regard, several factors warrant further development. Firstly, potential biases in the dataset and assumptions made in the modelling could impact the generalisability of the results to other chemical reactions. Future work should focus on validating the methodology using diverse datasets and reaction types to ensure robustness and applicability across different contexts. Secondly, while this study investigated several reaction representations and initialisation strategies, additional research should explore alternative representations and strategies that may further improve the performance of BO in additive discovery by adapting to specific reaction types. For example, data-driven representations, although powerful, failed to deliver encouraging results in BO in this study. They could benefit from custom-designed surrogates or from fine-tuning strategies on the datasets at hand.
By addressing these future research directions and refining the BO methodology, the chemical research community can benefit from further advancements in this powerful optimisation approach, ultimately contributing to a more efficient and comprehensive understanding of chemical reactions and their optimisation potential. The research can also extend to a broader range of chemical reactions and applications, such as high-throughput settings where batches of reactions can be evaluated simultaneously.103,104
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00096f
‡ “Ni-catalysed photoredox decarboxylative arylation” refers to a chemical reaction where a nickel catalyst, combined with light energy, enables the removal of a carboxyl group from a molecule while introducing an aryl group. Common in organic synthesis, this methodology allows crafting molecules with specific chemical attributes.
§ The term “additive” refers to a single selection from a set of 720 examined additives.
¶ The term “limited signal” refers to the low validation scores indicating poor alignment between the data representations and the desired output in the low data regime. This complexity results in a challenging modelling scenario.
|| The symmetric difference encapsulates elements that are in either set, but not in the intersection.
** The k-medoids method is similar to k-means but uses the most centrally located data point in a cluster (medoid) to represent that cluster. This method can employ various distance metrics, allowing it to be more flexible based on the data representation type.
This journal is © The Royal Society of Chemistry 2024