Runzhe Liu a, Zihao Wang b, Wenbo Yang *a, Jinzhe Cao *a and Shengyang Tao *a
aSchool of Chemistry, State Key Laboratory of Fine Chemicals, Frontier Science Center for Smart Materials, Dalian Key Laboratory of Intelligent Chemistry, Dalian University of Technology, Dalian, 116024, China. E-mail: wbyang@dlut.edu.cn; ginercao@mail.dlut.edu.cn; taosy@dlut.edu.cn
bFaculty of Information, Beijing University of Technology, Beijing, 100124, China
First published on 12th August 2024
The integration of artificial intelligence (AI) and chemistry has propelled the advancement of continuous flow synthesis, facilitating program-controlled automatic process optimization. Optimization algorithms play a pivotal role in the automated optimization process, and improving their accuracy and predictive capability further reduces the cost of optimization. A self-optimizing Bayesian algorithm (SOBayesian), incorporating Gaussian process regression as a surrogate model, has been devised. Adaptive strategies are implemented during model training, rather than on the acquisition function, to improve the model's predictive performance. This algorithm was used to optimize the continuous flow synthesis of pyridinylbenzamide, an important pharmaceutical intermediate, via the Buchwald–Hartwig reaction. A yield of 79.1% was achieved in under 30 rounds of iterative optimization; subsequent optimization with reduced prior data cut the number of experiments by 27.6%, significantly lowering experimental costs. The experimental results indicate that the reaction is kinetically controlled. This work provides guidance for optimizing similar reactions and new directions for automated optimization in continuous flow synthesis.
In previous cases, adaptive strategies have been adopted and applied to acquisition functions. Acquisition functions use the mean and variance calculated by a surrogate model to determine the next point to explore, balancing the exploitation of the objective function, the exploration of unknown regions of the space, and the prediction of risk and uncertainty.29 For instance, Bourne et al. employed an adaptive expected improvement strategy for continuous variable optimization problems, while Clayton et al. used the same strategy for mixed variable optimization.15,30 This dynamically adjusted expected improvement strategy, achieved by modifying the implicit binding to the underlying model, allows the model to efficiently acquire the next target data point while maintaining accuracy. Similarly, Dragonfly's Bayesian optimization tool utilizes a lower confidence bound-based acquisition function, encouraging the algorithm to explore unknown regions by adjusting hyperparameters.17,31 These self-optimizing strategies applied to acquisition functions have been used successfully in the optimization of continuous flow synthesis. However, adaptive strategies for kernel functions, which are the core of GPR models, are lacking. In previous cases, GPR algorithms often employed a single kernel with fixed hyperparameters for model training. Although this approach is effective for reaction optimization, its fixed model structure may lead to inaccurate or inefficient predictions. Therefore, this study applies adaptive strategies to the model training process, constructing an adaptive strategy for kernel functions to obtain a self-optimizing Bayesian optimization algorithm and unlock more of the potential at the interface of continuous flow process optimization and automated control. To verify the feasibility of the designed self-optimizing Bayesian algorithm, this study adopts adaptive strategies only during model training and does not design an acquisition function; instead, the next experimental conditions are obtained directly by traversing sample points throughout the entire search space.
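For context, a minimal sketch of a lower-confidence-bound style acquisition of the kind Dragonfly uses is shown below. It is illustrative only, not the implementation of refs. 15, 17, 30 or 31, and it assumes a scikit-learn-style surrogate exposing predict(..., return_std=True).

```python
import numpy as np

def lcb_acquisition(surrogate, candidates, kappa=2.0):
    """Illustrative lower confidence bound: mean - kappa * std (for minimization).
    For yield maximization one would instead maximize mean + kappa * std.
    'kappa' is the hyperparameter that trades off exploitation vs. exploration."""
    mean, std = surrogate.predict(candidates, return_std=True)
    return mean - kappa * std
```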
This study optimizes the continuous flow synthesis process for nitrogen-containing heterocyclic amides, which are widely found in pharmaceuticals, proteins, and functional materials. Pyridinylbenzamide, a vital drug intermediate, was chosen for its versatility in synthesizing diverse active compounds, such as luciferase inhibitors32 and anti-ulcer drugs.33 The synthesis of pyridinylbenzamide via the Buchwald–Hartwig reaction, which offers higher selectivity and conversion rates, has been demonstrated.34–36 Given the inherent efficiency, safety, and stability of continuous flow synthesis in producing APIs, it becomes imperative to integrate the synthesis of pyridinylbenzamide into this methodology. In this study, pyridinylbenzamide was synthesized efficiently via the Buchwald–Hartwig reaction using 2-bromopyridine and 4-methoxybenzamide (anisamide) as reactants. Gaussian process regression was utilized as a surrogate model to construct a Bayesian optimization algorithm that predicts yields through regression and outputs the corresponding reaction parameters. By iterating over different kernels, combined with n-fold cross-validation and Bayesian hyperparameter optimization, adaptive strategies were applied during model training to achieve self-optimization of the algorithm. Using 14 sets of prior data, the optimal process parameters and yield were obtained after 15 iterations. Subsequently, optimization was performed using 5 sets of prior data with shorter retention times selected from the initial 14 datasets. Remarkably, the same optimization results were achieved after 16 iterations. The experimental results indicate that the self-optimizing Bayesian optimization algorithm performs well on small-scale datasets and efficiently optimizes process parameters to achieve a yield of 79.1% while reducing the number of experiments by 27.6%.
The flow synthesis route involves pumping a solution of bromopyridine, anisamide, palladium acetate, and xantphos in DMF through one line, while a separate line delivers DBU dissolved in DMF at the specified concentration (Fig. 1a). The two fluids were combined in a mixer and then introduced into a standard PFA coiled-tube reactor with a holding capacity of 10 mL for the heated reaction. Downstream of the PFA coil reactor, a column reactor packed with silica gel filters palladium out of the system and cools the stream to terminate the reaction.
Fig. 1 Experiments on the flow synthesis of pyridinylbenzamide. (a) Platform configuration. (b) Continuous flow reaction path and optimized variables. (c) Bayesian optimized synthesis process flow.
The independent variables for optimization included reaction temperature, retention time, and the equivalents of anisamide, DBU, and xantphos, with yield as the objective function (Fig. 1b). The concentration of palladium acetate, the catalyst for the Buchwald–Hartwig reaction, significantly affects the reaction; however, higher palladium concentrations increase costs. Typically, palladium catalysts are used at 2–6 mol%.40 Instead of adjusting the equivalent of palladium acetate, a lower palladium loading (3 mol%) was chosen to reduce costs and better assess the impact of the five optimized independent variables on the reaction. No additional constraints were imposed on the algorithm, to avoid the influence of manual decision-making on its optimization. The optimization ranges were set widely to maximize yields. More accurate parameter combinations were obtained by refining the parameter optimization step, resulting in a parameter space of 257,712,000 combinations (118 × 10 × 78 × 20 × 140). The liquid retention time in the standard PFA coiled-tube reactor with a 10 mL capacity is represented by 'time', with an optimization range of 2–80 minutes and a step of 1 minute. 'Equiv of anisamide' represents the equivalents of 4-methoxybenzamide, with an optimization range of 1–2 eq. and a step of 0.1 eq. 'Temperature' refers to the reaction temperature in the standard PFA coiled-tube reactor, with an optimization range of 30–148 °C and a step of 1 °C. 'Equiv of base' is the equivalent of DBU, with an optimization range of 1–3 eq. and a step of 0.1 eq. 'Equiv of ligand' is the equivalent of xantphos, with an optimization range of 6–20 mol% and a step of 0.1 mol%. The algorithm searches for the optimal parameters within this combined parameter space.
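As a rough illustration, the discretized search space can be enumerated as the Cartesian product of the five variable grids. This is a minimal sketch assuming the ranges and step sizes stated above; the exact candidate count depends on how the endpoints are handled, so it may differ slightly from the figure quoted in the text.

```python
import numpy as np

# Illustrative grids built from the stated ranges and step sizes
time_min      = np.arange(2, 81, 1)                             # retention time, 2-80 min
equiv_amide   = np.round(np.arange(1.0, 2.0 + 1e-9, 0.1), 1)    # anisamide, 1-2 eq.
temperature_C = np.arange(30, 149, 1)                           # 30-148 deg C
equiv_base    = np.round(np.arange(1.0, 3.0 + 1e-9, 0.1), 1)    # DBU, 1-3 eq.
equiv_ligand  = np.round(np.arange(6.0, 20.0 + 1e-9, 0.1), 1)   # xantphos, 6-20 mol%

grids = [time_min, equiv_amide, temperature_C, equiv_base, equiv_ligand]
n_candidates = int(np.prod([len(g) for g in grids]))
print(f"candidate combinations in this discretization: {n_candidates:,}")
```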
The optimization workflow for the flow synthesis process parameters is shown in Fig. 1c. Initially, the GPR model is trained on a prior dataset. The algorithm then predicts the optimal yield and the corresponding experimental conditions by selecting the best model based on the minimum mean squared error (MSE). The predicted conditions are used as the parameters for the next round of flow synthesis experiments, and the actual yield under those parameters is measured. Subsequently, the consistency of the predicted yield with the actual yield is assessed. If the predicted yield matches the actual yield, or the algorithm outputs unchanged experimental parameters four consecutive times, the optimization is considered complete. Otherwise, the prior dataset is updated: the newly obtained experimental parameters and the actual yield are added to the previous experimental data for the next round of training, until optimization ends. Given the specific challenges associated with training on small datasets, a GPR model was chosen as the regression algorithm. This choice was complemented by an n-fold cross-validation method to mitigate the potential drawbacks of limited training data, which could otherwise undermine the model's generalization capability. Additionally, Bayesian hyperparameter optimization was used to keep the model in its optimal state and enhance its generalization ability. As the core component of the GPR model, the kernel is critical for model training and prediction. The prediction kernel is chosen by iterating over the radial basis function (RBF) kernel, the rational quadratic (RQ) kernel, the Matern kernel, and the DotProduct kernel. To the best of our knowledge, at the time of writing, these four kernels encompass all the options available in the scikit-optimize library for kernel-based optimization. The selection is based on the MSE criterion; the selected kernel exhibits the most accurate prediction and the strongest generalization ability for the current dataset. With this design, the model implements a self-optimizing function during iterative optimization and data modeling.
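A minimal sketch of this kernel-selection step is shown below, assuming scikit-learn's Gaussian process implementation. For brevity, the kernel hyperparameters are fitted by scikit-learn's built-in marginal-likelihood optimization rather than the separate Bayesian hyperparameter search described above, and the function name is illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic, Matern, DotProduct
from sklearn.model_selection import cross_val_score

def select_best_gpr(X, y, n_folds=5):
    """Train a GPR with each candidate kernel and keep the one with the lowest
    cross-validated MSE (a sketch of the adaptive kernel strategy)."""
    candidate_kernels = [RBF(), RationalQuadratic(), Matern(), DotProduct()]
    best_model, best_mse = None, np.inf
    for kernel in candidate_kernels:
        gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                                       n_restarts_optimizer=5, random_state=0)
        # n-fold cross-validation; scikit-learn reports MSE as a negative score
        scores = cross_val_score(gpr, X, y, cv=n_folds,
                                 scoring="neg_mean_squared_error")
        mse = -scores.mean()
        if mse < best_mse:
            best_mse, best_model = mse, gpr.fit(X, y)
    return best_model, best_mse
```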
To verify whether the designed GPR model is suitable for small sample datasets and to explore the model's adaptability to even less prior data, we investigate the effects of different amounts of prior data on the subsequent optimization process by varying the initial number of prior data points. In previous cases, methods such as full factorial design, Latin hypercube sampling, and maximum coverage initialization have often been used to design prior data. Adopting such global prior data design strategies can enhance optimization efficiency and avoid falling into local extrema traps.
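For reference, Latin hypercube sampling of initial conditions over the five variable ranges used here could look like the following sketch (SciPy's qmc module). The bounds are taken from the ranges stated earlier and the sample size of 14 is only for illustration; the prior data in this work were generated randomly, not by LHS.

```python
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=5, seed=0)
unit_samples = sampler.random(n=14)   # 14 points in the unit hypercube [0, 1]^5
lower = [2, 1.0, 30, 1.0, 6.0]        # time (min), anisamide (eq.), T (deg C), DBU (eq.), xantphos (mol%)
upper = [80, 2.0, 148, 3.0, 20.0]
prior_conditions = qmc.scale(unit_samples, lower, upper)
```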
Although better prior data sampling methods facilitate the optimization process, the algorithm should not rely excessively on more uniform prior data sampling. In practice, scientists may directly use existing data for reaction optimization. However, such data may have issues such as missing information or uneven distribution, making them unsuitable as high-quality training data, and redesigning experiments to obtain high-quality training data inevitably incurs higher experimental costs. Training with data of varying quality poses greater challenges, but it is better aligned with the practical goals of reaction optimization and cost reduction.
To fully demonstrate the effectiveness of the designed self-optimization algorithm, adaptive strategies are applied only during the model training phase, and no acquisition function is designed. The optimal experimental parameters are obtained by traversing the entire parameter space within the best model. This approach tests both the self-optimization algorithm and the feasibility of dispensing with an acquisition function.
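A minimal sketch of this acquisition-free proposal step is given below, assuming a fitted scikit-learn-style surrogate and a candidate grid like the one enumerated earlier; in practice, a space of hundreds of millions of points would be predicted in batches rather than in one call.

```python
import numpy as np

def propose_next_experiment(model, candidates):
    """Predict the yield for every candidate condition and return the best one.
    'candidates' is an (n_candidates, n_features) array of parameter combinations."""
    predicted = model.predict(candidates)
    best = int(np.argmax(predicted))
    return candidates[best], float(predicted[best])
```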
The experiment involved synthesizing under 14 randomly generated parameter conditions, with yields measured and used directly as prior data without undergoing iterative processes. Efficient optimization should also minimize the time cost in the initial data acquisition phase. Therefore, a further selection of 5 datasets with the shortest retention time from the 14 sets of prior data was made for comparison in optimization experiments (refer to Tables S1 and S2†). These 5 datasets may contain less information about the feature parameters, posing a challenge for model optimization. The impact of the quantity of prior data on subsequent optimization is examined by comparing the results of optimization performed using 14 sets of prior data with those using only 5 sets of prior data separately.
The model's predicted values gradually converge to the true values through successive iterations, a process known as convergence. However, exceptionally high or low predicted values may occur during this process, and even if the predicted value matches the true value at a given point, it may reflect a local extremum rather than a globally optimal solution. The model is considered converged only when the predicted values match the actual values or the same reaction conditions are output four consecutive times; fluctuations in the model's predicted yields are normal during this process. The optimization starting with 14 sets of prior data ran 29 experiments in total to determine the best experimental parameters, whereas the optimization starting with 5 sets of prior data ran 21 experiments in total (Fig. 2). Both runs converged to the same reaction conditions, giving an actual yield of 79.1% at a retention time of 80 min, a reaction temperature of 148 °C, an anisamide equivalent of 1.9 eq., a base equivalent of 2.9 eq., and a ligand equivalent of 19.9 mol%. The loading of the inexpensive palladium acetate catalyst was kept low in this process, and no special anaerobic treatment was applied to the reaction system. The optimal parameters were efficiently screened out by the constructed Bayesian algorithm with a self-optimization function, and high yields were obtained. For reactions of this type, yields are typically observed to range from 10% to 75%, whereas our final yield was 79.1%.36,41 The synthesis of pyridinylbenzamide was achieved efficiently in this reaction without a special palladium catalyst, within the relatively closed environment provided by continuous flow synthesis. Thus, the choice of palladium catalyst and the creation of an anaerobic environment are not crucial factors for this reaction.
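The stopping rule described above can be expressed as a short check like the following sketch; the yield-matching tolerance is illustrative, since the text does not state a numerical threshold.

```python
def has_converged(history, yield_tol=1.0, patience=4):
    """history: list of (conditions, predicted_yield, measured_yield) per iteration,
    with 'conditions' stored as a tuple. Converged if the latest prediction matches
    the measurement within yield_tol (%) or the same conditions have been proposed
    for 'patience' consecutive iterations. The tolerance value is illustrative."""
    if not history:
        return False
    _, predicted, measured = history[-1]
    if abs(predicted - measured) <= yield_tol:
        return True
    recent = [cond for cond, _, _ in history[-patience:]]
    return len(recent) == patience and all(c == recent[0] for c in recent)
```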
The convergence process of the algorithm was analyzed. The initial predicted yield was 111.62% with 14 sets of prior data, while the corresponding experimental value was 71.6%; with 5 sets of prior data, the initial predicted yield was 366.35%, but the corresponding experimental value was 21.2% (Tables S3 and S4†). The initial model predicted yields exceeding 100% because the algorithm was not pre-trained. The discrepancy is attributed to the limited training data; the optimization with 5 sets of prior data in particular suffered from significant error, poor generalization ability, and inaccurate yield prediction. The algorithm converged the predicted yield to within 100% after 1 iteration with 14 sets of prior data, whereas it required 3 iterations with 5 sets of prior data owing to the smaller training dataset. The optimization starting from 14 sets of prior data involved 29 experiments in total over 15 iterations, and that starting from 5 sets involved 21 experiments over 16 iterations. The run starting from 14 sets of prior data needed one fewer iteration than the run starting from 5 sets, but 8 more experiments overall (Fig. 2). Despite fewer experiments being conducted with 5 sets of prior data than with 14 sets, both converged to the same reaction conditions. More prior data can reduce the number of iterations but does not necessarily decrease the overall experimental cost, which includes the construction of the prior dataset. Although the same experimental conditions were obtained, the predicted yields differed: with 14 sets of prior data, the algorithm finally converged the predicted yield to 70.22%, whereas with 5 sets it reached 74.09%, while the actual yield was 79.1%. Although more training data did not improve predictive accuracy, and the MSE output during model training did not significantly outperform that obtained from training with less data, both runs output the same experimental parameters, indicating the dynamic adjustment capability of the algorithm and allowing self-optimization during successive iterations of the optimization process.
Fig. 3 shows the progress of the two sets of optimization experiments. During the design of these optimization experiments, the prior data for both sets were not evenly distributed across the entire parameter space but were instead concentrated in one corner of it. Taking the 14 sets of prior data as an example, the prior data contained only experimental information with shorter retention times. However, after the first training prediction, the algorithm captured the information that longer retention times are beneficial for increasing yield, without limiting the search to the region of low retention times. As the optimization progressed, the model searched different corners of the entire parameter space and gradually identified the impact of each feature on the objective function. The 5 sets of prior data provided less information than the 14 sets, posing a greater challenge for model training. During optimization with 5 sets of prior data, the algorithm did not recognize the impact of retention time and reaction temperature on yield as accurately as it did with 14 sets; for example, it initially proposed parameters with lower temperatures and shorter reaction times. Nevertheless, as the optimization progressed, the algorithm actively explored the corners of the parameter space and ultimately achieved the same optimization results as with 14 sets of prior data. This also demonstrates the effectiveness of applying an adaptive strategy to model training. Additionally, even without an adaptive acquisition function, the unknown objective function could be captured by traversing the parameter space within the model, thanks to the design of the hyperparameters during model training.
Fig. 3 Buchwald–Hartwig reaction optimization results. (a) Reaction profile optimized with 14 sets of prior data. (b) Reaction profile optimized with 5 sets of prior data.
The kernel function is a crucial element in GPR modeling, defining the similarity between data points and influencing the model's predictive and generalization capabilities. By mapping data points from the input space to a high-dimensional feature space, the kernel function better reflects the distances between data points and their similarities. In a GPR model, the algorithm uses the kernel function to calculate kernel values for each data point, which are then used as inputs to predict the output of new data points. The RBF kernel is commonly used in GPR models; however, when dealing with higher-dimensional data, the distances between data points may be small, which can result in poor generalization ability of a GPR model built on the RBF kernel alone. To address this issue, we allow the algorithm to iterate through four kernels during successive training sessions to improve generalization ability: the RBF kernel, the RQ kernel, the Matern kernel, and the DotProduct kernel. The algorithm traverses the four kernels during each training session, using a different kernel for each run and testing on a test set. Ultimately, the algorithm selects the model with the strongest generalization ability based on the MSE and determines the kernel and the corresponding hyperparameters. Fig. 4a shows how often each kernel became the best when optimization started with 14 sets of prior data versus 5 sets of prior data. The RQ kernel is the best-performing kernel in both cases, followed by the Matern kernel and the DotProduct kernel, while the traditional RBF kernel performs the worst (Tables S3 and S4†).
Fig. 4 Bayesian optimization algorithm. (a) Frequency of using different kernels in GPR modeling optimization. (b) Flowchart of running GPR model.
The mathematical expressions for the four kernels are provided below, where $d(x_i, x_j)$ denotes the distance between two data points and $l$ the length scale.

RBF kernel:
$$k(x_i, x_j) = \exp\!\left(-\frac{d(x_i, x_j)^2}{2l^2}\right)$$

RQ kernel, with scale-mixture parameter $\alpha$:
$$k(x_i, x_j) = \left(1 + \frac{d(x_i, x_j)^2}{2\alpha l^2}\right)^{-\alpha}$$

Matern kernel, with smoothness parameter $\nu$, where $\Gamma$ is the gamma function and $K_\nu$ is the modified Bessel function of the second kind:
$$k(x_i, x_j) = \frac{1}{\Gamma(\nu)\,2^{\nu-1}}\left(\frac{\sqrt{2\nu}}{l}\, d(x_i, x_j)\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}}{l}\, d(x_i, x_j)\right)$$

DotProduct kernel:
$$k(x_i, x_j) = \sigma_0^2 + x_i \cdot x_j$$
The DotProduct kernel differs from the RBF, RQ, and Matern kernels in that it is a non-stationary kernel; it is parameterized by $\sigma_0^2$ and expressed as a dot product of two vectors. During the algorithm's iterations, the most suitable kernel is selected. Radial-basis-type kernels are generally preferred for regression problems because of their lower computational complexity and their reliance on feature distances, which makes them suitable for high-dimensional data. By adjusting the smoothness of the radial basis function and combining radial basis functions of varying length scales, as the RQ kernel does, the kernel can be tailored more accurately to real-world applications.
The algorithm was benchmarked against the advanced Dragonfly algorithm. The tests were conducted on a chemical reaction model based on the Arrhenius equation, proposed by Jensen et al. (for details, refer to "computerized benchmarking of algorithms" in the ESI†).17 Different prior data sampling methods were tested, including random sampling across the entire parameter space, sampling within a localized parameter space, and Latin hypercube sampling across the entire parameter space. When optimizing with 10 sets of prior data, neither random sampling nor Latin hypercube sampling had a significant impact on model optimization. The SOBayesian algorithm accurately captured the influence of the features on the model and provided reasonable predictions. Additionally, the impact of experimental errors on the model was assessed using a training dataset with a manually introduced 5% error. Compared with optimization using the correct values, SOBayesian provided the same optimization scheme with only a slight difference in the predicted yield (4.56%). This indicates that the designed SOBayesian algorithm can, to some extent, mitigate the influence of experimental errors. It also demonstrates that the algorithm is suitable both for precise, fully automated machine experiments and for batch experiments conducted by researchers for reaction optimization. Furthermore, in the comparison with the Dragonfly algorithm, a random seed was set to ensure the reproducibility of the optimization.
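For orientation, a generic Arrhenius-type yield function of the kind used for such benchmarks is sketched below; the first-order form and rate parameters are illustrative stand-ins, not the exact model of ref. 17 or the ESI.

```python
import numpy as np

R = 8.314  # gas constant, J mol^-1 K^-1

def simulated_yield(time_min, temp_C, A=1.0e7, Ea=55.0e3):
    """Toy first-order Arrhenius objective: yield (%) from the rate constant k(T).
    A (pre-exponential factor, 1/min) and Ea (activation energy, J/mol) are
    illustrative values, not those of the benchmark model."""
    k = A * np.exp(-Ea / (R * (temp_C + 273.15)))
    conversion = 1.0 - np.exp(-k * time_min)
    return 100.0 * conversion
```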
The influence of the different sets of prior data on model training varied. Using 5 sets of prior data for prediction, reaction temperature, xantphos equivalents, and retention time were found to influence yield positively, while anisamide and DBU equivalents had no significant effect on yield prediction (Fig. 5c). After 16 iterations of optimization, every feature affected the yield positively (Fig. 5d). The initial prediction with 14 sets of prior data showed a negative effect of reaction temperature and anisamide equivalents on yield, attributed to outliers in the 14 datasets that may have been caused by experimental errors. However, after 15 iterations of optimization, every feature showed a positive effect on yield prediction, in line with the final optimization result starting from 5 sets of prior data. Less training data can impair the model's ability to generalize: training on only 5 datasets showed no discernible effect of anisamide and base equivalents, and training on 14 datasets showed no effect of base equivalents. After optimization using 21 and 29 sets of data, respectively, both models showed positive effects for all features. Each feature ultimately had a positive impact because the normal training data far outnumbered the anomalous training data, enabling the model to correct itself. As the optimization progressed, the algorithm overcame the negative impact of human error in the experiments. Nevertheless, it is crucial to emphasize the importance of data quality: high-quality data can significantly reduce the time and cost of training and optimization, thereby facilitating the development of an accurate model.
The importance of each feature before and after optimization was assessed for the runs starting from 5 and 14 sets of prior data. With 5 sets of prior data, all features exhibited increased importance after optimization, indicating a growing contribution of each feature to the model's prediction as the optimization proceeds and an improvement in the model's fitting ability as the data increase. Reaction time contributed the most to the predictions, followed by reaction temperature. With 14 sets of prior data, the contribution of the other four features decreased after optimization, except for reaction temperature, which contributed the most at the end of the optimization, followed by xantphos equivalents. This observation is consistent with the results of the parameter optimization.
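One common way to estimate such feature contributions for a fitted surrogate is permutation importance, sketched below with scikit-learn; this is an illustrative approach, not necessarily the attribution method used to generate Fig. 5.

```python
from sklearn.inspection import permutation_importance

FEATURES = ["time", "equiv anisamide", "temperature", "equiv base", "equiv ligand"]

def feature_contributions(model, X, y, n_repeats=10):
    """Permutation importance of each reaction variable for a fitted surrogate.
    X: (n_samples, 5) array of conditions; y: corresponding measured yields."""
    result = permutation_importance(model, X, y, n_repeats=n_repeats,
                                    scoring="neg_mean_squared_error", random_state=0)
    return dict(zip(FEATURES, result.importances_mean))
```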
Algorithms with higher feature contributions will generally produce higher predicted values during prediction. The GPR model, constructed with n-fold cross-validation, hyperparameter optimization, and kernel iteration, is suitable for regression prediction on small datasets and exhibits good accuracy and generalization ability. In particular, employing the self-optimizing Bayesian algorithm for optimization with 5 and 14 sets of prior data yielded consistent optimization results while reducing the total number of experiments by 27.6%. This improvement in efficiency can significantly enhance the computer-aided optimization of continuous flow processes, unlocking further potential for subsequent process optimization in continuous flow synthesis.
Footnote
† Electronic supplementary information (ESI) available: Supporting information provides reaction parameters and benchmarking data from the SOBayesian optimization experiments. See DOI: https://doi.org/10.1039/d4dd00223g