Adam D.
Clayton†
a,
Jamie A.
Manson†
a,
Connor J.
Taylor†
a,
Thomas W.
Chamberlain
a,
Brian A.
Taylor
b,
Graeme
Clemens
b and
Richard A.
Bourne
*a
aInstitute of Process Research and Development, School of Chemistry & School of Chemical and Process Engineering, University of Leeds, LS2 9JT, UK. E-mail: r.a.bourne@leeds.ac.uk
bAstraZeneca Pharmaceutical Development, Silk Road Business Park, Macclesfield, SK10 2NA, UK
First published on 1st August 2019
Self-optimising chemical systems have experienced a growing momentum in recent years, with the evolution of self-optimising platforms leading to their application for reaction screening and chemical synthesis. With the desire for improved process sustainability, self-optimisation provides a cheaper, faster and greener approach to the chemical development process. The use of such platforms aims to enhance the capabilities of the researcher by removing the need for labor-intensive experimentation, allowing them to focus on more challenging tasks. The establishment of these systems have enabled opportunities for self-optimising platforms to become a key element of a laboratory's repertoire. To enable the wider adoption of self-optimising chemical platforms, this review summarises the history of algorithmic usage in chemical reaction self-optimisation, detailing the functionality of the algorithms and their applications in a way that is accessible for chemists and highlights opportunities for the further exploitation of algorithms in chemical synthesis moving forward.
These advances in the automation of laboratory equipment have simultaneously led to a rise in the use of algorithms in chemistry.10,11 With developments in automation enabling chemists to make better use of the human resource by assigning routine, labour intensive tasks to machines.12,13 More recently, machine learning algorithms have been applied to more challenging tasks, such as the discovery of new chemical reactivity14 and the prediction of reaction outcomes.15
Automated flow systems are able to search large regions of experimental space in relatively short periods of time, making them well suited for optimisation problems.16,17 Optimising processes by combining flow reactors, process analytics and optimisation algorithms is known as ‘self-optimisation’. The reaction mixture is analysed and the responses are supplied to an optimisation algorithm. The algorithm then generates the next set of conditions to explore based on the results of the previous experiments, thereby creating a feedback loop.18 Intelligent analysis of the experimental space reduces the number of experiments required, providing a faster, cheaper and ‘greener’ method for process development. Self-optimising systems provide an enabling technology for efficient optimisation of expensive-to-evaluate chemical systems. As such, algorithms used in self-optimisation typically focus on minimising the number of experiments and material consumed during the optimisation process. Given that the algorithm selected by the user has a significant influence on the efficiency of the optimisation, there will remain a continued interest in the development of algorithms for self-optimising systems.
Self-optimising systems are developed at the interface between chemistry, chemical engineering and computer science. For self-optimising systems to become more commonplace in research laboratories, the end-user (e.g. chemists) require a basic knowledge of the types of algorithms available. However, reviews in this area have to date focused primarily on the monitoring of reactions, and less on the algorithms employed.19,20 In this review, we provide an overview of the algorithms which have been used for self-optimisation to date (Fig. 1), including explanations designed to aid chemists in their choice of the most appropriate algorithm for a given synthetic challenge.
Methods to improve the standard rigid DoE design, to allow for adjustment based on the responses from a given process, have been attempted in a self-optimising chemical environment. The Jensen group first utilised an optimal DoE approach for the optimisation of the alkylation of 1,2-diaminocyclohexane for discrete (e.g. solvents, catalysts, ligands) and continuous variables (e.g. reaction time, temperature, reagent equivalents).24 The optimisation was initialised using a screening set of experiments, from which a linear response surface was fitted. In performing only screening experiments, elucidation of potential response curvature is not possible, however, this is contrasted by the conservation of resources. Following initialisation, further experiments were performed focusing on determining the optimum point for each solvent. Fitting of a quadratic model was performed and a paired 2-sample t-test allowed for the elimination of poorly performing solvents. Optimal regions for the remaining solvents were determined by applying G-optimality, which aims to design experiments which minimise the maximum variance of the models predicted values.25
The algorithm has been further refined, initially enhancing the handling of discrete variables via the addition of a mixed integer nonlinear programming (MINLP) approach,26 with discrete variables removed when performance was poor. MINLP refers to optimisation tasks involving both continuous and discrete variables, with nonlinearities in the response. Further alterations were made to improve the initial space filling experimental design, being led by D-optimality, with additional improvements to the discrete variable reduction process.27,28 D-optimal designs seek to minimise the covariance (uncertainty) of the parameter estimates for a specific model.29 For the catalytic reaction studied, the authors assumed the system could be modelled as a logarithmic model derived from the assumption that the reaction had a single rate determining step. The requirement for an assumed model derivation could present an issue for the application of the algorithm in kinetically complex systems and for general purpose use without a priori knowledge.18 Additionally, the use of a paired 2-sample t-test for solvent elimination has the potential to miss the best conditions due to synergistic effects. To date this algorithm presents the only documented self-optimisation of both discrete and continuous variables in a chemical system, with scope for the field to expand in this area.
Fig. 2 The different geometric transformations of the Nelder–Mead simplex: inside contraction (XIC), multiple contraction (MC), outside contraction (XOC), reflection (XR) and expansion (XE). |
Fig. 3 shows an example of how NMSIM can find the local minimum of a response function in a two-variable design space. The initial vertices of simplex 1 are evaluated via the response function, then the worst vertex is replaced via reflection to form simplex 2. Similarly, the worst vertex of simplex 2 is replaced via reflection to form simplex 3. These successive transformations enable the simplex algorithm to converge on an optimum. The optimisation typically stops when a better response function evaluation cannot be found, indicating that a local optimum has been identified. One of the first times this algorithm was applied to self-optimisation was in the Heck reaction, and represented one of the earliest examples of a self-optimising chemical platform.31 A publication by Krishnadasan et al. from the same year shows the use of simplex methods for optimisation of nanoparticle production. The authors utilise a dynamic simplex to apply compensation in the case of system-drift (unforeseen changes in the system).32 Further work by Cronin and co-workers successfully coupled the algorithm with in-line NMR to optimise an imine synthesis, with a total of 29 experiments, utilising a custom cost function.33
Fig. 3 An example of a two-variable design space with arbitrary variables and a mapped response surface, showing how NMSIM converges on the minimum. Minima (blue), maxima (red). |
The super modified simplex algorithm (SMSIM) is an adaptation of NMSIM, originally introduced by Routh et al.,34 whereby polynomial fitting of data points determines the optimum simplex transformations. As these transformations are based on predictions from the polynomials generated, the algorithm can accelerate across areas of low interest. Fig. 4 shows an initial simplex formed of data points 1, 2 and 3, where 3 is the worst result. The midpoint of 1 and 2 is then measured as 4, which is the ‘centroid’. The centroid is the point at which vertex 3 will be reflected through at distance XRα to vertex 5, which is also measured. A second order polynomial through data points 3, 4 and 5 is then constructed and extrapolated to identify the optimum expansion distance XRβ to vertex 6.35 Many notable modifications to NMSIM have been reported by Felpin et al. which were used in the self-optimisation of the Heck–Matsuda reaction and in the natural product synthesis of carpanone, over four stages with a total of 66 experiments.36,37 These modifications include: boundary and linear constraints on variables (such as temperature, or temperature given a concentration), dimensionality reduction upon exploring boundaries, dimension recovery to re-enter the design space (Fig. 5), diversification by searching unexplored regions and the ability to have multiple termination criteria (such as when all material has been consumed). Other modifications to NMSIM such as complex method38 have also been used for self-optimisation, such as in the amidation of 3-cyanopyridine.39 This algorithm works in a very similar way to SMSIM, however, instead of basing the expansion distances on predicted optimal regions via polynomial fitting, an iterative process is employed. This iterative process begins by measuring a vertex at a given expansion distance. If the measurement is worse than the current data points, it is rejected. Additional measurements are then taken at incrementally shorter distances along the same direction, until a better evaluation is found. Gradient-based methods are another form of local optimisation that typically converge on the optimum by following the initial trajectory of the local response surface. The steepest descent method40 is a gradient-based algorithm first used in a chemical system by McMullen and Jensen in combination with DoE for the optimisation of a Knoevenagel condensation reaction.41 In this method, an initial 2k orthogonal design (where k is the number of variables) or central composite design (CCD) DoE is performed around a particular starting point. A local response surface is then modelled, from which the gradients are calculated. Further experiments are performed along the trajectory of the gradient until the response function value worsens, indicating that the optimum has been passed or that a change of search direction is necessary to proceed, as shown in Fig. 6. Modifications to the steepest descent method, such as conjugate gradient and Armijo conjugate gradient, have also been used in the self-optimisation of the Paal–Knorr synthesis.42
The conjugate gradient algorithm utilises the weighted sum of the last search direction and the direction calculated via the steepest descent method to determine the next iteration. This prevents large direction adjustments which are less likely to lead the algorithm through difficult response surface terrain.43 The Armijo algorithm differs by implementing an Armijo-type line search.44 This varies the step size along each trajectory, which was shown to out-perform the other steepest descent algorithms by reaching a similar optimum in fewer experiments.41,42 Where gradient information is available this can offer faster convergence, however, this can be complicated in the presence of experimental noise and the need to fit a mathematical surface. Real-time analysis of transient experiments in the future may enable experimentally measured gradients to be utilised although to the authors' knowledge this has not been demonstrated yet.
Many local optimisation algorithms are typically fast to converge on an optimum, as each successive iteration of the algorithm makes progressive improvements by moving perpendicularly to the contour of the response-surface. Variants of both the simplex and steepest descent algorithms have been shown to converge on optima within a self-optimising reaction system, each with their own advantages in terms of robustness and experimental efficiency. The disadvantage of using local optimisation tools is when there is no a priori knowledge about the reaction system; as the complexity of the system arising from variable–variable interactions can lead to multiple optima. In these cases, there is no guarantee that the global optimum will be found over a local optima before termination.
When considering a local optimisation algorithm it must therefore be assumed that the chemical system of interest has a single optimum, otherwise a global optimisation tool may be more relevant.
Consideration of optimisation time may also be worthwhile when selecting a local or global optimiser. Due to global searching, experimental points are on average further apart in the experimental space, meaning the reactor system can take correspondingly longer to reach the new conditions.45
Stable Noisy Optimisation by Branch and Fit (SNOBFIT) is a global optimisation algorithm for bound constrained noisy optimisation of objective functions.46 To date, this is the only single-objective optimiser which has been successfully implemented in self-optimisation. It is a derivative-free optimisation method, which means that it requires no gradient information of the objective function being optimised. The algorithm uses a combination of stochastic linear and quadratic surrogate models to determine the optimum point of the system. A surrogate model is an approximate model that maps the process inputs to an objective function.47 They provide a cheap alternative which can be called in lieu of the parent function to improve optimisation efficiency, with the stochastic nature of the models enabling the algorithm to handle noise effectively.
A basic flow diagram detailing a simplified overview of SNOBFIT is shown in Fig. 7. The algorithm generates points in five different classifications:
- Class 1: the point that minimises the local quadratic model around the current best point (xbest). It contains at most one point
- Class 2: are points that are approximate local minimisers. If there are no local points then no points in class 2 are generated
- Class 3: are points that are approximate nonlocal minimisers
- Class 4: are points in regions that are yet to be explored
- Class 5: are points that are randomly generated to fill the design space. They are only generated if the number of evaluated points is less than the number required. The number required is set by the user upon initialisation.
The first documented use of a self-optimising chemical platform utilised SNOBFIT as the optimisation algorithm.48 The authors optimised, within 100 measurements, for a target wavelength at the outlet of the reactor which corresponded to the desired nanoparticle properties. The algorithm was selected due to its global nature and ability to handle noisy data, making it an ideal fit to optimise complex systems such as the synthesis of nanoparticles.
The main advantage of using SNOBFIT is the higher confidence that the optimum found will be the global for the system. Fig. 8 illustrates this advantage where SNOBFIT is able to determine the region of the global optimum whereas the simplex method gets stuck in a local minimum. Given the global nature of the algorithm it can require an increased number of iterations to converge upon the optimum when compared with local methods.45 This is due to the random search element of the design which explores regions for which the objective function has not been evaluated. This coupled with the variety of literature sources documenting its use and the robustness of the algorithm in the presence of noise has led to SNOBFIT being used in self-optimising platforms41,49–52 as well as a wide range of other optimisation tasks. It has been noted that the algorithm can struggle for high dimensional problems (when the input dimension >9).53 This could present issues when applying the algorithm to telescoped reactions, where the input variable space is large. However, this is not normally an issue for single stage synthesis. In addition to issues with high dimensionality SNOBFIT can only be used for the optimisation of continuous variables. This removes the possibility for optimisations containing both discrete and continuous variables.
Fig. 8 Comparison of SNOBFIT (orange squares) and simplex (black dots) for the minimisation of a complex function, restricted to 30 evaluations. Global minimum is indicated by a blue cross. |
An alternative approach to system optimisation is to utilise kinetic model fitting. The models can then be used to suggest optimal operating conditions, provided that the kinetic equations are deconvoluted from mass transfer effects. Suggesting optimal experiments for kinetic model elucidation can be performed through model-based design of experiments. As this approach does not directly search for optimal system conditions it will not be discussed here, however, the reader is directed to the following examples should they require further information.54–56
The actual solution to a multi-objective optimisation problem is a set of non-dominated solutions called the Pareto front, where a non-dominated solution is one which cannot be improved without having a detrimental effect on the other.58 Hence, a constrained multi-objective maximisation problem where variable vector x = {x1, …, xn} is formulated as follows. In the objective space, find variable vector x* which maximises K objective functions y(x*) = {y1(x*), …, yK(x*)}, where the objective space is restricted by bounds on the variables. A feasible solution a dominates another feasible solution b when yi(a) ≥ yi(b) for i = 1, …, K and yj(a) > yj(b) for at least one objective j (Fig. 9).59
One approach to multi-objective optimisation is scalarisation, where objectives are combined into a single objective function with different weightings, w [eqn (1)].
(1) |
This was used by Fitzpatrick et al. to simultaneously optimise throughput, conversion and consumption for an Appel reaction.39 An alternative approach by Krishnadasan et al. utilised a weighted-product objective function for the optimisation of CdSe nanoparticles.48 For both methodologies, it is difficult to define suitable weightings without substantial a priori knowledge, and minor changes to these weightings can result in significant changes to the solution obtained. Furthermore, weighting methods fail to reveal to complete trade-off in a practical number of experiments, as only one Pareto optimal solution is identified per optimisation.60 Further work by Walker et al. utilised a constrained optimisation approach to optimise a primary objective whilst constrained within the predefined bounds applied to other objectives.61 Although this approach provides an improved means of scalarisation compared to previous work, it still fails to identify complete trade-off between objectives. In contrast, evolutionary algorithms, such as non-dominated-sort genetic algorithm (NSGA-II), are designed to converge on the Pareto front using a Pareto dominance ranking system.62 However, the requirement of a large population size has deterred their use in self-optimisation.
Bayesian optimisation is a broad category of derivative free (requires no gradient information) global optimisation methods that utilise surrogate models to optimise expensive-to-evaluate objective functions.63 The surrogate model is built using sampled data from the process/objective to be optimised. Understanding the mechanistic concepts behind the input–output relationship is not important at this stage, with the model considered to be a ‘black box’. Once constructed, the surrogate model is utilised in conjunction with an acquisition function to suggest the next evaluation point, to maximise or minimise an objective function.64 An acquisition function is a function which balances exploration (searching regions which currently have no data points) and exploitation (focusing near regions of known good performance) and is maximised after each iteration to determine the next sampling point. Often the surrogate model will be of the form of a Gaussian process (GP) which is computationally and resource-wise more efficient to evaluate than the actual process. A GP model is a collection of random variables, for which any discrete point has a Gaussian (normal) distribution.65 A GP model is characterised by a mean function and a covariance function. The mean function defines the expected output for a given set of inputs, with the covariance function describing the statistical relationship between two points in the input space. Points that are close together are considered similar and this is reflected in the covariance function. A noise term can be introduced when calculating the covariance for the process. This enables Bayesian optimisation algorithms to handle noisy data associated with experimental platforms. The requirement for hyperparameters (algorithm settings) can be considered a drawback of Bayesian methodologies. The setting of these hyperparameters can play a significant role in determining the goodness of fit for the GP model and can in turn affect the performance of stochastic optimisation algorithms that use GP at their core. One example of a hyperparameter is the type of covariance function used in fitting the GP model. Fig. 10 highlights the impact of covariance function on how well the model fits the data. Ensuring hyperparameters are optimised and robust is key to developing a Bayesian optimisation algorithm. The libraries of GPy66 and GPyOpt67 were used for sampling and example optimisations.
Fig. 10 Comparison of fit for two Gaussian models with different covariance functions: Matern 5/2 (red) and squared exponential (blue). Training data is shown as black dots. |
An acquisition function is used to determine the next evaluation point in Bayesian optimisation. Fig. 11 illustrates the sequential optimisation approach adopted by Bayesian optimisation techniques, with each figure representing an iteration of the algorithm. The example shown is for a single-objective problem but can be extended to multi-objective optimisations. For each iteration the acquisition function is calculated based upon the current available data and the surrogate model for the process. There are many different acquisition functions used in Bayesian optimisation tasks, with the development of acquisition functions being a constantly evolving field.68 Some of the more common acquisition functions are expected improvement, probability of improvement, upper confidence bounds and Thompson sampling.69,70 Mixtures of acquisition functions can also be used. Expected improvement was used for Fig. 11, with the algorithm initially exploring the search domain and then focusing on the region where the function optimum is likely to be.
The multi-objective active learner (MOAL) was designed for expensive-to-evaluate multi-target optimisation tasks.71 The algorithm was successfully applied to the optimisation of an emulsion polymerisation recipe with 14 input variables, simultaneously targeting a conversion of ≥99% and particle diameter of 100 ± 1 nm.72 A similar machine learning methodology was used to design the Thompson sampling efficient multi-objective optimisation (TSEMO) algorithm.73 The TSEMO algorithm was shown to compare competitively with other multi-objective Bayesian optimisation algorithms such as Pareto efficient global optimisation (ParEGO)74 and expected hypervolume improvement (EHI).75 In addition, it can readily be used for batch-sequential design making it well suited for integration with self-optimising platforms.
The algorithmic procedure is displayed in Fig. 12. Initially, a small dataset is collected using a random space-filling set of experiments, which is then used to build a GP surrogate model.76 The algorithm randomly samples from the GPs and uses the NSGA-II algorithm to identify the Pareto front of each random sample (Thompson sampling). Through randomly sampling this accounts for the exploration/exploitation trade-off desirable in Bayesian optimisation. The candidate experiment which gives the largest predicted hypervolume improvement is then performed, where hypervolume is defined as the volume of objective space dominated by the current set of solutions (Fig. 13). The reference point, R, for this calculation is usually defined as the anti-utopian point (the worst point with respect to all objective functions). The hypervolume indicator is a favored metric as it considers both the convergence and diversity of the front, where diversity refers to how well-distributed the optimal solutions are along the Pareto front.77 The combined use of random GP sampling and hypervolume improvement accounts for the desired trade-off between exploration and exploitation respectively. The GP surrogate model is then updated and the procedure repeated iteratively until the predefined maximum number of experiments is reached. Research in our group applied the TSEMO algorithm for the self-optimisation of an aromatic nucleophilic substitution (SNAr) and N-benzylation reaction, focusing on E-factor, space–time yield (STY) and impurity profiles as objectives.78 The algorithm was able to converge on the Pareto front in a similar number of experiments, 68 and 78 experiments respectively, as previously reported single-objective optimisations, thus providing a greater amount of information per experiment. Notably, identification of a set of solutions and presentation of a front enables a posteriori decisions to be made regarding the desired development, where process specifications are often dynamic. The inclusion of discrete variables in future work will broaden the scope of multi-objective self-optimisation to a wider variety of processes.
Fig. 12 Flow diagram for the TSEMO algorithm. Neval is the current number of evaluations and Nmax is the maximum number of evaluations. |
Footnote |
† These authors contributed equally to the preparation of the manuscript. |
This journal is © The Royal Society of Chemistry 2019 |