Jonas
Verhellen
Centre for Integrative Neuroplasticity, University of Oslo, N-0316 Oslo, Norway. E-mail: jverhell@gmail.com
First published on 2nd June 2022
Computer-assisted design of small molecules has experienced a resurgence in academic and industrial interest due to the widespread use of data-driven techniques such as deep generative models. While the ability to generate molecules that fulfil required chemical properties is encouraging, the use of deep learning models requires significant, if not prohibitive, amounts of data and computational power. At the same time, open-sourcing of more traditional techniques such as graph-based genetic algorithms for molecular optimisation [Jensen, Chem. Sci., 2019, 12, 3567–3572] has shown that simple and training-free algorithms can be efficient and robust alternatives. Further research alleviated the common genetic algorithm issue of evolutionary stagnation by enforcing molecular diversity during optimisation [Van den Abeele, Chem. Sci., 2020, 42, 11485–11491]. The crucial lesson distilled from the simultaneous development of deep generative models and advanced genetic algorithms has been the importance of chemical space exploration [Aspuru-Guzik, Chem. Sci., 2021, 12, 7079–7090]. For single-objective optimisation problems, chemical space exploration had to be discovered as a useable resource but in multi-objective optimisation problems, an exploration of trade-offs between conflicting objectives is inherently present. In this paper we provide state-of-the-art and open-source implementations of two generations of graph-based non-dominated sorting genetic algorithms (NSGA-II, NSGA-III) for molecular multi-objective optimisation. We provide the results of a series of benchmarks for the inverse design of small molecule drugs for both the NSGA-II and NSGA-III algorithms. In addition, we introduce the dominated hypervolume and extended fingerprint based internal similarity as novel metrics for these benchmarks. By design, NSGA-II, and NSGA-III outperform a single optimisation method baseline in terms of dominated hypervolume, but remarkably our results show they do so without relying on a greater internal chemical diversity.
In current molecular generative model benchmarks,13 typically either the arithmetic mean or the geometric mean of the objective is chosen as a stand-in aggregate fitness function. To give relative importance to the different objectives, domain experts can assign weights to them or combine appropriate modifying functions to obtain a single, fine-tuned objective function. However, many fields of science and engineering make use of an alternative approach to multi-objective optimisation by searching for a set of so-called Pareto optimal solutions.19 All solutions in a Pareto optimal set are characterised by the fact that there are no other individual solutions that have a higher (or equal) fitness in all objective functions. Together, the set of Pareto optimal solutions form an optimal envelope in objective space known as the Pareto front, see Fig. 1.
The Pareto front provides a family of solutions, all equivalent in principle, aiding domain experts to make choices when trade-offs between objectives are not known beforehand. Over the past two decades, a set of algorithms known as the non-dominated sorting genetic algorithms20 (NSGA) has been developed for finding Pareto fronts. In a complex process, such as drug design, having access to a technique complementary to single objective optimisation, can yield deeper insights and improve efficiency. Therefore, in this paper, we provide the community with state-of-the-art and open-source implementations of the NSGA-II and NSGA-III algorithms21–23 based on a popular graph-based genetic algorithm24 (GB-GA) for molecular optimisation.
A newer generation of NSGA algorithm, NSGA-III, which uses a more complex means of ensuring coverage of the entire Pareto front, was originally reported to be an improvement over NSGA-II. However, later analyses25,26 have shown that for a wide range of computational experiments NSGA-III does not consistently outperform NSGA-II in every use-case. Therefor we compare the performance of NSGA-III and NSGA-II on a set of small molecule multi-objective optimisation benchmarks, making use of the dominated hypervolume as a novel measure of the effectiveness in these type of problems. As a baseline, we make use of a state-of-the-art single-objective optimisation algorithm that employs the geometric mean as a surrogate aggregate fitness function. Whereas proprietary applications of NSGA-II to molecular design have been reported,27,28 there is a lack of open-source implementations of both NSGA-II and NSGA-III for the inverse design of small molecules. We anticipate that our results and the availability of the code will encourage the development of more powerful Pareto optimisation algorithms for chemistry as well as their widespread adoption in computer-assisted chemical design.
For small molecule optimisation, these ideas can be implemented by representing solutions (i.e. molecules) by either their molecular graphs, or by text representation such as the simplified molecular-input line-entry system32 (SMILES) or self-referencing embedded strings33 (SELFIES). The graph representation has been used in the graph-based genetic algorithm (GB-GA) which was shown to outperform machine learning approaches.24 In Fig. 2, we show examples of mutations and crossovers on molecular graphs. To rule out graphs that represent impossible chemical configurations, only those that can be correctly translated to and from SMILES are retained. The initial population of candidate molecules is typically obtained from public databases like ZINC34 or ChEMBL.35
Alternatively, the superfast traversal, optimisation, novelty, exploration and discovery algorithm40 (STONED) leverages molecular diversity through the use of SELFIES. In contrast to the more traditionally used SMILES, SELFIES can be mutated arbitrarily at any position in the string to produce new strings that represent valid molecular structures. The STONED algorithm uses this property of SELFIES to preserve diversity in its population. By varying the position of modification within the string, the algorithm balances exploration and exploitation to avoid stagnation in low-performing valleys or local optima.
Non-dominated sorting genetic algorithms20 are, in essence, genetic algorithms that evaluate and select on the Pareto dominating status of each solution in the evolutionary population as shown in Fig. 3. Instead of selecting molecules based on a fitness function, these algorithms sort all solutions into a series of fronts, see Fig. 4(a), each front dominated by the previous fronts. The first front (dark blue) is the set of completely non-dominated individuals in the current population, the second front (light blue) is the set of individuals dominated only by the individuals in the first front, and so on for all other fronts formed by the remaining individuals in the population (white). The algorithm accepts the fronts, with all of its individuals, into the evolutionary population in ascending order, until the maximum size of the evolutionary population has been reached.
Fig. 3 Pseudocode description of a generic non-dominated sorting genetic algorithm adapted to the setting of molecular optimisation. |
The final front accepted by a non-dominated sorting genetic algorithm might, and often will, contain more individuals than can be added to the surviving evolutionary population without exceeding its size limit. This set of individuals is known in the multi-objective optimisation community as the splitting front.20 Because there is no difference between the individuals in the splitting front in terms of Pareto dominance, further criteria are used to select which individuals are retained and which are discarded. In the splitting front selection procedure for non-dominated sorting genetic algorithms, this criteria is typically a measure of diversity. The NSGA-II and NSGA-III algorithms both rely on a diversity criteria, but differ significantly in how they enforce this diversity, see Fig. 4(b) and (c).
If a reference direction does not have any solution assigned to it after reaching the splitting front, then the molecule in the splitting front with the smallest perpendicular distance to this direction is selected for survival. If all underrepresented reference directions have been assigned one surviving solution, and the maximum size of the surviving population has not been reached, the remaining solutions are selected by a stochastic procedure. Note that NSGA-III selects the solutions in the fronts before the splitting front in its entirety, like in NSGA-II. However, contrary to NSGA-II's crowding distance which is calculated within the splitting front, the reference directions used in NSGA-III take into account the diversity of the entire surviving population.
To alleviate the issues of the Das–Dennis method, an energy-based approach has recently been proposed43 in the multi-objective optimisation literature. Inspired by methods in physics, a generalisation of the potential energy called the Riesz s-energy45 is calculated for a given number of reference points on the unit simplex. The Riesz s-energy Us is defined between two points p1, p2 in s-dimensional Euclidean space as,
(1) |
The location of the points along of the unit simplex are then optimised to minimise the combined Riesz s-energy of all the reference points. This allows for the construction of an arbitrary number of well-spaced reference directions. The results in this paper were obtained using the Riesz s-energy method to generate the reference directions for NSGA-III, with s equal to the square root of the number of objective functions as suggested in the original paper.43
Similarly, we follow the example of GB-EPI to apply the computational equivalent of in vitro positional analogue scanning46 by repurposing the mutation operator to systematically return not just a single mutation of a molecule, but all of its positional analogues. To offset the computational overhead introduced by positional analogue scanning and to improve efficiency in general, we store a record of obtained fitness calculations. This approach is known as memoisation47 and ensures that an algorithm does not unnecessarily repeat calculations. To further reduce clock time, we also implemented concurrency for the objective function evaluations and remove undesirable compounds based on structural ADMET filters48–50 before they enter the evaluation step of the algorithm.
The objectives in these benchmarks, as shown in Table 1, are either similarity metrics that measure the distance to the corresponding drug molecule, or specific properties such as the amount of rotatable bonds in a molecule, the topological polar surface area52 (TPSA) or the lipophilicity partition coefficient53 (log(P)). The similarity metrics are calculated using the Tanimoto similarity,54,55 of the fingerprints of the target and the generated candidate molecule. The fingerprints used here are either extended-connectivity fingerprints56,57 (ECFP/FCFP) which encode molecular structures in terms of concentric atomic neighbourhoods, or atom-pair fingerprints58 (AP) which encode molecules based on their atom pairs and their bond distance. The main advantage of fingerprint-based similarities compared to more involved similarity measures is that they can be rapidly calculated and inherently represent the presence or absence of molecular substructures or atom pairs.
Task\objective | I | II | III | IV | V |
---|---|---|---|---|---|
Cobimetinib | |||||
Tanimoto(FCFP4) | Tanimoto(ECFP6) | Rotatable bonds | Aromatic rings | CNS(0.5) | |
Clipped(0.7) | MinGaussian(0.75, 0.1) | MinGaussian(3, 1) | MaxGaussian(3, 1) | — | |
Fexofenadine | |||||
Tanimoto(AP) | TPSA | log(P) | — | — | |
Clipped(0.8) | MaxGaussian(90, 10) | MinGaussian(4, 1) | — | — | |
Osimertinib | |||||
Tanimoto(FCFP4) | Tanimoto(ECFP6) | TPSA | log(P) | — | |
Clipped(0.8) | MinGaussian(0.85, 0.1) | MaxGaussian(95, 20) | MinGaussian(1, 1) | — | |
Pioglitazone | |||||
Tanimoto(ECFP4) | Molecular weight | Rotatable bonds | — | — | |
Gaussian(0, 0.1) | Gaussian(356, 10) | Gaussian(2, 0.5) | — | — | |
Ranolazine | |||||
Tanimoto(AP) | log(P) | TPSA | Fluorine count | — | |
Clipped(0.7) | MaxGaussian(7, 1) | MaxGaussian(95, 20) | Gaussian(1, 1) | — | |
DAP kinases | |||||
hERG | SCN2A | DAPk1 | DRP1 | ZIPk | |
Gaussian(0, 0.1) | Gaussian(0, 0.1) | Clipped(0.8) | Clipped(0.8) | Clipped(0.8) | |
Antipsychotics | |||||
hERG | 5-HT2A | 5-HT2B | DRD2 | CNS(0.5) | |
Gaussian(0, 1.0) | Clipped(0.8) | Clipped(0.8) | Clipped(0.8) | — |
The raw scores obtained from similarity or property measurements are post-processed by modifier functions that map the scores to the [0, 1] interval and allow the objective to be fine-tuned. The modifier functions used in this paper are Clipped(value), Gaussian(mean, variance), MinGaussian(mean, variance), and MaxGaussian(mean, variance). The Clipped modifier is a thresholded modifier in which values above a given threshold are mapped to one, while values below threshold decrease linearly to zero. The Gaussian modifiers target a specific value, returning high scores when the underlying value is near the target. The Min and Max versions of this modifier map the input value to one if it is lower or higher than the target value, respectively. For example, in the fexofenadine benchmark a molecule with a Tanimoto similarity higher than 0.8, a TPSA above 90.0 and a log(P) below 4.0 would score perfectly on each objective. More information on the modifiers can be found in the ESI† accompanying the Guacamol paper.13
Precise evaluation of generative models in terms of their value to pharmaceutical drug design programs can be challenging. To increase relevance, with respect to real-life drug design projects, while maintaining the efficient benchmark evaluations necessary for iterative design and statistical analysis, we integrate an existing data-driven surrogate model for target activity into the Guacamol benchmarking suite.13 We make use of a previously proposed surrogate model,59 minding the separation of concerns,60 that has been used to study failure modes in molecule generation. This model ranks molecules based on the ratio of trees in a random forest classifier, trained on ChEMBL activity data,35 predicting that the molecule is active. In the model, binary ECFP fingerprints57 of size 1024 and radius 2 are used as features.
In this paper, we provide two novel benchmarks for Pareto optimisation making use of this model. Inspired by the demands of a multi-target drug discovery project,61 we have constructed a multi-kinase inhibitor task and a multi-neuroreceptor binding antipsychotics task. In the kinase inhibitor task, we aim for molecules that inhibit three DAP kinases62 (DAPk1, DRP1, and ZIPk) often implicated in cancer while trying to avoid activity against common off-target ion channels63,64 (hERG, and SCN2A). In the ongoing search for novel anti-psychotic medication, focus has shifted65 to combined binders of serotonergic receptors (5-HT2A, and 5-HT2B) and a more classical target: the dopaminergic DRD2 receptor. In the multi-receptor antipsychotica task, we target these three receptors, and aim to avoid an off-target ion channel (hERG) while fulfilling the Pfizer central nervous system desirability requirements.
The dominated hypervolume (also known as Lebesgue measure67 or S-metric68) maps a set of points in objective space to the size of the region Pareto dominated by that set. The hypervolume has to be bounded from below by a reference point, which for the purposes of this paper will systematically be chosen to be the origin of objective space. The dominated hypervolume simultaneously takes into account the proximity of the points to the ideal Pareto front and their spread over the objective space. For problems with less than five objectives, the dominated hypervolume can be calculated exactly. However, for higher-dimensional multi-objective optimisation problems, calculating the dominated hypervolume precisely can be computationally expensive and hence a smorgasbord of efficient approximation methods69,70 for the dominated hypervolume has been developed.
In this paper we make use of extended similarity indices to calculate and track the internal similarity of evolutionary populations. Extended similarity metrics, which compare a stack of bitvectors, have the advantage71 that they do not require the full similarity matrix of the compound pool or aggregate metric. In addition to being more efficient, extended similarity metrics reduce to the traditional binary similarity metrics if applied to a set of two molecules. According to computational experiments, two newly proposed extended similarity metrics72 are highly advantageous compared to the extended Tanimoto similarity: the extended Baroni–Urbani–Buser similarity index and the extended faith similarity index. Throughout this paper will make use of the extended faith similarity index.
Based on previous work comparing single objective optimisation methods, we choose GB-EPI (with geometric mean as surrogate fitness function) as a representative baseline to compare against NSGA-II and NSGA-III. For GB-EPI, we choose four medicinally relevant features of interest to span the archive: molecular weight (ranged from 140 to 555), log(P) (0.0 to 7.0), TPSA (0 to 140), and molar refractivity (40 to 130). For fair comparison, molecules exceeding these ranges are excluded from the evolutionary populations of NSGA-II and NSGA-III during the benchmarks. Based on previous experience with GB-EPI, the archive size for was set to 150 and the batch size to 20. The archive size in quality-diversity algorithms, such as GB-EPI, is the counterpart of the population size in traditional genetic algorithms. In general, the batch size refers to the amount of molecules submitted to mutation and crossover per generation. For NSGA-II, we used a population size of 100 (corresponding to the initial population) and a batch size of 20. For NSGA-III, we used the same batch size but experimentation guided us towards a smaller total evolutionary population: we settled on the use 25 reference directions, and a population size of 35 molecules. These hyperparameters were chosen to support global performance of each individual algorithm without disrupting splitting procedures, as a consequence the amount of fitness calls varies across algorithms and generations.
In Fig. 5 the evolution of the dominated hypervolume, maximum geometric mean and internal similarity of the NSGA-II, NSGA-III, and GB-EPI algorithms is shown for two representative benchmarks (cobimentib and fexofenadine). Throughout the computational experiments GB-EPI, which optimises directly for the geometric mean, is used as a baseline comparison method. As expected, NSGA-II and NSGA-III successfully out-compete the GB-EPI baseline in terms of dominated hypervolume for both benchmarks. In contrast to GB-EPI, the NSGA algorithms are designed specifically to optimise the Pareto front, the quality of which is measured by the dominated hypervolume. The geometric mean follows trends similar to the dominated hypervolume in the benchmarks. However, the values of the maximal geometric mean lie close to each other and the 95% confidence interval of GB-EPI overlaps with NSGA-II and NSGA-III during the latter stages of the cobimentib task.
An overview of the results for the multi-objective benchmarks is shown in Table 2 in terms of averages and standard deviations. NSGA-II and NSGA-III perform better than the baseline on each of the benchmarks for both dominated hypervolume and maximum geometric mean with the exception of the antipsychotics task. In that task, similarity between the three receptor targets disadvantages NSGA-III due to its rigid reference directions. For the fexofenadine and pioglitazone benchmarks, GB-EPI lies within one standard deviation of either NSGA-II or NSGA-III for both metrics. Note that to obtain the global maximum geometric mean of these benchmarks or the global optimum of one of the objectives, direct optimisation should be used. In principle, Pareto optimisation algorithms should reach these types of global optima, but significantly less efficiently as the evolutionary population is spread out over objective space. Conversely, when using a single aggregation function, the solutions tend to lie close to each other in objective space, and don't cover the entirety of the Pareto front.
Algorithm | Task | Dominated hypervolume | Geometric mean | Internal similarity | Fitness calls (cumulative) |
---|---|---|---|---|---|
GB-EPI | |||||
Cobimetinib | 0.77 ± 0.05 | 0.93 ± 0.01 | 0.50 ± 0.00 | 13577 ± 1224 | |
Fexofenadine | 0.67 ± 0.07 | 0.87 ± 0.03 | 0.50 ± 0.00 | 17985 ± 1398 | |
Osimertinib | 0.54 ± 0.04 | 0.85 ± 0.01 | 0.50 ± 0.00 | 12982 ± 1351 | |
Pioglitazone | 0.98 ± 0.04 | 0.99 ± 0.01 | 0.50 ± 0.00 | 13160 ± 3104 | |
Ranolazine | 0.46 ± 0.04 | 0.81 ± 0.02 | 0.50 ± 0.00 | 16859 ± 1537 | |
DAP kinases | 0.03 ± 0.05 | 0.46 ± 0.06 | 0.51 ± 0.00 | 23545 ± 3150 | |
Antipsychotics | 0.09 ± 0.02 | 0.57 ± 0.06 | 0.51 ± 0.00 | 21905 ± 3073 | |
NSGA-II | |||||
Cobimetinib | 0.94 ± 0.02 | 0.94 ± 0.01 | 0.51 ± 0.00 | 17784 ± 1753 | |
Fexofenadine | 0.78 ± 0.10 | 0.92 ± 0.04 | 0.52 ± 0.00 | 20268 ± 2909 | |
Osimertinib | 0.66 ± 0.03 | 0.89 ± 0.01 | 0.52 ± 0.00 | 16848 ± 2655 | |
Pioglitazone | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.51 ± 0.00 | 19944 ± 4765 | |
Ranolazine | 0.68 ± 0.06 | 0.87 ± 0.02 | 0.51 ± 0.00 | 21259 ± 2181 | |
DAP kinases | 0.05 ± 0.03 | 0.50 ± 0.07 | 0.52 ± 0.00 | 24350 ± 3826 | |
Antipsychotics | 0.08 ± 0.03 | 0.50 ± 0.05 | 0.51 ± 0.00 | 21246 ± 1909 | |
NSGA-III | |||||
Cobimetinib | 0.92 ± 0.03 | 0.93 ± 0.02 | 0.51 ± 0.00 | 14224 ± 1807 | |
Fexofenadine | 0.79 ± 0.00 | 0.91 ± 0.03 | 0.52 ± 0.01 | 12950 ± 2326 | |
Osimertinib | 0.66 ± 0.03 | 0.89 ± 0.01 | 0.52 ± 0.00 | 11052 ± 2337 | |
Pioglitazone | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.51 ± 0.01 | 10639 ± 2736 | |
Ranolazine | 0.63 ± 0.06 | 0.85 ± 0.02 | 0.51 ± 0.00 | 17949 ± 2732 | |
DAP kinases | 0.04 ± 0.02 | 0.48 ± 0.07 | 0.51 ± 0.01 | 22454 ± 3440 | |
Antipsychotics | 0.05 ± 0.03 | 0.49 ± 0.04 | 0.52 ± 0.01 | 32991 ± 3473 |
To study the comparative efficiency of each algorithm, we track the cumulative number of function calls over the full 150 generations for the twenty individual runs of each algorithm. This has the advantage that it does not interrupt the splitting front procedure, as might be the case when working with a fixed and limited function call budget. An overview of the mean and standard deviation of the cumulative fitness calls of each algorithm is shown in Table 2. NSGA-III consistently outperforms NSGA-II in terms of efficiency, and is more efficient than GB-EPI in all benchmarks where they have similar performance for dominated hypervolume and geometric mean. In contrast to single objective optimisation problems, where a lower internal similarity has been regarded as beneficial, for multi-objective optimisation the algorithms which encourage greater internal similarity are better performing.
The performance of NSGA-II and NSGA-III for graph-based optimisation of molecules is encouraging. Both algorithms specialise in finding the optimal Pareto front and our benchmarks show that this approach is superior compared to GB-EPI (which optimises the geometric mean directly). In line with analyses of purely numerical benchmarks found in the literature, NSGA-III does not always outperform NSGA-II in our chemical benchmarks, indicating that the two algorithm produce similar results according to this metric. Throughout all the benchmarks presented in this paper however, NSGA-III seems to be the most efficient in its use of function calls. Notably, and in contrast to single objective optimisation, the higher performing algorithms NSGA-II and NSGA-III have a higher and faster increasing internal similarity in their evolutionary populations than the baseline.
The above discussed efficiency, performance, and flexibility of the graph-based implementations of NSGA-II and NSGA-III for small molecule multi-objective optimisation as provided with this paper, allows the community to use these algorithms for practical use. In addition, these implementations can be used as future baselines and as starting points for future developments in this field. One such possible development would be to further reduce the amount of function calls through the use of contextual multi-armed bandits,74 or Gaussian processes75 to prune the amount of molecules presented to the evaluation step of the algorithms. Finally, the algorithms presented here can be integrated into the workflow for multi-objective tasks given to self-driving laboratories76 or other set-ups making use of active learning.77
Footnote |
† Electronic supplementary information (ESI) available. See https://doi.org/10.1039/d2sc00821a |
This journal is © The Royal Society of Chemistry 2022 |