AkshatKumar Nigam,‡abc Robert Pollice‡bc and Alán Aspuru-Guzik*bcde
aDepartment of Computer Science, Stanford University, USA
bDepartment of Computer Science, University of Toronto, Canada. E-mail: alan@aspuru.com
cDepartment of Chemistry, University of Toronto, Canada
dVector Institute for Artificial Intelligence, Toronto, Canada
eLebovic Fellow, Canadian Institute for Advanced Research (CIFAR), 661 University Ave, Toronto, Ontario M5G, Canada
First published on 3rd May 2022
Inverse molecular design involves algorithms that sample molecules with specific target properties from a multitude of candidates and can be posed as an optimization problem. High-dimensional optimization tasks in the natural sciences are commonly tackled via population-based metaheuristic optimization algorithms such as evolutionary algorithms. However, the often unavoidable expense of property evaluation can limit the widespread use of such approaches, as the associated cost can become prohibitive. Herein, we present JANUS, a genetic algorithm inspired by parallel tempering. It propagates two populations, one for exploration and one for exploitation, improving optimization by reducing the number of property evaluations. JANUS is augmented by a deep neural network that approximates molecular properties and relies on active learning for enhanced molecular sampling. It uses the SELFIES representation and the STONED algorithm for the efficient generation of structures, and outperforms other generative models in common inverse molecular design tasks, achieving state-of-the-art target metrics across multiple benchmarks. As neither most of the benchmarks nor the structure generator in JANUS account for synthesizability, a significant fraction of the proposed molecules is synthetically infeasible, demonstrating that this aspect needs to be considered when evaluating the performance of molecular generative models.
Herein, we present JANUS, a genetic algorithm (GA) based on the SELFIES representation of molecules.6,7 It utilizes STONED8 for molecular generation, obviating structural validity checks. Inspired by parallel tempering,9,10 JANUS maintains two populations that exchange members and use distinct genetic operations. One population conducts a local (exploitative) search of the chemical space and uses molecular similarity as selection pressure. The other performs a global exploration and, optionally, uses a deep neural network (DNN) as selection pressure. While not requiring domain knowledge, we added the option to extract structure derivation rules from user-provided sets of molecules automatically. Comprehensive benchmarking shows that JANUS outperforms other generative models significantly in both fitness and number of fitness evaluations.
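The two-population scheme can be sketched in a few lines. The following is a toy illustration with bit strings, a made-up fitness, and illustrative mutation and exchange rules, not the actual JANUS operators:

```python
import random

random.seed(0)

def evolve(fitness, seed_pop, generations=30, pop_size=20):
    """Toy two-population GA: an explorative population takes large random
    jumps, an exploitative one takes small local steps, and the best member
    of each population is exchanged with the other every generation."""
    explore, exploit = list(seed_pop), list(seed_pop)

    def mutate(s, strength):
        # Stand-in mutation: flip `strength` random characters of a bit string.
        s = list(s)
        for _ in range(strength):
            i = random.randrange(len(s))
            s[i] = "1" if s[i] == "0" else "0"
        return "".join(s)

    for _ in range(generations):
        # Explorative population: aggressive (3-bit) mutations.
        explore = sorted({mutate(p, 3) for p in explore} | set(explore),
                         key=fitness, reverse=True)[:pop_size]
        # Exploitative population: local (1-bit) mutations.
        exploit = sorted({mutate(p, 1) for p in exploit} | set(exploit),
                         key=fitness, reverse=True)[:pop_size]
        # Exchange of members between the populations.
        explore.append(exploit[0])
        exploit.append(explore[0])
    return max(explore + exploit, key=fitness)

# Maximize the number of "1" bits, starting from the all-zero string.
best = evolve(lambda s: s.count("1"), ["0" * 16])
```

The exploitative population performs a hill-climbing-like local search while the explorative one maintains diversity; the exchange step lets good solutions from either regime seed the other, mirroring replica exchange in parallel tempering.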
GAs are population-based metaheuristic optimization algorithms and have a long-standing track record in the natural sciences for tackling complicated design problems.18,19 To the best of our knowledge, the first use of GAs for non-sequence-based de novo molecule design documented in the literature dates back to 1993. Two independent papers presented at the Molecular Graphics Society Meeting on Binding Sites at the University of York by researchers from the pharmaceutical industry described the application of GAs to problems related to drug design.20 One implementation made use of the SMILES molecular representation21,22 and was perhaps the first work to describe both the rediscovery of known molecules based on a molecular similarity metric and the use of a simplified ligand docking approach as benchmarks for molecular design.20 The other implementation demonstrated the optimization of molecular physical properties and was perhaps the first to use logP optimization to probe the performance of molecular generative models.20
In subsequent years, several implementations of GAs for molecular design in medicinal chemistry were published using various molecular representations and algorithms to perform structure generation. In 1995, PRO_LIGAND provided the first detailed description of a GA for molecular design to appear in a scientific journal. Notably, it only made use of crossover operations. The algorithm identified suitable single bonds to break two parent molecules into two fragments each and subsequently recombined the fragments randomly.23 Importantly, the 3D structure of the molecule was used directly as representation, and the method was used to find mimics of the known drugs distamycin and methotrexate by comparing structural features.23 Almost simultaneously, another GA termed Chemical Genesis was described that used diverse mutation and crossover operations.24 Again, the 3D molecular structure was directly manipulated during molecular generation. Additionally, the algorithm implemented several molecular properties such as molecular weight, estimated logP and steric strain energy per atom as scalar constraints, allowing reasonable drug-like structures to be distinguished from unreasonable ones.24 Subsequently, this approach was tested by designing ribose mimics and ligands for dihydrofolate reductase.24 Only a few years later, around the turn of the millennium, GAs gained significant popularity as molecular generative models, and many implementations and case studies were published. Besides using the 3D structure of molecules directly for structure manipulation,25 molecular graphs,26–28 molecular strings,29 molecular fragments30–32 and chemical reactions33 were also employed. Importantly, all these approaches defined hand-crafted rules for mutation and crossover, ensuring that only valid structures are generated.
Early GA-based molecular design algorithms were compared to alternative approaches for drug design in 2005,34 and, over the years, as a host of new methods appeared, more reviews and book chapters were composed providing a comprehensive overview of the area.35–40 Accordingly, GAs have been applied to molecular design for decades, and they remain one of the most popular metaheuristic optimization algorithms for that purpose. The main difference between previously published approaches lies in the genetic operators, particularly mutation and crossover. Most prevalent are expert rules to ensure validity. Some recent examples of GAs implementing such expert rules include GB-GA,41 CReM,42 EvoMol43 and Molfinder.44 Alternatively, the robustness of SELFIES was exploited in GA + D,7 allowing mutations to be implemented as random string modifications. This obviates hand-crafted structure modification rules. In addition, GA + D makes use of a DNN discriminator driving molecular generation towards a reference distribution, or away from it. An alternative approach in the family of metaheuristic optimization algorithms is MSO,45 which uses particle swarm optimization in the continuous latent space of a VAE. Furthermore, mmpdb46 implements matched molecular pair analysis (MMPA47) and relies on structure modification rules explicitly derived from a dataset of molecule pairs. As the derived rules are applied stochastically, mmpdb can be considered a metaheuristic optimization algorithm.
Starting from two molecules, an apt crossover resembles both parents. Hence, we form multiple paths between the two molecules (obtained via random string modifications of SELFIES, ESI Fig. 2(b)†), and select structures with high joint similarity (cf. Methods). Consideration of multiple SELFIES representations, multiple string modifications and the formation of numerous paths leads to the generation of many potential children. Importantly, GA + D does not use crossover operators.7
J(m) = logP(m) − SAscore(m) − RingPenalty(m)    (1)
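Equation (1) amounts to a simple composition of three property terms. A minimal sketch, in which `logp`, `sa_score` and `ring_penalty` are injected stand-ins for the RDKit-based implementations used in the benchmark:

```python
def penalized_logp(mol, logp, sa_score, ring_penalty):
    """Fitness J(m) from eqn (1): reward lipophilicity while penalizing
    synthetic inaccessibility and the presence of large rings."""
    return logp(mol) - sa_score(mol) - ring_penalty(mol)

# Toy usage with constant stand-in property functions.
j = penalized_logp("CCO",
                   logp=lambda m: 1.2,
                   sa_score=lambda m: 0.5,
                   ring_penalty=lambda m: 0.0)
print(j)  # 0.7
```

Injecting the three terms as callables keeps the fitness definition separate from the property oracles, which is convenient when the oracles are expensive and need caching.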
The corresponding results are summarized in Table 1. We ran JANUS with four variations of selection pressure: without any additional selection pressure in the exploration population, i.e., randomly sampling the overflow (“JANUS”), with selection pressure from a DNN predictor trained to predict the molecular fitness (“JANUS + P”), and with selection pressure from a DNN classifier trained to distinguish between high- and low-fitness molecules (“JANUS + C”). For the classifier, we compare classifying the top 50% of all molecules as high fitness (“JANUS + C (50%)”) against classifying only the top 20% as such (“JANUS + C (20%)”). The optimization progress of JANUS with these four variations of selection pressure is depicted in Fig. 2.
Algorithm | Average of best | Single best | Algorithm | Average of best | Single best
---|---|---|---|---|---
GVAE48 | 2.87 ± 0.06 | — | GA + Da7 | 13.31 ± 0.63 | 14.57
SD-VAE49 | 3.60 ± 0.44 | — | GB-GAa41 | 15.76 ± 5.76 | —
ORGAN50 | 3.52 ± 0.08 | — | EvoMola43 | 17.71 ± 0.41 | 18.59
CVAE + BO51 | 4.85 ± 0.17 | — | GA + Db7 | 20.72 ± 3.14 | 23.93
JT-VAE52 | 4.90 ± 0.33 | — | GEGLc53 | 31.40 ± 0.00 | 31.40
ChemTS54 | 5.6 ± 0.5 | — | | |
GCPN55 | 7.87 ± 0.07 | — | JANUSd | 18.4 ± 4.4 | 20.96
MRNN56 | — | 8.63 | JANUS + Pd | 21.0 ± 1.3 | 21.92
MolDQN57 | — | 11.84 | JANUS + C (50%)d | 23.6 ± 6.9 | 34.04
GraphAF58 | — | 12.23 | JANUS + C (20%)d | 21.9 ± 0.0 | 21.92

a Average of 10 separate runs with 500 molecules of up to 81 SMILES characters per generation and 100 generations.
b Average of 5 separate runs with 500 molecules of up to 81 SMILES characters per generation and 1000 generations.
c Average of 5 separate runs with 16384 molecules of up to 81 SMILES characters per generation and 200 generations.
d Average of 15 separate runs with 500 molecules of up to 81 SMILES characters per generation and 100 generations.
As seen in Table 1, JANUS outperforms all alternative approaches in terms of the single best molecule. GA + D merely reaches similarly high results after 10 times the number of generations. Only GEGL performs better on average than JANUS, but with double the number of generations and more than 30 times the number of structures per generation. While the average of the maximum penalized logP across independent runs does not show a large difference after 100 generations for the various selection pressure settings, both the optimization trajectory (cf. Fig. 2(a)) and the single best performing molecules found are significantly impacted. Using selection pressure via the DNN predictor (“JANUS + P”) or the DNN classifier (“JANUS + C”) results in populations with higher median fitness than without additional selection pressure (“JANUS”) (cf. Fig. 2(b)). Furthermore, selection pressure from the DNN predictor effects a fast fitness increase as the predictor learns the characteristics of molecules with a high penalized logP score. The respective runs converge to the best-performing linear alkane with 81 carbon atoms, which corresponds to a local optimum. When selection pressure is applied via a DNN classifier, the optimization progress shows a large dependence on the fraction of molecules that are considered top-performing. When only the top 20% are considered top-performing, the respective runs converge to the best-performing linear alkane even faster than with the DNN predictor. However, the narrow fitness intervals of the generations indicate limited exploration of the chemical space, leading to convergence to the same local optimum. When the top 50% are considered top-performing, exploration is significantly enhanced, resulting in larger fitness variations across runs. This indicates that tuning the classifier hyperparameters can be used to switch between exploration and exploitation.
Finally, we also compare the number of property evaluations needed by JANUS and other models to reach three threshold fitness values (cf. Table 2). These results show that JANUS with the most exploitative classifier requires the lowest number of evaluations to reach penalized logP values of 10, 15 and 20, respectively. The reduction in evaluations relative to JANUS without additional selection pressure corresponds to approximately 80%. We emphasize that, as many of the other baselines were taken directly from the literature, detailed information about the number of property evaluations and the optimization trajectories was unavailable. We believe that it will be important for future comparisons of molecular generative models to provide detailed information about the number of property evaluations and the corresponding optimization trajectories. Nevertheless, among all the results and methods provided in the right column of Table 1, which correspond to the better-performing ones, JANUS uses the lowest total number of property evaluations to reach the corresponding results. However, despite a significant improvement, even the most aggressive method of selection pressure used requires 3500 property evaluations to surpass a penalized logP score of 10. This suggests that further work is needed for JANUS to be feasible for the direct optimization of more expensive molecular properties.
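Counting unique property evaluations, as compared across methods above, can be implemented by wrapping the fitness oracle in a cache. A sketch of ours (the `CountedFitness` wrapper is illustrative, not part of JANUS):

```python
class CountedFitness:
    """Wrap an expensive property evaluator, caching repeated queries and
    counting how many genuinely new evaluations were spent."""

    def __init__(self, fn):
        self.calls = 0       # number of distinct (uncached) evaluations
        self._cache = {}
        self._fn = fn

    def __call__(self, mol):
        if mol not in self._cache:
            self.calls += 1
            self._cache[mol] = self._fn(mol)
        return self._cache[mol]

# Toy usage: `len` stands in for an expensive property oracle.
f = CountedFitness(len)
for m in ["CCO", "CCO", "CCC", "CCO"]:
    f(m)
print(f.calls)  # 2 unique evaluations
```

Because GAs frequently regenerate previously seen molecules, caching makes the reported evaluation counts reflect the true oracle cost rather than the raw number of fitness calls.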
To the best of our knowledge, JANUS identified the best-performing molecules ever found by molecular design algorithms that fulfill the 81 SMILES character limit, even outperforming GEGL. When JANUS (specifically “JANUS + C (50%)”) was allowed to run for more than 100 generations, a penalized logP value of 36.62 was achieved at generation 118 in one particular run. The fitness did not increase thereafter, but several isomers with identical fitness were discovered (cf. Fig. 3). It has been claimed that the unconstrained penalized logP maximization task is not a useful benchmark as it has trivial solutions, namely molecules with ever longer chains.64 However, that is only true without enforcing a SMILES character limit. With the 81 character limit, the trade-off between logP and SAscore results in highly non-trivial solutions, as evident from Fig. 3. Importantly, these structures are non-intuitive to design as the exact placement of the non-sulfur atoms in the chain affects the fitness considerably. Notably, JANUS even outperforms the human design demonstrated by Nigam et al.,7 a linear chain of sulfur atoms terminated by thiol groups with a penalized logP score of 31.79. In the ESI,† we also ran the benchmark where penalized logP scores are improved for select molecules under the constraint of keeping the similarity to the initial structure high (cf. ESI Section S10†).
Fig. 3 Molecules discovered by JANUS with the highest penalized logP score of 36.62 that are within the 81 SMILES character limit.
Method | Success (A/B/C/D) | Novelty (A/B/C/D) | Diversity (A/B/C/D)
---|---|---|---
JTVAE52 | 32.2% / 23.5% / 3.3% / 1.3% | 11.8% / 2.9% / 7.9% / — | 0.901 / 0.882 / 0.883 / —
GCPN55 | 42.4% / 32.3% / 3.5% / 4.0% | 11.6% / 4.4% / 8.0% / — | 0.904 / 0.884 / 0.874 / —
GVAE-RL62 | 33.2% / 57.7% / 40.7% / 2.1% | 76.4% / 62.6% / 80.3% / — | 0.874 / 0.832 / 0.783 / —
REINVENT63 | 99.3% / 98.5% / 97.4% / 47.9% | 61.0% / 31.6% / 39.7% / 56.1% | 0.733 / 0.729 / 0.595 / 0.621
RationaleRL62 | 100% / 100% / 100% / 74.8% | 53.4% / 46.2% / 97.3% / 56.8% | 0.888 / 0.862 / 0.824 / 0.701
GA + Da64 | 84.6% / 52.8% / 84.7% / 85.7% | 100.0% / 98.3% / 100% / 100% | 0.714 / 0.726 / 0.424 / 0.363
JANUS (no fragments)b | 90.6% / 86.4% / 90.4% / 90.2% | 57.9% / 15.9% / 74.9% / 22.8% | 0.850 / 0.807 / 0.681 / 0.728
JANUSb | 100% / 100% / 100% / 100% | 80.9% / 40.6% / 77.9% / 32.4% | 0.821 / 0.894 / 0.876 / 0.831
JANUS + Pb | 100% / 100% / 100% / 100% | 84.1% / 43.1% / 78.8% / 17.4% | 0.881 / 0.883 / 0.857 / 0.822
JANUS + C (50%)b | 100% / 100% / 100% / 100% | 82.9% / 40.4% / 74.4% / 18.4% | 0.884 / 0.894 / 0.877 / 0.841

a Result obtained from using 500 molecules per generation and 278 generations in total.
b Result obtained from using 500 molecules per generation and up to 100 generations in total.
As this benchmark provides a training dataset, we initiated JANUS with these known inhibitors. Additionally, we derived mutation rules from the reference molecules (cf. ESI Fig. 1†). Importantly, this is not strictly necessary, and the benchmarks can also be run without these custom genetic operators (cf. “JANUS (no fragments)” in Table 3). For these tasks, the fitness is a binary function assigning a value of 1 to molecules fulfilling all constraints. Our results are summarized in Table 3. JANUS achieves perfect, i.e., state-of-the-art, performance for all four tasks in terms of success. While the diversity is also high and comparable to literature baselines, novelty is significantly lower for some of the tasks. In particular, JANUS shows a significantly decreased novelty when drug-likeness and synthesizability are included in the objective, whereas diversity does not seem to be impacted (cf. Task D in Table 3). This suggests that exploration is considerably reduced with JANUS when synthesizability and other structural constraints are incorporated, which is not surprising. We believe that this can be improved by running JANUS for more generations, by implementing a discriminator into JANUS,7 and by seeding the algorithm with methane. Without the incorporation of mutation rules (cf. “JANUS (no fragments)” in Table 3), performance is slightly worse, which indicates that derived mutation rules can bias the results towards the provided set of molecules. However, it also suggests that the classifier is more likely to accept molecules sufficiently close to the training dataset. We were again interested in comparing the number of property evaluations needed by each method to obtain the corresponding results. However, as all the literature baselines were taken directly from other work, we could not find the corresponding numbers for most of them.
Nevertheless, we can say that JANUS (≤50000 property evaluations) needs a significantly lower number of property evaluations to reach higher success rates in the imitated inhibition benchmarks than GA + D (139000 property evaluations).
Interestingly, we observe that many molecules passing the constraints are very large and thus extremely hard to synthesize, indicating both problematic biases in some of these benchmark tasks and the fact that the current version of JANUS does not account for synthesizability during structure generation. A subset of the corresponding structures generated by JANUS using the classifier for additional selection pressure is depicted in ESI Fig. 9–12.† To investigate the synthesizability of the generated structures in this set of benchmarks more systematically, we compared the corresponding SYBA score72 distributions with those of the reference molecules62 taken from the ChEMBL database.69,70 The respective histograms are depicted in Fig. 4. We observe that the generated molecules have systematically lower SYBA scores than the reference, which indicates that these structures are more likely to be hard to synthesize.72 Importantly, using additional selection pressure from the classifier tends to lead to molecules with very low SYBA scores (cf. Fig. 4(a–c)). However, when the SAscore is used explicitly as a constraint to be passed in the benchmark, selection pressure through the classifier actually tends to provide molecules somewhat more likely to be easy to synthesize compared to the other selection pressure variations (cf. Fig. 4(d)). This demonstrates the importance of incorporating synthesizability evaluations explicitly in molecular design benchmarks when the molecular design algorithm employed does not account for it. Furthermore, our results show that a significant fraction of the proposed structures have synthesizabilities comparable to a significant fraction of the reference molecules. In addition to the results discussed above, we also compared histograms for the SAscore,67 the SCScore,73 and the RAscore74 based on the same structures; they are provided in ESI Fig. 13–15.† The corresponding results agree qualitatively very well with those based on the SYBA score. As can be seen in ESI Fig. 9–12,† visual inspection of a subset of the structures generated by JANUS using the classifier for additional selection pressure confirms that the inclusion of the SAscore in this benchmark task leads to more synthetically feasible molecules. This demonstrates that incorporating synthesizability in the fitness function can compensate for a molecular design algorithm that does not account for it. Hence, we suggest that synthesizability should be compared explicitly in the imitated inhibition benchmark task in the future.
Method | 5HT1B | 5HT2B | ACM2 | CYP2D6
---|---|---|---|---
ZINC (10%) | −9.894 (0.862) | −9.228 (0.851) | −8.282 (0.860) | −8.787 (0.853)
ZINC (1%) | −10.496 (0.861) | −9.833 (0.838) | −8.802 (0.840) | −9.291 (0.894)
Train (10%) | −10.837 (0.749) | −9.769 (0.831) | −8.976 (0.812) | −9.256 (0.869)
Train (1%) | −11.493 (0.849) | −10.023 (0.746) | −10.003 (0.773) | −10.131 (0.763)
CVAE66 | −4.647 (0.907) | −4.188 (0.913) | −4.836 (0.905) | — (—)
GVAE48 | −4.955 (0.901) | −4.641 (0.887) | −5.422 (0.898) | −7.672 (0.714)
REINVENT63 | −9.774 (0.506) | −8.657 (0.455) | −9.775 (0.467) | −8.759 (0.626)
GA + Da7 | −8.3 ± 0.5 (0.123) | −8.1 ± 0.9 (0.122) | −7.9 ± 0.3 (0.136) | −8.3 ± 0.5 (0.149)
JANUSa | −9.6 ± 0.9 (0.126) | −9.8 ± 0.7 (0.133) | −8.1 ± 0.5 (0.112) | −9.1 ± 0.4 (0.166)
JANUS + Pa | −9.9 ± 0.9 (0.132) | −9.8 ± 1.5 (0.166) | −8.0 ± 0.5 (0.125) | −9.3 ± 0.6 (0.194)
JANUS + C (50%)a | −13.8 ± 0.5 (0.366) | −13.8 ± 0.4 (0.331) | −9.9 ± 0.3 (0.235) | −11.7 ± 0.4 (0.363)

a Result obtained from using 500 molecules per generation and 25 generations in total.
The results show that JANUS readily proposes molecules with favorable docking scores for each target. In particular, when selection pressure is applied via a classifier, it outperforms REINVENT in terms of docking scores for all targets, only slightly for ACM2 but more significantly for all others. However, the diversities are significantly lower, leaving room for improvement. Notably, we limited these runs to 25 generations, which likely contributes to the low diversities. We observe that selection pressure from the DNN classifier yields both significantly lower (i.e., better) docking scores and higher diversities, demonstrating its superiority over the other selection pressure variations. Both JANUS and GA + D use 12500 property evaluations to reach their respective docking scores, which implies that JANUS needs fewer evaluations to reach the same docking scores as GA + D. However, when property evaluations are more time-intensive than molecular docking, e.g., when binding free energies are simulated via molecular dynamics, 12500 is still a prohibitively high number of evaluations. Accordingly, we believe that further improvements are needed in future work. Notably, as all other literature baselines were taken directly from other work, we could not find the corresponding numbers of property evaluations used.
Moreover, we observed that a significant fraction of the proposed molecules contain macrocyclic moieties. This is particularly relevant as the docking algorithm used in this benchmark cannot sample the conformations of macrocycles properly, making the corresponding scores unreliable.79 Finally, the molecules produced by JANUS contain several structural features that would be infeasible in real drugs, such as extended cumulenes or triple bonds inside small rings (cf. ESI Fig. 20†). Notably, this could be regarded as a problem of JANUS, as it does not propose synthesizable structures by construction, in contrast to alternative molecular generative models. However, JANUS is designed for the fitness function alone to guide the molecular design process. Hence, when stability or synthesizability are not incorporated in the fitness function, JANUS has no incentive to produce structures with these properties. When they are incorporated, as demonstrated in the imitated inhibition benchmarks, our results suggest that JANUS will propose structures with a high likelihood of being synthesizable. Thus, the problem can be attributed either to the molecular design task definition or to the molecular generative model; these are two sides of the same coin. To generate more synthesizable structures, either the algorithm is made to inherently produce synthesizable structures, or the fitness function is modified to incorporate synthetic accessibility metrics, or both. Both approaches are a priori equally valid routes to the same goal. Nevertheless, we think that incorporating metrics to assess stability or synthesizability in molecular design benchmarks is beneficial regardless, as it provides a better evaluation of the quality of proposed structures and allows for a more direct comparison of molecular generative models that account for synthesizability in the design process against ones that do not.
Thus, we believe that future molecular design benchmarks should be developed in that context. However, the original set of benchmarks did not incorporate an evaluation of synthesizability, which would be required to that end.78 This complicates the comparison of our results to alternative methods, as it has been demonstrated to be harder to find molecules with both good docking scores and high synthesizability than molecules fulfilling only the former requirement.80 Importantly, this is a more general problem in the comparison of generative models, as some of them have implicit structural constraints and biases for molecule generation that are largely absent in approaches like JANUS when not incorporated explicitly.
To follow synthesizabilities systematically along the optimizations, we computed four distinct metrics, namely the SAscore,67 SCScore,73 SYBA72 and RAscore,74 for all molecules generated in each generation, and compared them to the corresponding values of the reference molecules taken from the ChEMBL database69,70 provided by the benchmark developers.78 The corresponding results are depicted in Fig. 6 and ESI Fig. 16.† Additionally, we inspected the histograms of these four synthesizability metrics based on the 250 best molecules in each individual run, which is precisely the subset of generated structures used to derive the benchmark scores provided in Table 4. They are shown in Fig. 7 and ESI Fig. 17–19.†
Importantly, these four metrics do not quantify the same aspect of synthesizability, which is directly reflected in our results. SYBA is solely based on fragment contributions determining whether a given molecule is hard or easy to synthesize.72 The SAscore relies on a combination of conceptually comparable fragment contributions and a penalty accounting for, among others, the presence of uncommon moieties, the presence of large rings, and molecule size.67 The SCScore is designed to provide a measure of the estimated number of synthetic steps needed to make a molecule.73 The RAscore predicts whether or not AiZynthfinder,81 an open-source program for retrosynthesis, will be able to generate a synthesis plan for the structure in question and, hence, is directly related to synthetic feasibility. Both SYBA (cf. Fig. 6 and 7) and the SAscore (cf. ESI Fig. 16 and 17†) provide qualitatively equivalent results suggesting that the structures generated by JANUS are significantly less likely to be synthesizable than the reference molecules. Additionally, during the optimization, the molecules tend to become increasingly less likely to be synthetically accessible. While the RAscore (cf. ESI Fig. 16 and 19†) shows a similar trajectory for the molecules proposed by JANUS, i.e., it becomes less likely that AiZynthfinder finds a synthesis plan for them, a considerable subset of the reference compounds taken from the ChEMBL database, which have all been synthesized before,69,70 is recognized as not retrosynthesizable by AiZynthfinder. Unexpectedly, the SCScore (cf. Fig. 6 and ESI Fig. 18†) gives somewhat different results, suggesting that the structures generated by JANUS are lower in synthetic complexity at the outset than the reference structures. Nevertheless, the trend of increasing synthetic complexity during the optimization is still reproduced, but the new molecules only reach complexities similar to those of the reference molecules.
Importantly, the results of the SYBA, SAscore and RAscore metrics agree well with our visual assessment of samples of the corresponding structures (cf. ESI Fig. 20†), suggesting that they are of poor quality and emphasizing the importance of explicitly accounting for synthesizability, either in the optimization target or in the molecular design algorithm. Accordingly, we propose to refine these docking benchmarks by incorporating the SYBA, SAscore or RAscore metrics in the final benchmark metric, or by using stability filters to assess the generated structures, so that the proposed molecules are overall more stable and have a higher chance of being synthetically accessible.
The synthetic accessibility score (SAscore)67 was evaluated using the implementation in RDKit.84 The synthetic complexity score (SCScore)73 was evaluated using the standalone importable model, relying on the numpy package and employing the Morgan fingerprint of radius 2 (ref. 85) folded to a length-1024 bit vector as input, that is provided in the corresponding GitHub repository (cf. https://github.com/connorcoley/scscore, master branch). The synthetic Bayesian accessibility (SYBA) metric72 was evaluated by fitting the default score based on the dataset provided in the corresponding GitHub repository via the ‘fitDefaultScore()’ function of the ‘SybaClassifier’ object (cf. https://github.com/lich-uct/syba, master branch). The retrosynthetic accessibility score (RAscore)74 was evaluated based on the Tensorflow-based model provided in the corresponding GitHub repository (cf. https://github.com/reymond-group/RAscore, master branch).
Mutation and crossover are based on the STONED algorithm8 (cf.https://github.com/aspuru-guzik-group/stoned-selfies). Mutations are performed using both the SMILES21,22 and SELFIES6 molecular string representations. SMILES are first reordered randomly and then converted to SELFIES. The obtained SELFIES are modified randomly via string point mutations. The resulting SELFIES after mutation represent the mutant structure. Crossover between two molecules is performed by interpolation between the corresponding SELFIES. Thus, paths are generated by successively matching characters of the two SELFIES and changing the character of one to the respective character of the other until completion. Multiple paths between two molecules are generated by changing the order of character matching and by converting the SELFIES of both molecules to SMILES, reordering the SMILES randomly and then converting them back to SELFIES. The final molecule accepted as crossover is chosen by maximizing the joint similarity of the generated structures to the initial structures. The input molecules for the crossover operation are kekulized prior to path formation. When performing mutation and crossover in JANUS, we ensure that the new molecule is different from any of the initial structures.
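The path formation described above can be sketched at the token level. The following is a toy version operating on equal-length token lists instead of real SELFIES (the actual implementation uses the STONED code referenced above, which also handles parents of different lengths):

```python
import random

random.seed(1)

def crossover_paths(parent_a, parent_b, n_paths=5):
    """Generate intermediate token sequences between two parents by
    successively replacing tokens of parent_a with the corresponding tokens
    of parent_b, in a random order; every partial replacement along every
    path is collected as a potential child."""
    assert len(parent_a) == len(parent_b)
    children = set()
    for _ in range(n_paths):
        # A random replacement order defines one path between the parents.
        order = random.sample(range(len(parent_a)), len(parent_a))
        current = list(parent_a)
        for i in order[:-1]:  # stop one step before fully becoming parent_b
            current[i] = parent_b[i]
            children.add(tuple(current))
    # Children must differ from both parents.
    children.discard(tuple(parent_a))
    children.discard(tuple(parent_b))
    return children

# Toy parents: every child mixes [C] and [O] tokens, i.e. resembles both.
kids = crossover_paths(["[C]"] * 4, ["[O]"] * 4)
```

From the pool of candidates generated this way, the child maximizing the joint similarity to the parents would then be accepted as the crossover product.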
To find the best crossover molecule, we implement the joint similarity metric proposed previously.8 For the parent molecules M = {m1, m2, …}, a good crossover molecule (m) maximizes the following joint similarity:
sim_joint(m, M) = (1/|M|) Σ_{mi ∈ M} sim(m, mi) − [max_{mi ∈ M} sim(m, mi) − min_{mi ∈ M} sim(m, mi)]    (2)
Similarity between two molecules is assessed using the Tanimoto similarity between the corresponding Morgan fingerprints.85 We use the default RDKit settings with a radius of 2, folded to a 2048 bit vector.
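A sketch of the resulting scoring, with sets of on-bits standing in for Morgan fingerprints; the average-minus-spread form of the joint similarity is our reading of the metric, not the reference implementation:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def joint_similarity(fp_m, parent_fps):
    """Average similarity to all parents, penalized by the spread of the
    similarities so a candidate close to only one parent scores poorly
    (our reading of the joint similarity metric)."""
    sims = [tanimoto(fp_m, fp) for fp in parent_fps]
    return sum(sims) / len(sims) - (max(sims) - min(sims))

# Toy on-bit sets: the balanced candidate overlaps both parents equally,
# while the lopsided one is identical to a single parent.
parents = [{1, 2, 3, 4}, {3, 4, 5, 6}]
balanced = {2, 3, 4, 5}
lopsided = {1, 2, 3, 4}
print(joint_similarity(balanced, parents) > joint_similarity(lopsided, parents))
```

The spread penalty is what makes the crossover behave like a true blend of the parents rather than a near-copy of the fitter one.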
The architecture of all DNNs consists of two fully connected layers with 100 and 10 neurons, respectively. The DNN classifier is trained with the binary cross-entropy loss function, while the DNN property predictor is trained with the mean squared error loss function. Training is restricted to 4000 epochs, with a learning rate of 0.01 and a weight decay of 0.0001, using the entire set of molecules with known fitness as training set, without the use of any test set. The Adam optimizer86 is employed and PyTorch (version 1.10.2) is used as backend.87 51 molecular structural descriptors obtained from RDKit,84 which are explicitly detailed in ESI Section S5,† are used as input features. For a particular generation, the DNN is trained on all molecules with known fitness values (i.e., all molecules with successful fitness calculations from all previous generations). All molecules from the current generation with yet unknown fitness are used for inference. We observe that training of the DNN takes approximately 170.9 ± 4.9 seconds (tested on 12 Intel i7-8750H threads, at 2.20 GHz, across five independent runs) in the 5th generation of JANUS when the number of training points is 2500. In the case of the DNN classifier, the labels for all molecules are reevaluated each generation based on the corresponding user-defined percentile of the training data. Hence, molecules with a fitness larger than the corresponding percentile each generation are assigned a label of one, and all other molecules are assigned a label of zero. In this work, we tested percentiles of 0.50 (C50%, vide supra) and 0.80 (C20%, vide supra). We would like to note that, depending on the chosen percentile, this can lead to a systematic imbalance when training the DNN classifier. Thus, we recommend avoiding overly extreme percentiles.
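The per-generation relabeling of the classifier training data can be sketched as a plain percentile cut (our illustration of the scheme described above, not the JANUS code):

```python
def classifier_labels(fitnesses, percentile):
    """Assign label 1 to molecules whose fitness exceeds the given
    percentile of the current training data, 0 otherwise. The labels are
    recomputed from scratch every generation as the data grows."""
    ranked = sorted(fitnesses)
    # Simple nearest-rank percentile; a stand-in for a proper quantile.
    cut = ranked[int(percentile * (len(ranked) - 1))]
    return [1 if f > cut else 0 for f in fitnesses]

# Toy fitness values: with percentile 0.50, the top half is labeled 1.
fits = [0.1, 0.4, 0.2, 0.9, 0.7, 0.3, 0.8, 0.6, 0.5, 1.0]
labels = classifier_labels(fits, 0.50)
```

Because the cut is relative to the current population, the same molecule can flip label between generations as better molecules are discovered, which is exactly what keeps the selection pressure adaptive.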
Codes for the penalized logP experiments (Sections IVA and S6†) and the imitated inhibition task (Section IVB) were obtained from the master branches of the GitHub repositories at https://github.com/aspuru-guzik-group/GA and https://github.com/wengong-jin/multiobj-rationale, respectively. For the GuacaMol benchmarks (ESI Section S13†), we used the setup provided at https://github.com/BenevolentAI/guacamol. Docking scores in Section IVC are calculated using SMINA (Version: 9th November 2017, based on AutoDock Vina 1.1.2),88 with the default parameters of the corresponding code: https://github.com/cieplinski-tobiasz/smina-docking-benchmark.
For the penalized logP experiment from Section IVA, JANUS is seeded with a population similar to the GA + D7 implementation, i.e., methane and 20 additional molecules obtained from random SELFIES. For the docking experiment in Section IVC, we use 20 random molecules. However, instead of methane, we add n-heptane because docking with extremely small molecules like methane can give meaningless results. EvoMol43 was run using the penalized logP objective and the code from https://github.com/jules-leguy/EvoMol. Our results in Table 1 were obtained by setting the generation number to 100 (variable max_steps), the generation size to 500 (variable pop_max_size) and by replacing 400 (80%) of the least fit molecules between generations (via the parameter k_to_replace).
In the imitated inhibition experiment (Section IVB), the first five rows of Table 3 were taken from the literature and all details are described therein.62 For molecular docking (Section IVC), all results displayed in Table 4, besides JANUS and GA + D, were taken from the literature78 and all details can be found in that work. GA + D was run using the code from https://github.com/aspuru-guzik-group/GA, by replacing penalized logP with the appropriate fitness functions, with otherwise identical settings as JANUS, i.e., same generation size and generation number. The training dataset provided with the benchmark in the corresponding GitHub repository (cf. https://github.com/cieplinski-tobiasz/smina-docking-benchmark), which was taken from the ChEMBL database,69,70 was used to estimate reference synthesizability scores.
Footnotes
† Electronic supplementary information (ESI) available. See https://doi.org/10.1039/d2dd00003b
‡ Equal contributions.
This journal is © The Royal Society of Chemistry 2022 |