Amit Kadan,‡*a Kevin Ryczko,‡*a Erika Lloyd,a Adrian Roitbergb and Takeshi Yamazaki*a
aSandboxAQ, Palo Alto, CA, USA. E-mail: amit.kadan@sandboxaq.com; kevin.ryczko@sandboxaq.com; takeshi.yamazaki@sandboxaq.com
bDepartment of Chemistry, University of Florida, PO Box 117200, Gainesville, Florida 32611-7200, USA
First published on 29th May 2025
Generative AI has the potential to revolutionize drug discovery. Yet, despite recent advances in deep learning, existing models cannot generate molecules that satisfy all desired physicochemical properties. Herein, we describe IDOLpro, a novel generative chemistry AI combining diffusion with multi-objective optimization for structure-based drug design. Differentiable scoring functions guide the latent variables of the diffusion model to explore uncharted chemical space and generate novel ligands in silico, optimizing a plurality of target physicochemical properties. We demonstrate our platform's effectiveness by generating ligands with optimized binding affinity and synthetic accessibility on two benchmark sets. IDOLpro produces ligands with binding affinities over 10–20% higher than the next best state-of-the-art method on each test set, producing more drug-like molecules with generally better synthetic accessibility scores than other methods. We do a head-to-head comparison of IDOLpro against an exhaustive virtual screen of a large database of drug-like molecules. We show that IDOLpro can generate molecules for a range of important disease-related targets with better binding affinity and synthetic accessibility than any molecule found in the virtual screen while being over 100× faster and less expensive to run. On a test set of experimental complexes, IDOLpro is the first to produce molecules with better binding affinities than the experimentally observed ligands. IDOLpro can accommodate other scoring functions (e.g. ADME-Tox) to accelerate hit-finding, hit-to-lead, and lead optimization for drug discovery.
Deep-learning (DL) based generative models for SBDD can replace virtual screening by directly predicting high-affinity ligands for a given protein pocket.15–20 Initially, these models were predominantly autoregressive. Ref. 20 develops a sampling scheme based on Monte-Carlo tree search to condition an autoregressive generative model to generate molecules directly into the protein pocket. Ref. 16 trains an equivariant graph neural network called Pocket2Mol to predict novel molecules in a protein pocket by sequentially predicting atoms of the ligand given the current atomic context. Recently, several diffusion-based generative models have been proposed.18,19 Diffusion models learn to sample molecules by learning to reverse a diffusive noising process.21 Unlike autoregressive techniques, these models allow for consideration of global relationships between atoms in the ligand throughout generation. Ref. 18 introduces an equivariant diffusion model called TargetDiff which generates ligands from scratch directly in the protein pocket. They also showed that with proper parameterization, their model can be used as an unsupervised binding affinity predictor, allowing them to filter promising candidates during generation. Similarly, ref. 19 introduces two equivariant diffusion models for SBDD. On top of de novo generation, these models are able to perform scaffold-hopping and fragment merging, both of which are facilitated by an inpainting-inspired molecular generation scheme.22
DL-based generative models are more efficient than exhaustive database screening and can produce molecules that do not currently exist in commercial databases, expanding the chemical space available during the drug-discovery process.20 However, in some cases they have been shown to produce ligands that do not exhibit desired physicochemical properties, yielding compounds with unphysical structures23,24 or poor synthesizability.25 Just as in database screening, to operate in an inverse-design pipeline, molecules generated by these models must be filtered and ranked with physicochemical descriptors, and there is no guarantee of finding a molecule that satisfies a set of desired properties.
Alternatively, guidance from physicochemical scores can be directly integrated into the generation process.6,8,26 Ref. 20 pairs a policy network with Monte Carlo tree search to design ligands with optimized Vina scores. They apply their methodology to design an inhibitor for the main protease of SARS-CoV-2, producing novel ligands with high binding affinity. Ref. 19 pairs a diffusion model with an evolutionary algorithm to optimize the properties of generated ligands. Ref. 27 uses property predictors to guide a generative model composed of an equivariant autoencoder and transformer decoder in generating molecules with good target properties. These works do not make use of gradient information in their optimization schemes. Gradient information forms the basis of many modern optimization algorithms,28 particularly in DL applications, allowing for optimal scaling with the degrees of freedom of the optimization.29 Ref. 30 introduces classifier guidance, a technique that allows diffusion models to generate samples conditioned on the gradients of classifiers throughout the reverse diffusion process. Similarly, one can use gradients from regressors trained to quantify properties of interest. Regressor guidance has been implemented in recent works on generative molecular design.8,10,31 Ref. 31 uses regressor guidance to re-purpose a generative model trained to generate gas-phase molecules for structure-based drug design. Ref. 10 uses both regressor and classifier guidance to generate optimized polycyclic aromatic systems - molecules important for the design of organic semiconductors. Ref. 8 uses regressor guidance to design drug-like molecules with optimized target properties including protein–ligand interaction, drug-likeness, and synthesizability. 
A drawback of regressor guidance is that it requires training ad-hoc property predictors on intermediate states in the reverse diffusion of a particular generative model, and it cannot directly leverage the abundance of state-of-the-art physicochemical scores and models designed to make predictions on three-dimensional molecules.32–34
In this work, we present IDOLpro (Inverse Design of Optimal Ligands for Protein pockets), a generative workflow that actively guides a diffusion model to generate an optimized set of ligands for a target protein pocket. Specifically, we modify the latent variables of a diffusion model at a carefully chosen time step in the reverse diffusion process to optimize one or more objectives of interest simultaneously. We achieve this by backpropagating gradients obtained by evaluating physicochemical scores directly on the generated molecular structures. This avoids the need for developing ad-hoc regressors capable of making predictions on intermediate noisy representations, and allows us to directly leverage common scores for assessing ligand quality. In this report, we include binding affinity and synthetic accessibility, but our framework is highly modular and can incorporate alternative generators and additional scores. All metrics are written in PyTorch35 and are fully differentiable with respect to the latent variables within the diffusion model, allowing for the use of gradient-based optimization strategies to design optimal ligands.
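As a minimal illustration of this scheme, the loop below follows gradients of a score through a differentiable generator back to a latent variable. The generator, score, and learning rate here are toy stand-ins (not IDOLpro's actual diffusion tail or torchvina/torchSA), chosen so the gradients can be written by hand:

```python
# Toy sketch of latent-variable optimization. A differentiable "generator"
# maps a latent z to a structure x, a differentiable score is evaluated on
# x, and gradients w.r.t. z are followed downhill. In IDOLpro the generator
# is the tail of DiffSBDD's reverse diffusion and the scores are
# torchvina / torchSA; here both are illustrative one-dimensional toys.

def generate(z):
    # stand-in for unwinding the last reverse-diffusion steps from z
    return 2.0 * z + 1.0

def score(x):
    # stand-in for a differentiable physicochemical score (lower is better)
    return (x - 5.0) ** 2

def d_score_dz(z):
    # chain rule: d(score)/dx * dx/dz
    x = generate(z)
    return 2.0 * (x - 5.0) * 2.0

def optimize_latent(z, lr=0.05, steps=100):
    for _ in range(steps):
        z -= lr * d_score_dz(z)
    return z

z0 = 0.0
z_opt = optimize_latent(z0)
print(score(generate(z0)), score(generate(z_opt)))  # score improves
```

In the real workflow the gradient is obtained by automatic differentiation through PyTorch rather than by hand, but the structure of the loop is the same.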
In this report, our method uses DiffSBDD19 as the baseline generative method for predicting ligands within a protein pocket. We use a specific variant of the model, DiffSBDD-cond, which we found generates ligands within the protein pocket without clashes more reliably than the alternative, DiffSBDD-inpaint. We assess the ability of our framework to discover novel ligands with improved binding affinity and synthetic accessibility. To evaluate binding affinity and gather gradient information from it, we have developed a torch-based implementation of the Vina score,32 which we refer to as torchvina. To evaluate synthetic accessibility, we have trained an equivariant neural network model to predict the synthetic accessibility (SA) score first proposed in ref. 36, which we refer to as torchSA. During structural refinement, we also make use of the ANI2x37 neural network potential to optimize intramolecular forces, allowing IDOLpro to produce physically valid molecules.
All three test sets contain protein pocket–ligand pairs. The CrossDocked test set contains 100 pocket–ligand pairs derived by re-docking ligands to non-cognate receptors with smina.41 This test set was used to validate the performance of tools in several other papers,15,16 including DiffSBDD.19 The Binding MOAD test set contains 130 high-resolution (<2.5 Å) experimentally derived pocket–ligand pairs extracted from the Protein Data Bank (PDB). This test set was also used to assess the performance of DiffSBDD. The RGA test set consists of 10 disease-related protein targets, including G-protein coupled receptors (GPCRs) and kinases from the DUD-E dataset,42 as well as the SARS-CoV-2 main protease.43 This is also an experimentally derived test set and was used to assess the performance of several non-deep-learning-based methods in ref. 40.
We report a number of metrics averaged across the protein pockets in each test set. Each metric is recorded before and after optimization with IDOLpro. Reported metrics include the Vina and top-10% Vina scores of generated ligands, and three metrics based on the SA score - the average SA, the top-10% SA, and the percentage of synthesizable molecules generated. In this work, we define a ligand as synthesizable if it achieves an SA score of less than 3.5. Although the inventors of the SA score suggest 6 as the cutoff for synthesizability, a number of papers have found SA scores between 3.5 and 6 to be ambiguous.25,33 A cutoff of 3.5 was used to determine synthesizability in ref. 44. We also record the average QED (quantitative estimate of drug-likeness),45 a metric combining several desirable molecular properties for screening drug-like molecules, and pocket diversity - the average pairwise dissimilarity between all generated molecules for a given protein pocket. Dissimilarity is measured as 1 − Tanimoto similarity. Lastly, we use PoseBusters23 to assess the validity of generated molecules. PoseBusters runs a series of checks using RDKit,46 ensuring that stereochemistry and intra- and inter-molecular constraints - such as appropriate bond lengths, planarity of aromatic rings, and no overlap with the protein pocket - are satisfied.
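The pocket-diversity metric can be sketched as follows. Fingerprints are shown as plain Python sets of "on" bits standing in for the molecular fingerprints (e.g. RDKit Morgan fingerprints) that would be used in practice:

```python
# Sketch of the pocket-diversity metric: average pairwise (1 - Tanimoto
# similarity) over all generated molecules for one pocket. Fingerprints
# are illustrative sets of "on" bits; real ones would come from RDKit.

from itertools import combinations

def tanimoto(fp_a, fp_b):
    inter = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return inter / union if union else 1.0

def pocket_diversity(fingerprints):
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

fps = [{1, 2, 3}, {2, 3, 4}, {5, 6}]
print(round(pocket_diversity(fps), 3))  # → 0.833
```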
The results are shown in Table 1. We find that IDOLpro generates ligands with better Vina and SA scores than DiffSBDD-cond, yielding molecules with ≈20% better Vina scores and ≈21% better SA scores on the CrossDocked test set, and ≈26% better Vina scores and ≈21% better SA scores on the Binding MOAD test set. Furthermore, IDOLpro yields more than double the fraction of synthesizable ligands compared with DiffSBDD-cond on each dataset (51.2% vs. 23.5% and 56.9% vs. 22.6%). Lastly, despite not optimizing for these directly, we find that IDOLpro generates molecules with slightly higher QED and PoseBusters validity (PB-valid in Table 1) than DiffSBDD-cond, achieving higher values on both the CrossDocked and Binding MOAD test sets. Despite IDOLpro performing an optimization in latent space, it does not decrease the diversity of generated molecules, achieving the same average pocket diversity as DiffSBDD-cond on each test set.
Dataset | Method | Vina [kcal mol−1] | Vina10% [kcal mol−1] | SA | SA10% | Synth [%] | QED | Diversity | PB-valid [%] |
---|---|---|---|---|---|---|---|---|---|
CrossDocked | DiffSBDD-cond19 | −5.37 ± 1.93 | −7.70 ± 2.25 | 4.30 ± 0.50 | 2.56 ± 0.63 | 23.5 ± 15.5 | 0.56 ± 0.05 | 0.78 ± 0.07 | 55.9 ± 11.1 |
| IDOLpro | −6.47 ± 2.10 | −8.69 ± 2.45 | 3.41 ± 0.70 | 1.73 ± 0.51 | 51.2 ± 22.7 | 0.63 ± 0.06 | 0.79 ± 0.07 | 64.3 ± 8.3 |
MOAD | DiffSBDD-cond19 | −5.38 ± 2.55 | −7.66 ± 2.37 | 4.18 ± 0.50 | 2.50 ± 0.53 | 26.5 ± 17.8 | 0.58 ± 0.07 | 0.77 ± 0.07 | 53.6 ± 15.7 |
| IDOLpro | −6.77 ± 2.24 | −8.68 ± 2.56 | 3.32 ± 0.66 | 1.79 ± 0.46 | 56.9 ± 22.6 | 0.63 ± 0.08 | 0.77 ± 0.07 | 60.4 ± 16.3 |
We visualize a single example from each benchmark test set in Fig. 3. The figure shows an example of a molecule produced by DiffSBDD prior to latent vector optimization, and after latent vector optimization by IDOLpro. We also picture the corresponding reference molecule, showing that latent vector optimization allows IDOLpro to produce molecules with better metrics than the reference molecule. Overall, IDOLpro is able to co-optimize Vina and SA, producing molecules with significantly better binding affinities and synthetic accessibility when compared to the baseline model, DiffSBDD-cond.
For each method, we report the SA score, top-10% SA score, percentage of synthesizable molecules, QED, pocket diversity, and PoseBusters validity averaged across all protein pockets in the test set. We also report the average time required by each technique to generate a single ligand. Full results are shown in Table 2. We find that IDOLpro generates ligands with better average SA scores than DiffSBDD with regressor guidance, yielding molecules with ≈27% better SA scores (2.79 vs. 3.53). IDOLpro also produces a significantly higher fraction of synthesizable molecules (>20% more) than DiffSBDD with regressor guidance. On the other hand, DiffSBDD with regressor guidance achieves a better top-10% SA score than IDOLpro. A major advantage of regressor guidance over IDOLpro is that it is ≈10× faster at generating a single ligand. This is expected, as regressor guidance requires running reverse diffusion only once, while IDOLpro unwinds the latter portion of reverse diffusion multiple times. IDOLpro yields molecules with significantly better QED and PoseBusters validity than DiffSBDD with regressor guidance, suggesting that injecting gradients throughout reverse diffusion hinders the model's ability to adequately account for both intra- and inter-molecular constraints. This is corroborated by the fact that DiffSBDD without regressor guidance produces significantly more PoseBusters-valid molecules than DiffSBDD with regressor guidance (see Table 1 for vanilla DiffSBDD metrics).
Method | SA | SA10% | Synth [%] | QED | Diversity | PB-valid [%] | Time [s/ligand] |
---|---|---|---|---|---|---|---|
Regressor guidance | 3.53 ± 0.37 | 1.38 ± 0.18 | 50.6 ± 10.5 | 0.37 ± 0.01 | 0.90 ± 0.02 | 7.6 ± 3.6 | 8.11 ± 1.64 |
IDOLpro | 2.79 ± 0.55 | 1.53 ± 0.40 | 71.51 ± 18.6 | 0.62 ± 0.06 | 0.81 ± 0.06 | 43.7 ± 16.2 | 82.33 ± 52.51 |
In ref. 19, the generated ligands of each deep learning model are re-docked to the target protein pocket with QuickVina2.47 In our workflow, we replace re-docking with structural refinement. In QuickVina2, molecular conformations are optimized using both global and local optimization, while structural refinement only performs local optimization on the ligand coordinates via L-BFGS. For more information on structural refinement, see Section 4.6.
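A minimal sketch of this kind of local refinement is shown below: L-BFGS is run on atomic coordinates under a differentiable energy. A Lennard-Jones dimer stands in for the actual ANI2x + torchvina objective, and SciPy's optimizer stands in for the production implementation:

```python
# Minimal sketch of local structural refinement: L-BFGS on coordinates
# under a differentiable energy. The Lennard-Jones dimer is a toy
# stand-in for the ANI2x/torchvina objective used in IDOLpro.

import numpy as np
from scipy.optimize import minimize

def lj_energy(flat_coords, eps=1.0, sigma=1.0):
    r1, r2 = flat_coords[:3], flat_coords[3:]
    r = np.linalg.norm(r1 - r2)
    return 4 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)

# start with the two atoms stretched away from the energy minimum
x0 = np.array([0.0, 0.0, 0.0, 1.8, 0.0, 0.0])
res = minimize(lj_energy, x0, method="L-BFGS-B")
r_opt = np.linalg.norm(res.x[:3] - res.x[3:])
print(r_opt)  # the LJ minimum lies at 2**(1/6) ≈ 1.122
```

Because only a local minimization is performed, the refined pose stays close to the pose proposed by the diffusion model, unlike the global search performed by QuickVina2.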
Full results are shown in Table 3. The performance of the models on the two optimized metrics - Vina and SA - is visualized in Fig. 2. For CrossDocked, IDOLpro achieves greatly improved Vina scores relative to other DL tools, with a 0.71 kcal mol−1 improvement in average Vina score and a 1.03 kcal mol−1 improvement in top-10% Vina score compared to the next best tool in the literature, DiffSBDD-inpaint. Despite not optimizing QED directly, we find that IDOLpro produces ligands with the best QED of all DL tools compared. IDOLpro ranks second for producing molecules with good SA scores, showing the ability of the platform to perform multi-objective optimization. Despite needing to run an entire optimization procedure for each ligand, IDOLpro is still computationally tractable, achieving run times competitive with two other tools - TargetDiff and Pocket2Mol - while achieving a faster runtime than 3D-SBDD (Fig. 3).
Dataset | Method | Vina [kcal mol−1] | Vina10% [kcal mol−1] | SA | QED | Diversity | Time [s/ligand] |
---|---|---|---|---|---|---|---|
CrossDocked | Test set | −6.87 ± 2.32 | — | 3.45 ± 1.26 | 0.48 ± 0.20 | — | — |
| 3D-SBDD15 | −5.89 ± 1.91 | −7.29 ± 2.34 | 3.93 ± 1.26 | 0.50 ± 0.17 | 0.74 ± 0.09 | 328.13 ± 245.43 |
| Pocket2Mol16 | −7.06 ± 2.80 | −8.71 ± 3.18 | 3.23 ± 1.08 | 0.57 ± 0.16 | 0.74 ± 0.15 | 41.79 ± 36.84 |
| GraphBP17 | −4.72 ± 4.03 | −7.17 ± 1.40 | 7.24 ± 0.81 | 0.50 ± 0.12 | 0.84 ± 0.01 | 0.17 ± 0.02 |
| TargetDiff18 | −7.32 ± 2.47 | −9.67 ± 2.55 | 4.74 ± 1.17 | 0.48 ± 0.20 | 0.72 ± 0.09 | ∼57.22 |
| DiffSBDD-cond19 | −6.95 ± 2.06 | −9.12 ± 2.16 | 4.80 ± 1.17 | 0.47 ± 0.21 | 0.73 ± 0.07 | 2.27 ± 0.86 |
| DiffSBDD-inpaint19 | −7.33 ± 2.56 | −9.93 ± 2.59 | 5.01 ± 1.08 | 0.47 ± 0.18 | 0.76 ± 0.05 | 2.67 ± 1.22 |
| IDOLpro | −8.04 ± 2.55 | −10.96 ± 3.02 | 3.41 ± 0.70 | 0.63 ± 0.06 | 0.79 ± 0.07 | 58.80 ± 32.97 |
MOAD | Test set | −8.41 ± 2.03 | — | 3.77 ± 1.08 | 0.52 ± 0.17 | — | — |
| GraphBP17 | −4.84 ± 2.24 | −6.63 ± 0.95 | 7.21 ± 0.81 | 0.51 ± 0.11 | 0.83 ± 0.01 | 0.23 ± 0.03 |
| DiffSBDD-cond19 | −7.17 ± 1.89 | −9.18 ± 2.23 | 4.89 ± 1.08 | 0.44 ± 0.20 | 0.71 ± 0.08 | 5.61 ± 1.42 |
| DiffSBDD-inpaint19 | −7.31 ± 4.03 | −9.84 ± 2.18 | 4.47 ± 1.08 | 0.54 ± 0.21 | 0.74 ± 0.05 | 6.17 ± 2.08 |
| IDOLpro | −8.74 ± 2.59 | −11.23 ± 3.12 | 3.32 ± 0.66 | 0.63 ± 0.08 | 0.77 ± 0.07 | 82.30 ± 45.07 |
Fig. 3 Molecules produced by IDOLpro when optimizing torchvina and torchSA. One example from each test set is shown - protein 4aua from CrossDocked, and protein 3cjo from the Binding MOAD. Left column: reference molecules from the test sets. Middle column: initial ligand produced by DiffSBDD prior to latent vector optimization. Right column: molecule produced by IDOLpro after optimizing torchvina and torchSA. For each example, the molecule produced by DiffSBDD has worse Vina and SA scores than the reference molecule. After optimization with IDOLpro, both the Vina and SA scores of the generated molecule are better than the reference. Visualizations were created with PyMol,48 and interactions were visualized with the protein–ligand interaction profiler (PLIP).49
IDOLpro is almost always able to find ligands that improve upon both the Vina and SA scores relative to the reference ligand, failing to do so for only one target - the protein pocket defined by ligand x2p bound to the protein with PDB ID 5mma. For this protein pocket, IDOLpro generated molecules with significantly better SA scores than the reference molecule, but did not generate a single molecule with a better Vina score. We hypothesize that re-weighting the optimization objective to more heavily favor the Vina score would allow IDOLpro to generate a ligand that improves upon both metrics.
For the Binding MOAD, the advantage of IDOLpro for generating molecules with high binding affinity is even more pronounced, with a 1.43 kcal mol−1 improvement in average Vina score, and 1.40 kcal mol−1 improvement in top-10% Vina score compared to the next best method, DiffSBDD-inpaint. In particular, IDOLpro is the first ML tool to generate molecules with a better average Vina score than those of the reference molecules in the Binding MOAD test set. This is noteworthy, because unlike molecules in CrossDocked, molecules in the Binding MOAD were derived through experiment. Out of the four methods compared, IDOLpro also achieves the best SA, improving upon the next best method by 1.15, while also achieving the best QED despite not optimizing for it directly. The time to generate a single ligand for a protein pocket in the Binding MOAD test set is slower than for the CrossDocked test set, and reflects DiffSBDD's slowdown in proposing novel molecules for targets in this set.
IDOLpro is able to find a molecule with better Vina and SA scores than the reference ligand for 126/130 targets (≈97% of the time). As in the one CrossDocked case discussed above, for the remaining four targets IDOLpro finds molecules with significantly better SA scores than the reference, but fails to find a molecule with a better Vina score. All four of these cases have reference molecules with SA scores of over 7, whereas IDOLpro does not produce a single molecule with an SA score over 6. We again hypothesize that this could be addressed by re-weighting the optimization objective to more heavily favor the Vina score.
Overall IDOLpro can effectively co-optimize multiple objectives, generating ligands with state-of-the-art binding affinity and synthetic accessibility on two test sets. Improving other metrics with IDOLpro is straightforward, simply requiring a differentiable score for evaluating the desired metric. Part of our future work will be to focus on including differentiable scoring functions for other desirable properties such as solubility, toxicity, etc.
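The scalarized objective underlying this co-optimization, and the re-weighting remedy hypothesized above, can be sketched as a weighted sum of scores to be minimized. The weights and candidate scores below are illustrative, not IDOLpro's actual values:

```python
# Sketch of a weighted-sum multi-objective: both terms are minimized
# (Vina is more negative for tighter binders, SA is lower for more
# synthesizable molecules). Weights and candidates are illustrative.

def combined_objective(vina, sa, w_vina=1.0, w_sa=1.0):
    return w_vina * vina + w_sa * sa

# two hypothetical candidates: A binds tighter, B is easier to make
candidates = {"A": (-9.0, 4.5), "B": (-7.5, 2.5)}

def best(w_vina, w_sa):
    return min(candidates,
               key=lambda k: combined_objective(*candidates[k], w_vina, w_sa))

print(best(1.0, 1.0))  # equal weights favour the synthesizable candidate B
print(best(3.0, 1.0))  # up-weighting Vina favours the tighter binder A
```

Changing the weights changes which candidate the optimization prefers, which is exactly the re-weighting lever discussed for the few targets where the Vina score failed to improve.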
For this experiment, we measure how quickly IDOLpro can generate a ligand with both better binding affinity (Vina score) and synthetic accessibility (SA score) than the ligand with the best binding affinity found when screening ZINC250K. For screening ZINC250K, we use QuickVina247 to quickly dock molecules to each of the 10 target protein pockets in the RGA test set. We run docking on an AWS compute-optimized instance with 8 CPU cores. We then use IDOLpro to generate optimized ligands, making note of the number of ligands, time, and cost required for IDOLpro to generate an improved ligand relative to virtually screening ZINC250K. Cost is based on the AWS pricing for the requested instances. The results are shown in Table 4.
Method | Nligands | Time [h] | Cost [$] |
---|---|---|---|
IDOLpro | 44.90 ± 46.06 | 0.93 ± 0.75 | 1.06 ± 0.86 |
Virtual Screen | 250 000 | 160.90 ± 24.12 | 109.41 ± 16.40 |
Screening ZINC250K using QuickVina2 takes an average of ≈161 hours per protein pocket. On the other hand, IDOLpro is able to find a ligand with better Vina and SA scores than the virtual screen in under an hour (≈56 minutes) on average. This translates to a ≈173× speedup in time and a ≈103× reduction in cost. In 4/10 cases, IDOLpro finds a ligand with better Vina and SA scores within the first 10 ligands generated, and in 9/10 cases within the first 100. For a single case (PDB ID 3eml), IDOLpro needs to produce 153 ligands to find one with better binding affinity and synthetic accessibility than the best found in the virtual screen. Even in this case, IDOLpro takes ≈2.5 hours to run on the requested GPU instance, translating to a ≈66× speedup in time and a ≈39× reduction in cost compared to virtually screening ZINC250K.
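The headline speedup and cost figures follow directly from the averages in Table 4:

```python
# Reproducing the speedup and cost figures from Table 4.
screen_hours, screen_cost = 160.90, 109.41  # virtual screen, per pocket
idol_hours, idol_cost = 0.93, 1.06          # IDOLpro, per pocket

print(round(screen_hours / idol_hours))  # ≈ 173x faster
print(round(screen_cost / idol_cost))    # ≈ 103x cheaper
print(round(idol_hours * 60))            # ≈ 56 minutes per target
```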
In general, for a given protein pocket, IDOLpro produces ligands with significantly better binding affinities than those found when screening a standard drug database such as ZINC250K. In Fig. 4, we track the average running top-1 and top-10 Vina scores for IDOLpro across the 10 protein targets as a function of the number of ligands generated. For both the top-1 and top-10 Vina score, IDOLpro quickly surpasses the virtual screen of ZINC250K - needing to generate on average only 1 ligand to surpass the top-1 Vina score of the virtual screen, and needing to generate on average only 15 ligands to surpass the top-10 Vina score of the virtual screen. We also plot the distribution of Vina scores for both the ligands produced by IDOLpro, and those found when screening ZINC250K for 2 of the targets from the RGA test set - 2rgp, a protein whose over-expression has been associated with human tumor growth,50 and 7l11 - the SARS-CoV-2 main protease.43 These plots are shown in Fig. 4. These plots further validate IDOLpro's ability to generate ligands with high binding affinity relative to virtual screening.
Fig. 4 Comparison of IDOLpro to virtually screening ZINC250K. Top center: the average top-1 and top-10 Vina scores of IDOLpro-generated molecules compared to screened molecules from ZINC250K across all 10 targets. Bottom left: the distribution of Vina scores for generated and screened ligands for EGFR, an important oncology target (PDB ID 2rgp). Bottom right: the distribution of Vina scores for generated and screened ligands for the SARS-CoV-2 main protease (PDB ID 7l11).
Method | Vinatop-100 [kcal mol−1] | Vinatop-10 [kcal mol−1] | Vinatop-1 [kcal mol−1] | SA | QED | Diversity |
---|---|---|---|---|---|---|
MARS51 | −7.76 ± 0.61 | −8.8 ± 0.71 | −9.26 ± 0.79 | 2.69 ± 0.08 | 0.71 ± 0.01 | 0.88 ± 0.00 |
MolDQN52 | −6.29 ± 0.40 | −7.04 ± 0.49 | −7.50 ± 0.40 | 5.83 ± 0.18 | 0.17 ± 0.02 | 0.88 ± 0.01 |
GEGL53 | −9.06 ± 0.92 | −9.91 ± 0.99 | −10.45 ± 1.04 | 2.99 ± 0.05 | 0.64 ± 0.01 | 0.85 ± 0.00 |
REINVENT54 | −10.81 ± 0.44 | −11.23 ± 0.63 | −12.01 ± 0.83 | 2.60 ± 0.12 | 0.45 ± 0.06 | 0.86 ± 0.01 |
RationaleRL55 | −9.23 ± 0.92 | −10.83 ± 0.86 | −11.64 ± 1.10 | 2.92 ± 0.13 | 0.32 ± 0.02 | 0.72 ± 0.03 |
GA + D56 | −8.69 ± 0.45 | −9.29 ± 0.58 | −9.83 ± 0.32 | 3.45 ± 0.12 | 0.70 ± 0.02 | 0.87 ± 0.01 |
Graph-GA57 | −10.48 ± 0.86 | −11.70 ± 0.93 | −12.30 ± 1.91 | 3.50 ± 0.37 | 0.46 ± 0.07 | 0.81 ± 0.04 |
Autogrow 4.0 | −11.37 ± 0.40 | −12.21 ± 0.62 | −12.47 ± 0.84 | 2.50 ± 0.05 | 0.75 ± 0.02 | 0.85 ± 0.01 |
RGA40 | −11.87 ± 0.17 | −12.56 ± 0.29 | −12.89 ± 0.47 | 2.47 ± 0.05 | 0.74 ± 0.04 | 0.86 ± 0.02 |
IDOLpro | −14.59 ± 1.51 | −16.26 ± 1.66 | −17.35 ± 2.10 | 3.77 ± 0.33 | 0.64 ± 0.06 | 0.72 ± 0.04 |
IDOLpro greatly outperforms non-DL techniques in terms of Vina score, improving on the next best method by ≈23%, ≈29%, and ≈35% in terms of average top-100, top-10, and top-1 Vina score, respectively. Unlike in the comparison with other DL methods, IDOLpro ranks behind most of these methods in terms of average SA, ranking 9th out of the 10 methods compared. IDOLpro is middle-of-the-pack in terms of QED, ranking 5th out of the 10 methods compared. This shows that IDOLpro, and deep learning methods in general, still lag behind advanced non-DL methods in producing molecules with comparable synthesizability and drug-likeness. This is an ongoing area of research,23,25 and is an aspect that we would like to improve in IDOLpro.
We test the capability of our framework for performing lead optimization with examples from the CrossDocked test set. We specify the atoms to fix by using RDKit to identify the Bemis-Murcko scaffold61 of each reference ligand. Targets are removed from the test set if no scaffold is found (13 cases), if the scaffold comprises more than 90% of the full reference ligand (13 cases), or if it contains atoms not supported by the ANI2x model (3 cases). The lead-optimization results with IDOLpro for the remaining 71 protein and scaffold pairs are shown in Table 6, where they are compared to the original reference ligands. We report the Vina and SA scores of the un-docked reference ligands, since their scaffolds' coordinates act as the seed for IDOLpro optimization.
Method | Vina [kcal mol−1] | Vina10% [kcal mol−1] | SA | SA10% |
---|---|---|---|---|
Test Set | −5.58 ± 2.32 | — | 3.67 ± 1.23 | — |
IDOLpro | −7.17 ± 2.36 | −8.96 ± 2.57 | 4.12 ± 1.10 | 2.90 ± 1.12 |
We find that the average Vina score of the optimized ligands exceeds that of the reference ligands in the test set by 1.59 kcal mol−1. Although the average SA is higher, both the top-10% SA score and the top-10% Vina score are significantly better than those of the test set, and are more realistic to consider when assessing lead-optimization capability. A molecule with both better SA and Vina scores compared to the reference ligand was found for all but one target, where we were able to improve the SA score but not the Vina score. As discussed previously, a possible remedy would be re-weighting the optimization objective to more heavily favor the Vina score.
Two examples of generated molecules that retain the original seed scaffold and successfully improve both the Vina and SA scores compared to the reference ligand are shown in Fig. 5. The first is a case where the identified scaffold is a small portion of the reference (ligand with PDB residue ID plp docked into protein 2jjg), and we see a large diversity in the generated molecules. The second is a case where the scaffold is a large component of the reference (ligand with PDB residue ID 1ly docked into protein 1a2g) where we are effectively optimizing functional groups on a fixed scaffold. We believe our results demonstrate the flexibility of atom fixing in IDOLpro, and its utility in performing lead optimization.
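Conceptually, atom fixing amounts to masking the scaffold atoms out of the coordinate updates so that only the decorations move. The arrays and plain gradient step below are an illustrative stand-in for IDOLpro's actual update:

```python
# Sketch of atom fixing for lead optimization: scaffold atoms are masked
# out of the coordinate update so only the decorations move. Coordinates,
# mask, and the gradient step are illustrative toys.

import numpy as np

coords = np.array([[0.0, 0.0, 0.0],   # scaffold atom (fixed)
                   [1.5, 0.0, 0.0],   # scaffold atom (fixed)
                   [3.0, 0.0, 0.0]])  # decoration atom (free)
fixed_mask = np.array([True, True, False])

def apply_update(coords, gradient, lr=0.1):
    update = -lr * gradient
    update[fixed_mask] = 0.0  # scaffold atoms never move
    return coords + update

grad = np.ones_like(coords)
new_coords = apply_update(coords, grad)
print(new_coords[0], new_coords[2])  # scaffold row unchanged, free row moved
```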
A DDPM defines a forward process that gradually noises a data sample z_0, with marginal

q(z_t | z_0) = N(z_t; α_t z_0, σ_t^2 I) | (1)

where α_t and σ_t follow a variance-preserving noise schedule (α_t^2 + σ_t^2 = 1). The model is trained to predict the noise added at each step by minimizing

L_DDPM = E_{z_0, ε∼N(0,I), t} [ ‖ε − ε̂_θ(z_t, t)‖^2 ], with z_t = α_t z_0 + σ_t ε | (2)
DiffSBDD is an SE(3)-equivariant64 3D-conditional DDPM which respects translation, rotation, and permutation symmetries. DiffSBDD was trained to create ligands with high binding affinity given a target protein pocket. In DiffSBDD, data samples consist of protein pocket and ligand point clouds, i.e., z = [zℓ, zp]. Each point cloud consists of atom types and coordinates, z = [r, h], where r ∈ ℝ^{N×3} is a tensor of atomic coordinates, and h ∈ ℝ^{N×d} is a tensor of atomic probabilities over the d atom types which the model can generate. Within the model, each z_t is converted to a graph, and processed by an EGNN65 to produce a prediction ε̂_θ(z_t, t). DiffSBDD contains two different models for 3D pocket conditioning - a conditional DDPM that receives a fixed pocket representation as the context in each denoising step, and a model that is trained to approximate the joint distribution of ligand-protein pocket pairs and is combined with an inpainting-inspired22,66 sampling procedure at inference time to yield conditional samples of ligands given a fixed protein pocket. In this work, we focus only on the conditional DDPM for ligand generation. Our framework requires that generated ligands do not overlap with the target protein pocket, as the ligands' poses later undergo local structural refinement. We found that the conditional DDPM model can consistently generate ligands satisfying this constraint.
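The noise parameterization used by the DDPM can be checked numerically: under a variance-preserving schedule, a perfect noise prediction recovers the clean sample exactly. The scalar values below are illustrative:

```python
# Numeric check of the epsilon-parameterization of a DDPM:
# z_t = alpha_t * z_0 + sigma_t * eps, so a perfect noise prediction
# recovers z_0 exactly via z_0 = (z_t - sigma_t * eps) / alpha_t.

import math

alpha_t = 0.8
sigma_t = math.sqrt(1.0 - alpha_t ** 2)  # variance-preserving schedule

z0, eps = 2.5, -0.7  # illustrative clean sample and noise draw
zt = alpha_t * z0 + sigma_t * eps
z0_hat = (zt - sigma_t * eps) / alpha_t
print(abs(z0_hat - z0) < 1e-12)  # True
```

In practice ε is not known at sampling time; the EGNN's prediction ε̂_θ(z_t, t) takes its place, and the quality of the recovered ẑ_0 depends on how well the network has learned to denoise.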
For each pocket-conditioning scheme, DiffSBDD contains two versions of the model - one trained on a subset of CrossDocked,38 and another trained on a subset of the Binding MOAD.39 For CrossDocked, ref. 19 used the same train/test splits as in ref. 15 and 16, resulting in 100 000 complexes for training and 100 protein pockets for testing. For the Binding MOAD, ref. 19 filtered the database to contain only molecules with atom types compatible with their model, and removed corrupted entries, resulting in 40 344 complexes for training and 130 protein pockets for testing. In this work, we use only the version of DiffSBDD trained on CrossDocked, as we found we were able to achieve good results on both test sets (CrossDocked and the Binding MOAD) with just this model.
DiffSBDD was shown to achieve state-of-the-art performance on both test sets. In particular, DiffSBDD achieved the best average and best top-10% Vina scores when compared with other state-of-the-art models in the literature - 3D-SBDD,15 Pocket2Mol,16 GraphBP,17 and TargetDiff.18 We note that, although DiffSBDD is used as our baseline model in this report, our framework is not limited to this specific model - any other diffusion model can take its place.
In addition, we apply four more checks to ensure the structural validity of the ligand. These checks are necessary for running structural refinement with IDOLpro, which makes use of the ANI2x model.37 Structural refinement is described in Section 4.6. We first make sure that the ligand contains only atoms compatible with ANI2x; DiffSBDD can generate ligands with four atom types that are incompatible with ANI2x - B, P, Br, and I. We also make sure that the bond lengths in the ligand are reasonable by referring to covalent radii, and that the ligand does not overlap with the protein pocket; both checks are done via ASE's68 (Atomic Simulation Environment) NeighborList class. Lastly, we make sure that atoms do not significantly overlap within the ligand itself, which is checked via pymatgen's69 Molecule class.
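The covalent-radius bond check can be sketched as follows. The radii are standard tabulated values, but the tolerance is illustrative, and the real neighbor search is done with ASE's NeighborList rather than an explicit loop:

```python
# Sketch of the bond-length sanity check: a bond is "reasonable" if its
# length is close to the sum of the two atoms' covalent radii. Radii are
# standard values in angstroms; the tolerance is illustrative.

COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

def bond_is_reasonable(elem_a, elem_b, distance, tol=0.4):
    expected = COVALENT_RADII[elem_a] + COVALENT_RADII[elem_b]
    return abs(distance - expected) <= tol

print(bond_is_reasonable("C", "C", 1.54))  # typical C-C single bond: True
print(bond_is_reasonable("C", "C", 0.80))  # atoms nearly overlapping: False
```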
The Vina force field is composed of a weighted sum of atomic interactions. Steric, hydrophobic, and hydrogen bonding interactions are calculated and weighted according to a nonlinear fit to structural data.32 The final score is re-weighted by the number of rotatable bonds to account for entropic penalties.71 The Vina score is composed of a sum of intramolecular and intermolecular terms, both of which are integrated into our implementation.
To validate our implementation of torchvina, we took four systems from our validation set (details in the ESI†). The Vina score consists of a weighted sum of five individual terms: three steric terms (two attractive Gaussian terms, gauss1 and gauss2, and a repulsion term), a hydrophobic interaction term, and, where applicable, a hydrogen-bonding term. For these four systems, we made sure that torchvina produced the same value as the smina41 tool on each of these sub-terms, as well as on the total intermolecular and intramolecular energies. These checks are included in the tests directory of our GitHub repository.
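The five sub-terms and their weighted sum can be sketched as plain functions of the surface distance d between two atoms (interatomic distance minus the sum of van der Waals radii). The functional forms and weights below are the published AutoDock Vina ones (Trott & Olson, 2010), reproduced here as a reference sketch rather than IDOLpro's actual torchvina implementation.

```python
import math

# Published AutoDock Vina term weights (kcal/mol units after summation).
W = {"gauss1": -0.035579, "gauss2": -0.005156, "repulsion": 0.840245,
     "hydrophobic": -0.035069, "hbond": -0.587439}
W_ROT = 0.05846  # entropic re-weighting per rotatable bond

def gauss1(d):      # narrow attractive Gaussian centered at d = 0
    return math.exp(-(d / 0.5) ** 2)

def gauss2(d):      # wide attractive Gaussian centered at d = 3 A
    return math.exp(-((d - 3.0) / 2.0) ** 2)

def repulsion(d):   # quadratic penalty for overlapping surfaces (d < 0)
    return d * d if d < 0 else 0.0

def hydrophobic(d): # linear ramp: 1 below 0.5 A, 0 above 1.5 A
    if d <= 0.5:
        return 1.0
    if d >= 1.5:
        return 0.0
    return 1.5 - d

def hbond(d):       # linear ramp: 1 below -0.7 A, 0 above 0 A
    if d <= -0.7:
        return 1.0
    if d >= 0.0:
        return 0.0
    return d / -0.7

def pair_energy(d, hydrophobic_pair=False, hbond_pair=False):
    """Weighted sum of the five terms for one atom pair."""
    e = (W["gauss1"] * gauss1(d) + W["gauss2"] * gauss2(d)
         + W["repulsion"] * repulsion(d))
    if hydrophobic_pair:
        e += W["hydrophobic"] * hydrophobic(d)
    if hbond_pair:
        e += W["hbond"] * hbond(d)
    return e

def vina_score(pair_energies, n_rot):
    """Re-weight the summed pair energies by the number of rotatable bonds."""
    return sum(pair_energies) / (1.0 + W_ROT * n_rot)
```

In torchvina, the same piecewise forms would be expressed with differentiable tensor operations so that gradients can flow back to the latent vectors.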
To validate the torchSA model for guiding DiffSBDD towards generating molecules with high synthetic accessibility, we generated 100 molecules using IDOLpro with only torchvina guidance for each target in our validation set (see ESI† for details on structures in this set). For each generated molecule, we save the entire trajectory of point clouds required to produce the final IDOLpro molecule, i.e., the point cloud produced by DiffSBDD during each gradient update of the latent vectors. For each of these intermediate point clouds, we store the coordinates and raw atomic probabilities. For each point cloud, we also store its corresponding RDKit molecule, constructed by taking the most likely atom type for each probability distribution and adding bonds using OpenBabel.73 In total, we generate 25 926 point cloud–molecule pairs for evaluating the torchSA model. For each pair, we compare the output of torchSA on the point cloud to the SA score of the corresponding molecule. We plot the correlation between the two in Fig. 6, and include both Pearson's ρ and the mean absolute error in the figure.
Fig. 6 Correlation plot between the predictions of the torchSA model and the SA score as computed in RDKit.46 The torchSA model is validated on ≈25 926 point cloud–molecule pairs.
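The two summary statistics reported in Fig. 6 are standard and easy to reproduce. A minimal sketch, assuming `preds` and `targets` are paired lists of torchSA predictions and RDKit SA scores:

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def mae(x, y):
    """Mean absolute error between predictions and targets."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)
```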
When optimizing latent vectors in DiffSBDD, we do not modify the initial latent vectors used by the model. Instead, we define an optimization horizon, t_hz. Latent vectors are first generated up to the optimization horizon, z_T, z_(T−1), …, z_(t_hz). This latent vector is saved, and the remaining latent vectors, z_(t_hz−1), …, z_0, are generated. Upon passing the ligand validity checks (Section 4.3), the gradient of the objective with respect to z_(t_hz) is evaluated, and z_(t_hz) is modified using a gradient-based optimizer. When re-generating ligands, rather than starting from z_T, only the latent vectors following the optimization horizon are re-generated, i.e., z_(t_hz−1), …, z_0. Optimization continues until a maximum number of steps, N_max, have been taken in the latent space. When de-noising z_t during lead optimization, a mask is used to keep specific atoms fixed during the inpainting procedure. We employ backtracking, early-stopping, and learning-rate decay, all described in the ESI.† The optimization of a single ligand is provided in Algorithm 1.
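The outer loop of this procedure can be sketched as follows. Everything here is a toy stand-in: `denoise_from`, `decode`, `objective`, `grad_fn`, and `is_valid` are hypothetical callables abstracting DiffSBDD's reverse process and the evaluator stack, and a plain gradient step replaces Adam with backtracking for brevity.

```python
def optimize_ligand(z_hz, denoise_from, decode, objective, grad_fn,
                    is_valid, lr=0.1, n_max=10):
    """Sketch of latent-vector optimization from a fixed horizon.

    z_hz         -- latent state saved at the optimization horizon t_hz
    denoise_from -- runs the remaining reverse-diffusion steps z_hz -> z_0
    decode       -- maps z_0 to a ligand (point cloud / molecule)
    objective    -- scalar score to minimize (e.g. torchvina [+ torchSA])
    grad_fn      -- gradient of the objective with respect to z_hz
    is_valid     -- ligand validity checks (Section 4.3)
    """
    best_z, best_score = z_hz, float("inf")
    for _ in range(n_max):
        # Only the steps after the horizon are re-run; z_T..z_hz stay fixed.
        ligand = decode(denoise_from(z_hz))
        if is_valid(ligand):
            score = objective(ligand)
            if score < best_score:
                best_z, best_score = z_hz, score
            # Gradient step on the horizon latent (Adam in the real method).
            z_hz = [zi - lr * gi for zi, gi in zip(z_hz, grad_fn(z_hz))]
    return best_z, best_score
```

With a toy quadratic objective and identity denoiser, the loop steadily drives the score down, mirroring how Algorithm 1 descends in the latent space.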
In this work, we focus on two combinations of evaluators: torchvina on its own, and torchvina in combination with torchSA. We use the Adam79 optimizer with a learning rate of 0.1, β1 = 0.5, and β2 = 0.999 to modify latent vectors. We perform hyperparameter optimization to choose the optimization horizon, as described in the ESI.†
We use the L-BFGS optimizer in PyTorch35 to perform coordinate optimization. Our optimization algorithm is implemented with PyTorch and is parallelizable on a GPU. Using structural refinement instead of re-docking each generated ligand with auto-docking software such as AutoDock Vina32 or QuickVina247 affords an increase in overall computational efficiency. In this work, we use only one combination of evaluators to perform coordinate optimization: torchvina and ANI2x.37
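A minimal sketch of coordinate refinement with `torch.optim.LBFGS` is shown below. The `energy_fn` here is a placeholder for the combined torchvina (intermolecular) and ANI2x (intramolecular) energy; the step count and optimizer settings are illustrative, not IDOLpro's.

```python
import torch

def refine_coords(coords, energy_fn, steps=5):
    """Relax ligand coordinates by minimizing a differentiable energy.

    coords    -- (N, 3) tensor of ligand atom positions
    energy_fn -- differentiable scalar energy of the coordinates
                 (stand-in for torchvina inter + ANI2x intra terms)
    """
    x = coords.clone().requires_grad_(True)
    opt = torch.optim.LBFGS([x], max_iter=20)

    def closure():
        # L-BFGS re-evaluates the loss several times per step.
        opt.zero_grad()
        e = energy_fn(x)
        e.backward()
        return e

    for _ in range(steps):
        opt.step(closure)
    return x.detach()
```

Because the energy is evaluated in batch on the GPU, many ligands can be relaxed in parallel, which is the source of the speed-up over per-ligand re-docking reported in Table 7.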
To balance the validity of relaxed molecules with docking performance, we consider various combinations of intra- and intermolecular forces derived from the torchvina and ANI2x potentials. We run structural refinement on a set of 200 ligands generated by IDOLpro when seeded with pockets in our validation set (see ESI† for details). We tabulate the results in Table 7, including metrics when QuickVina247 is used for re-docking. For each setting, we compute the average and top-10% Vina scores of the relaxed molecules, as well as the percentage of molecules that remain valid according to our validity checks after structural refinement. We initially found that invalid molecules were often caused by bonded atoms being pulled apart during structural refinement. To remedy this, we add an L1 penalty for violating bonds in the molecule. We include the effect of adding this penalty in Table 7.
| Method | Torchvina | ANI2x | L1 bond penalty | Vina [kcal mol−1] | Vina10% [kcal mol−1] | Validity [%] | Time [s] |
|---|---|---|---|---|---|---|---|
| QuickVina2 | — | — | — | −8.51 | −9.51 | 95.0 | 75.0 |
| IDOLpro | Inter + Intra | Inter + Intra | ✗ | −9.26 | −10.35 | 80.1 | 19.34 |
| IDOLpro | Inter + Intra | Intra | ✗ | −9.33 | −10.41 | 77.2 | 22.49 |
| IDOLpro | Inter | Inter + Intra | ✗ | −9.38 | −10.53 | 82.1 | 18.53 |
| IDOLpro | Inter | Intra | ✗ | −9.53 | −10.70 | 81.6 | 23.90 |
| IDOLpro | Inter | Intra | ✓ | −9.39 | −10.68 | 86.6 | 24.45 |
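The L1 bond penalty referenced in Table 7 can be sketched as a sum of absolute deviations of bonded distances from their reference lengths. The penalty weight and the use of pre-refinement bond lengths as references are assumptions for illustration.

```python
import math

def l1_bond_penalty(coords, bonds, ref_lengths, weight=1.0):
    """L1 penalty discouraging bonded atoms from being pulled apart.

    coords      -- list of (x, y, z) atom positions
    bonds       -- list of (i, j) bonded atom-index pairs
    ref_lengths -- reference length for each bond (e.g. pre-refinement values)
    """
    total = 0.0
    for (i, j), d0 in zip(bonds, ref_lengths):
        d = math.dist(coords[i], coords[j])
        total += abs(d - d0)
    return weight * total
```

Added to the refinement energy, this term raises the validity of relaxed molecules (77–82% → 86.6% in Table 7) at a modest cost in Vina score.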
To implement regressor guidance in DiffSBDD, we train a regressor to predict the SA score of a molecule produced by DiffSBDD given an intermediate noisy latent vector. To do so, we run DiffSBDD and save the entire trajectory of latent vectors, z_T, z_(T−1), …, z_0. We then train an equivariant graph neural network (EGNN)65 to minimize the L2 loss between its prediction of the SA score given an intermediate noisy latent vector and the SA score of the final molecule. More information about the training of the EGNN regressor can be found in the ESI.†
At each step in the denoising process, a gradient term from the trained regressor is added to the sampled latent vector, steering it toward lower predicted SA scores, i.e., z_(t−1) ← z_(t−1) − s∇_(z_t) ŷ(z_t), where ŷ(z_t) is the regressor's SA prediction and s is a guidance scale.
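This per-step modification can be sketched as a thin wrapper around an ordinary reverse-diffusion sample. Both `sample_prev` (the denoiser's sampling step) and `regressor_grad` (the gradient of the EGNN's SA prediction) are hypothetical stand-ins, and the guidance scale is an assumed value.

```python
def guided_denoise_step(z_t, sample_prev, regressor_grad, scale=0.1):
    """One reverse-diffusion step with regressor guidance.

    z_t            -- current noisy latent vector
    sample_prev    -- draws z_(t-1) from the unmodified reverse process
    regressor_grad -- gradient of the predicted SA score w.r.t. z_t
    scale          -- guidance strength s
    """
    z_prev = sample_prev(z_t)          # ordinary denoising sample
    grad = regressor_grad(z_t)         # d(SA prediction) / d z_t
    # Nudge the sample down the regressor's gradient (toward lower SA score).
    return [zp - scale * g for zp, g in zip(z_prev, grad)]
```

Applied at every step t = T, …, 1, this biases the whole trajectory toward synthetically accessible molecules without retraining the diffusion model itself.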
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5sc01778e |
‡ Equal Contributions. |
This journal is © The Royal Society of Chemistry 2025 |