Jike
Wang‡
a,
Hao
Luo‡
a,
Rui
Qin‡
a,
Mingyang
Wang
a,
Xiaozhe
Wan
b,
Meijing
Fang
a,
Odin
Zhang
a,
Qiaolin
Gou
a,
Qun
Su
a,
Chao
Shen
a,
Ziyi
You
a,
Liwei
Liu
*b,
Chang-Yu
Hsieh
*a,
Tingjun
Hou
*a and
Yu
Kang
*a
aCollege of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China. E-mail: yukang@zju.edu.cn; tingjunhou@zju.edu.cn; kimhsieh@zju.edu.cn
bAdvanced Computing and Storage Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd, Nanjing 210000, Jiangsu, China. E-mail: liuliwei5@huawei.com
First published on 4th December 2024
The generation of three-dimensional (3D) molecules based on target structures represents a cutting-edge challenge in drug discovery. Many existing approaches often produce molecules with invalid configurations, unphysical conformations, suboptimal drug-like qualities, limited synthesizability, and require extensive generation times. To address these challenges, we present 3DSMILES-GPT, a fully language-model-driven framework for 3D molecular generation that utilizes tokens exclusively. We treat both two-dimensional (2D) and 3D molecular representations as linguistic expressions, combining them through full-dimensional representations and pre-training the model on a vast dataset encompassing tens of millions of drug-like molecules. This token-only approach enables the model to comprehensively understand the 2D and 3D characteristics of large-scale molecules. Subsequently, we fine-tune the model using pair-wise structural data of protein pockets and molecules, followed by reinforcement learning to further optimize the biophysical and chemical properties of the generated molecules. Experimental results demonstrate that 3DSMILES-GPT generates molecules that comprehensively outperform existing methods in terms of binding affinity, drug-likeness (QED), and synthetic accessibility score (SAS). Notably, it achieves a 33% enhancement in the quantitative estimation of QED, meanwhile the binding affinity estimated by Vina docking maintaining its state-of-the-art performance. The generation speed is remarkably fast, with the average time approximately 0.45 seconds per generation, representing a threefold increase over the fastest existing methods. This innovative 3DSMILES-GPT approach has the potential to positively impact the generation of 3D molecules in drug discovery.
During earlier periods, ligand-based molecular generation (LBMG) gained significant popularity. These methodologies can be categorized into two primary types based on how generated molecules are represented: graph-based molecular generation and sequence-based molecular generation. The fundamental principle involves representing molecules as graphs or sequences, thus framing the generation task as either a graph structure generation or natural language generation problem. Techniques such as Bayesian optimization (BO) and reinforcement learning are employed to guide the model in generating the desired drug molecules.
Molecules inherently possess structures resembling graphs, rendering it intuitive to express their information graphically. Consequently, methods for molecular design grounded in graph representations and traditional heuristic algorithms have long been established. For example, Brown et al. devised a molecular optimization algorithm based on molecular graphs by employing genetic algorithms in 2004,1 while in 2013, Virshup et al. introduced the ACSESS algorithm.2 With the progression of graph neural networks (GNN) in recent years, these networks have exhibited remarkable adaptability across diverse challenges rooted in graph-structured data. De Cao et al. pioneered the integration of GNN for drug design with their 2018 proposal of MolGAN,3 thereby forging novel pathways in molecular design. As graph-based methodologies continue to advance rapidly, an escalating number of researchers are capitalizing on molecular graph representations for drug design.4–8
When compared to GNN-driven strategies for molecular synthesis, sequence-based methodologies offer a more succinct avenue. This stems from the fact that chemical compounds can be effectively represented through chemical languages such as the Simplified Molecular Input Line Entry System (SMILES)9 or SELFES,10 which mirror the structure of natural language. Consequently, a plethora of scholarly works on molecular design have proposed frameworks based on recurrent neural networks (RNNs) or transformer.11 In 2016, ChemVAE amalgamated variational autoencoders (VAEs) with BO to explore the latent state space in search of molecules with desired attributes.12 In 2017, Olivecrona et al. harnessed reinforcement learning to fine-tune the generation process of RNN-based molecules, yielding structures similar to specified ones or possessing predetermined activities.13 In 2021, Wang et al. shifted towards a transformer decoder instead of RNNs for generation, combining knowledge distillation and reinforcement learning to develop MCMG.14 Over time, a variety of sequence-based molecular generation methodologies have emerged.15–26
However, 2D molecular generation exhibits a significant limitation since these techniques neglect the crucial 3D structural complementarity between protein pockets and molecules. Given the pivotal role of ligand–protein conformational selection in drug design, evaluating such complementary features requires an understanding grounded in the intrinsic 3D structures of protein pockets and molecules. Consequently, there has been emerging interest in 3D structure-based molecular generation.
With the advent of deep geometric learning, numerous studies on autoregressive 3D molecular generation have surfaced. For instance, Gebauer introduced G-SchNet,27 an autoregressive deep neural network that generates diverse small organic molecules by sequentially positioning atoms in Euclidean space. Subsequently, models such as LiGAN,28 GraphBP,29 SBDD,30 and Pocket2Mol31 have been developed to directly generate molecules within pockets.32–35 However, autoregressive methodologies are susceptible to error accumulation, which has spurred the exploration of diffusion-based 3D molecular generation approaches.36–39 These methods facilitate the simultaneous generation of entire molecules, rather than sequentially producing atoms. So far, these strategies have yet to effectively capture the distribution of chemical bonds, leading to the creation of impractical molecular structures.
The 3D molecular generation methods discussed above primarily rely on GNN. While language models (LMs) effectively extract abundant 2D molecular insights from extensive drug-like datasets during 2D molecular design, their ability to represent continuous 3D molecular architectures remains limited. However, with the recent proliferation of large-scale language models (LLMs), numerous studies suggest that LLMs can adeptly acquire continuous numerical representations. Born et al. presented the Regression Transformer,40 which accomplishes unified regression and prediction tasks by encoding numerical values as tokens. The approach demonstrates the potential of Transformers for regression tasks by encoding numerical values as tokens, though it faces limitations in capturing complex 3D molecular structures and may be inefficient when scaled to larger datasets. Furthermore, Flam-Shepherd et al. utilized Cartesian coordinates xyz token to represent the 3D structures of molecules.41 However, this approach may struggle with generating physically plausible conformations due to the discrete nature of tokenization, and may encounter difficulties in maintaining chemical validity. Both methodologies have shown promising effectiveness in their respective drug design endeavors. Recently, Feng et al. proposed Lingo3DMol,42 a fragment-based LM-centric 3D molecular generation model, which exhibits promising performance surpassing that of graph network-based models during their benchmark assessments.
The aforementioned endeavors highlight the adeptness of LMs in discerning intricate details pertaining to the inherent 3D structural characteristics of molecules. Compared to the complex diffusion and GNN-based methods for molecular generation, autoregressive approaches grounded in LMs offer simpler and more efficient training processes. Moreover, token-only paradigms seamlessly integrate with existing universal LLMs. Consequently, we explore the feasibility of employing a simpler and more explicit method to delineate the structural features of molecules and protein pockets. Based on this understanding, we confirm that large language models (LLMs) can effectively capture and utilize positional information related to molecules and protein pockets, enabling them to generate new molecular structures within target pockets. Herein, we present 3DSMILES-GPT, an innovative token-only framework designed for explicit 3D molecular generation, firmly rooted in LLM. As shown in Fig. 1, the architectural blueprint of 3DSMILES-GPT centers on a transformer decoder. By framing the task of generating 2D and 3D structures as a natural language generation endeavor, 3DSMILES-GPT encodes atomic 3D coordinates as tokens, facilitating the acquisition of molecular 2D and 3D information. To maximize the intrinsic capabilities of LMs, our methodology begins with the pretraining phase of 3DSMILES-GPT using an extensive dataset with drug-like molecules. After fine-tuning on a specified protein–ligand dataset, we integrate surface atomic coordinates from pockets along with ligand molecules. Furthermore, to enhance the model's ability to extract information from protein pockets, we introduce a protein encoder as a detachable modular component. Additionally, the application of reinforcement learning methodologies enables the refinement of generated molecules across a diverse range of properties. The experimental results demonstrate that, compared to existing state-of-the-art (SOTA) methodologies, 3DSMILES-GPT achieves optimal performance across 8 out of 10 benchmark metrics including bioactivity, drug-likeness, and synthetic accessibility. Moreover, the targeted case studies on 5 distinct protein targets further elucidate its efficacy in generating drug-like molecules with robust binding strength in practical scenarios.
For molecules bound within protein pockets, each type of bond lengths exhibits a robust distribution due to the constrained degrees of freedom resulting from lower flexibility. These bond lengths do not vary significantly across different constrained pockets. To evaluate the performance, we compared the distributions of some common bond lengths between the generated molecules and training molecules across various types of chemical bonds, using the Jensen–Shannon divergence (JSD)43 as a quantification metric. We categorized the data based on the types of chemical bonds to visualize the distributions of bond lengths (Fig. 2a). It shows that our method maintains a most balanced overall performance with no significant weaknesses across all types of bonds, namely, no JSD values smaller than 0.4. In the first two groups shown in Fig. 2a, which include bonds with carbon atoms that form the basic skeleton of drug molecules, 3DSMILES-GPT generally achieved suboptimal results, slightly inferior to TargetDiff, a SOTA method based on diffusion models. Positively, our model achieved the best results for most chalcogen-related chemical bonds. Furthermore, when considering other types of chemical bonds, we found that the bond length quality, as measured by JSD, is consistently distributed between 0.4 and 0.6, indicating low deviation and controllable variation across different bond types. These bonds occur less frequently in drug molecules compared to other categories, but they remain crucial in key functional groups such as nitro and sulfonic groups, which are often found in antibacterial drugs. The success of 3DSMILES-GPT in these bonds is likely attributable to the GPT model's capability to retain and recall information from scarce samples, allowing it to accurately reproduce the relative positions of atoms in molecules containing these relatively rare groups.
As shown in Table S1† in more detail, our model achieves the best performance in over one-third of the bond length types and demonstrates comparably close to or superior performance in other bond length types compared to other models. On a global scale, our model's predictive capability for bond lengths in pocket generation tasks still lags behind TargetDiff, this discrepancy may be attributed to potential forgetting phenomena in transfer learning. The results indicate that, when compared to other methods specifically designed for pocket generation tasks utilizing GNNs or language models, the performance of our model is essentially equivalent and, in some instances, even superior. This finding underscores that using atomic coordinates as tokens for prediction can effectively reproduce the distributions of molecular bonds.
For a broader assessment, we employed PoseBusters,44 a suite designed to examine the physical and chemical inconsistencies in docking and molecular generation. It offers diverse metrics for inspecting potential errors in molecular conformations. Thus, we aim for our generated molecules to achieve high pass rates across those all metrics evaluating validity, sub-structure and stereochemistry plausibility, rather than excelling solely in specific ones. As shown in Fig. 2b and Table S2,† our model consistently achieves over an 85% pass rate across multiple metrics, indicating that the majority of generated molecules adhere to the physical and chemical plausibility as observed in natural states. In contrast, other models such as Lingo3DMol and TargetDiff, while achieving optimal performance in individual metrics, exhibit subpar performance in certain specific metrics like bond angles or steric clashes, with pass rates ranging from 50% to 70%. Compared to Pocket2Mol, our model performs almost equally well across multiple metrics, achieving a pass rate of over 90% in various independent metrics. Furthermore, the bond length analysis using PoseBusters supports the findings of the bond length analysis based on JSD. It is evident that both 3DSMILES-GPT and TargetDiff achieved nearly 100% pass rates in bond length metrics, while other models only exceeded 80%. This demonstrates that the majority of molecules generated by 3DSMILES-GPT have bond lengths within an acceptable range, consistent with the results presented in Fig. 2a, where the majority of chemical bonds relative to the reference molecule exhibited JSD values evenly distributed within the desirable range of 0.4–0.6.
In addition, while Pocket2Mol generated molecules with low molecular weights, molecules generated by 3DSMILES-GPT whose molecular weights were closer to the reference (reported real active molecules), thereby better aligning with the requirements of real-world drug discovery scenarios (Table 1). Consequently, our superior performance across various metrics and the overall higher pass rate with PoseBusters underscore the robustness and applicability of our approach. In conclusion, 3DSMILES-GPT demonstrates commendable performance in generating molecular conformations.
Metrics | Ref. | Pocket2Mol | TargetDiff | Lingo3DMol | 3DSMILES-GPT |
---|---|---|---|---|---|
Mean Vina score (↓) | −7.45 | −7.15 | −7.11 | −7.68 | −7.72 |
Mean QED (↑) | 0.48 | 0.57 | 0.57 | 0.26 | 0.76 |
Mean SAS (↓) | 3.43 | 3.16 | 4.33 | 4.51 | 3.07 |
Drug-like molecules % (↑) | 74% | 94% | 81% | 30% | 100% |
Mol size | 22.75 | 17.74 | 22.65 | 40.68 | 23.71 |
Mol weight | 332.35 | 241.25 | 298.41 | 480.50 | 329.10 |
Validity (↑) | — | 1.00 | 0.97 | 0.99 | 0.99 |
Diversity (↑) | — | 0.96 | 0.96 | 0.92 | 0.89 |
BR % (↑) | — | 48% | 48% | 58% | 53% |
BR-QED (↑) | — | 0.56 | 0.59 | 0.27 | 0.76 |
BR-SAS (↓) | — | 3.52 | 4.78 | 4.51 | 3.10 |
Time/s (↓) | — | 13.63 | 12.19 | 1.32 | 0.45 |
Molecules with large size are more likely to occupy protein pockets, leading to higher Vina scores. This phenomenon emphasizes the importance of considering the physicochemical properties of generated molecules comprehensively. As demonstrated by Feng et al.,42 certain large, multi-ring structured molecules are unsuitable for many cases of drug development. Therefore, a thorough evaluation of the generated molecules is essential. An examination of Table 1 indicates that the molecules generated by our model align more closely with authentic molecules in terms of molecular weight compared to other baseline methods. Regarding molecular size, TargetDiff closely resembles authentic molecules, exhibiting similar characteristics with our model. Conversely, the molecules generated by Pocket2Mol and Lingo3Dmol display undersized and oversized dimensions, respectively, in both molecular size and weight.
Subsequent analyses involved a comparison of the SAS and QED of the generated molecules. As shown in Table 1, our model demonstrates significant advantages in both QED and SAS metrics compared to alternative pocket-aware molecular generation approaches. Notably, our model exhibits an approximate 33% improvement in QED performance over the top-performing baseline, Pocket2Mol. This highlights the superior drug-like characteristics of the molecules generated by 3DSMILES-GPT, thereby enhancing their potential pharmaceutical utility. Furthermore, when compared to other baseline models, the molecules generated by 3DSMILES-GPT also demonstrate higher SAS values, indicative of improved synthetic feasibility. Additionally, in comparison to other methodologies, our approach demonstrates a superior molecular generation speed, of only 0.45 s per generation, as evaluated using an NVIDIA Tesla V100 GPU.
To further explore the interplay among the QED, SAS, and various other molecular properties, we utilized heatmap visualization (Fig. 3). Fig. 3a illustrates the relationship between the quantity of molecules and their corresponding QED and SAS values. Notably, a substantial proportion of molecules generated by 3DSMILES-GPT cluster in the bottom-left quadrant, indicating a prevalence of molecules exhibiting heightened QED and diminished SAS compared to other models. Furthermore, we investigated the influence of molecular weight on this relationship, revealing that the molecules generated by Pocket2Mol, characterized by elevated QED and reduced SAS, tend to possess smaller molecular weights (Fig. 3b). Subsequently, we explored the correlation between the Vina scores and QED/SAS. As depicted in Fig. 3c, the notably lighter coloration in the bottom-left quadrant for 3DSMILES-GPT indicates lower Vina scores and, consequently, reduced binding energies.
In our endeavor to create molecules, our objective is to generate compounds with properties that surpass those of currently available ones. To assess the model's effectiveness in achieving this objective, we thoroughly examined the molecules generated by each model, using the ground truth molecules from the test set as a reference point. We refer to results where the affinity is superior to the reference molecules as “Better than References” (BR). The analysis revealed that our model exhibits reduced diversity compared to others, a finding that aligns with our initial expectations. This decline in diversity can be attributed to the imposition of constraints related to physicochemical properties during the training phase of 3DSMILES-GPT, coupled with additional restrictions on QED and logP during the process of molecule generation, ultimately resulting in a lower diversity of generated molecules.
We quantified the number of the molecules generated by 3DSMILES-GPT that achieved lower Vina scores compared to the reference molecules. Notably, 53% of the molecules generated by 3DSMILES-GPT exhibited lower Vina scores compared to the reference molecules, while Pocket2Mol and TargetDiff achieved 48%. However, although the molecules generated by Lingo3DMol often exhibit higher affinity than the reference molecules, they tend to cluster within a narrow range according to the Vina scores (Fig. 3). This might be suitable for certain drug discovery tasks, but if the reference molecules generally have high affinity, such as with kinase targets like ATK1 and CDK2, the proportion of BR molecules may decrease. On the other hand, 3DSMILES-GPT shows the ability to explore chemical space with higher affinity for the target, which is an advantage of our model. We also computed the average QED and SAS of molecules from the BR set (Table 1), consistently demonstrating that 3DSMILES-GPT keeps superior performance, particularly in QED, with an improvement of approximately 33%.
In summary, 3DSMILES-GPT demonstrates the capability to generate molecules with higher binding affinity and improved QED and SAS metrics. Notably, it can produce conformations comparable to those obtained through redocking without requiring the redocking process, an outstanding capability of direct generation with physical conformation, which is absent in other models.
Beyond affinity, the purpose of generating molecules within specific target pockets also involves comparing the structural and binding mode similarity of the generated molecules to known ligands. As shown in Fig. 5, we selected the highest affinity molecules generated by each model to demonstrate their binding modes. Among the baseline models, Pocket2Mol and TargetDiff tend to generate either simple aromatic ring derivatives or complex macrocyclic compounds, lacking distinct target specificity. The molecules generated by Lingo3DMol are excessively large compared to the original ligands, consistent with the average molecular weight of 480 Daltons obtained in the test set results. Furthermore, in the case of molecule within the CDK2 pocket, there are issues of fragmentation and clash with the protein pocket, caused by large molecule size and unreasonable bond angles (Fig. S1†). Compared to other baseline models, the molecules generated by 3DSMILES-GPT exhibit more reasonable structures, and their binding modes within the pocket are relatively close to those of the original ligands. However, in the DDR1 pocket, the binding modes of the molecules are more exterior compared to that of the original ligand, which may explain why 3DSMILES-GPT does not show a significant advantage in affinity for the DDR1 target over other baseline models.
Fig. 5 The binding modes of the molecules with optimal affinity generated by each model for specific targets, and the comparison with the binding modes of the original ligands. |
Overall, the tests on specific target pockets demonstrated that 3DSMILES-GPT, compared to previous baseline models, can generate more molecules that meet drug-likeness criteria with high affinity for the target, while also exhibiting more reasonable structures and higher structural specificity for different targets.
(1) |
pij = {G(A(dij, tij; u, v), μs, σs)|s ∈ [1, D]} | (2) |
(3) |
A(d, t; u, v) = utd + vt | (4) |
Therefore, the information of i and j can be computed as:
qij = W1GELU(W2pij) | (5) |
In the present context, Pθ(M|C) denotes the initial policy governing our model, with C representing the protein pocket and M signifying the molecule. Simultaneously, D0 stands for the initial fine-tuning dataset. At each iteration, K molecules are sampled for every protein pocket, with those demonstrating favorable properties being incorporated into the fine-tuning dataset Dt at the t time step for subsequent iterations. Throughout each fine-tuning iteration, policy gradient methods are employed to iteratively refine the policy Pθ(M|C) of the model.
For fine-tuning, the CrossDocked202052 dataset was employed following the Pocket2mol methodology, with poses featuring RMSD greater than 2 Å discarded. A 6 Å residue perimeter surrounding the ligand was isolated as pocket data, and the coordinates of the pocket surface were computed utilizing the MSMS.53 The resultant coordinates underwent sparsification via pymesh.
Further data processing was conducted to meet the model's input requirements. Initially, the QED and logP (partition coefficient) values were computed. Molecules with QED exceeding 0.5 or logP values falling between −1 and 3 were assigned a label of 1, while those outside these ranges were labeled as 0. During the fine-tuning phase, we employed a similar approach to label the training data and augmented it with Vina score labels. Molecules with a Vina score less than −0.75 were labeled as 1, while those with a score equal to or greater than −0.75 were labeled as 0.
For the 2D molecular structure representation, SMILES notation was utilized, with SMILES sequences encoded at the character byte level instead of byte-level tokenization.54 The initial vocabulary comprised 72 characters extracted from the SMILES alphabet. Following tokenization, it was segmented into 1000 most commonly encountered tokens.
To address 3D molecular structures, data augmentation techniques were employed to instill three-dimensional equivariance into the model. This entailed random translation and rotation of the 3D structures, with each coordinate represented by a distinct token. For the details of data augmentation, each molecule undergoes random rotations around the X, Y, and Z axes, with the rotation angles uniformly sampled from 0° to 360° independently for each axis. This approach ensures that the model is exposed to molecules in all possible orientations, thereby promoting rotational invariance. Moreover, random translations are applied to the molecular coordinates, using translation vectors uniformly sampled within a range of −10 Å to 10 Å along each axis. This specific range was selected to introduce subtle positional variations without significantly displacing the molecules from their original locations. To preserve the consistency of the augmented dataset, the same augmentation techniques are uniformly applied to all molecules, maintaining the original distribution of molecular properties.
(6) |
This objective was accomplished by iteratively refining the loss function through gradient descent until convergence.
In the generation stage, we combined the processed protein pocket information with the specified molecular properties, forming the input for the model. The autoregressive process for generating smiles and coordinates strings follows eqn (7):
(7) |
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4sc06864e |
‡ Equivalent authors. |
This journal is © The Royal Society of Chemistry 2025 |