3DSMILES-GPT: 3D molecular pocket-based generation with token-only large language model

Jike Wang; Hao Luo; Rui Qin; Mingyang Wang; Xiaozhe Wan; Meijing Fang; Odin Zhang; Qiaolin Gou; Qun Su; Chao Shen; Ziyi You; Liwei Liu; Chang-Yu Hsieh; Tingjun Hou; Yu Kang

doi:10.1039/D4SC06864E


Jike Wang,‡^a Hao Luo,‡^a Rui Qin,‡^a Mingyang Wang,^a Xiaozhe Wan,^b Meijing Fang,^a Odin Zhang,^a Qiaolin Gou,^a Qun Su,^a Chao Shen,^a Ziyi You,^a Liwei Liu,*^b Chang-Yu Hsieh,*^a Tingjun Hou*^a and Yu Kang*^a

Author affiliations

* Corresponding authors

^a College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
E-mail: yukang@zju.edu.cn, tingjunhou@zju.edu.cn, kimhsieh@zju.edu.cn

^b Advanced Computing and Storage Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd, Nanjing 210000, Jiangsu, China
E-mail: liuliwei5@huawei.com

Abstract

The generation of three-dimensional (3D) molecules based on target structures is a cutting-edge challenge in drug discovery. Many existing approaches produce molecules with invalid configurations, unphysical conformations, suboptimal drug-likeness, and limited synthesizability, and they require long generation times. To address these challenges, we present 3DSMILES-GPT, a fully language-model-driven framework for 3D molecular generation that uses tokens exclusively. We treat both two-dimensional (2D) and 3D molecular representations as linguistic expressions, combine them into full-dimensional representations, and pre-train the model on a dataset of tens of millions of drug-like molecules. This token-only approach enables the model to learn the 2D and 3D characteristics of molecules at scale. We then fine-tune the model on paired structural data of protein pockets and molecules, and apply reinforcement learning to further optimize the biophysical and chemical properties of the generated molecules. Experimental results demonstrate that the molecules generated by 3DSMILES-GPT comprehensively outperform those of existing methods in binding affinity, drug-likeness (QED), and synthetic accessibility score (SAS). Notably, it achieves a 33% improvement in QED while maintaining state-of-the-art binding affinity as estimated by Vina docking. Generation is also fast, averaging approximately 0.45 seconds per generation, a threefold speedup over the fastest existing methods. 3DSMILES-GPT has the potential to positively impact the generation of 3D molecules in drug discovery.
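To make the token-only idea concrete, the following is a minimal, hypothetical sketch of how a 2D SMILES string and a set of 3D coordinates could be written as one flat token sequence for a language model. It is not the authors' encoding: the special tokens (`<2d>`, `<3d>`, `<sep>`, `<eos>`), the one-decimal rounding, and the "SMILES first, coordinates second" ordering are all assumptions chosen for illustration, and the coordinates are made-up values rather than a real conformer.

```python
import re

# Commonly used pattern for splitting SMILES into chemically meaningful tokens
# (bracket atoms, two-letter halogens, single-letter atoms, bonds, ring digits).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|[=#$/\\+\-()%.:~@?*]|\d)"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable SMILES: {smiles!r}")
    return tokens

def tokenize_coords(coords, decimals=1):
    """Write each (x, y, z) tuple as rounded numeric text tokens plus a separator."""
    tokens = []
    for xyz in coords:
        tokens.extend(f"{value:.{decimals}f}" for value in xyz)
        tokens.append("<sep>")  # hypothetical per-atom separator
    return tokens

def to_sequence(smiles, coords):
    """Concatenate the 2D and 3D parts into one sequence (assumed layout)."""
    return (
        ["<2d>"] + tokenize_smiles(smiles)
        + ["<3d>"] + tokenize_coords(coords)
        + ["<eos>"]
    )

# Illustrative heavy-atom coordinates for ethanol (invented values).
ethanol = to_sequence("CCO", [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.1, 1.3, 0.0)])
print(" ".join(ethanol))
```

Because both the topology and the geometry end up as plain tokens, a standard autoregressive model can be trained on such sequences with ordinary next-token prediction; how the actual framework discretizes and orders coordinates is described in the full article, not in this abstract.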

This article is part of the themed collection: 2024 Chemical Science HOT Article Collection

Chemical Science
