3DSMILES-GPT: 3D molecular pocket-based generation with token-only large language model

Abstract

Generating three-dimensional (3D) molecules conditioned on target structures is a cutting-edge challenge in drug discovery. Many existing approaches produce molecules with invalid configurations, unphysical conformations, suboptimal drug-likeness, and limited synthesizability, and they require long generation times. To address these challenges, we present 3DSMILES-GPT, a fully language-model-driven framework for 3D molecular generation that uses tokens exclusively. We treat both two-dimensional (2D) and 3D molecular representations as linguistic expressions, combine them into full-dimensional representations, and pre-train the model on a dataset of tens of millions of drug-like molecules. This token-only approach enables the model to learn the 2D and 3D characteristics of molecules at large scale. We then fine-tune the model on paired structural data of protein pockets and molecules, and apply reinforcement learning to further optimize the biophysical and chemical properties of the generated molecules. Experimental results show that molecules generated by 3DSMILES-GPT outperform those of existing methods in binding affinity, drug-likeness (QED), and synthetic accessibility score (SAS). Notably, the model achieves a 33% improvement in QED while the binding affinity estimated by Vina docking remains state of the art. Generation is also fast, averaging approximately 0.45 seconds per generation, a threefold speed-up over the fastest existing methods. The 3DSMILES-GPT approach has the potential to positively impact the generation of 3D molecules in drug discovery.
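To make the token-only idea concrete, the following is a minimal sketch of how a 2D SMILES string and a set of 3D coordinates could be serialized into a single token sequence for a language model. It is an illustration only, not the authors' encoding: the abstract does not specify the vocabulary or the coordinate format, so the special tokens `<2D>`, `<3D>`, and `<EOS>`, the simplified SMILES pattern, the 0.1 Å rounding, and the function names are all assumptions.

```python
import re

# Simplified SMILES token pattern (an assumption, not the paper's tokenizer):
# bracket atoms, two-letter halogens, two-digit ring closures, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into atom, bond, branch, and ring-closure tokens."""
    return SMILES_TOKEN.findall(smiles)


def tokenize_coords(coords, decimals=1):
    """Turn (x, y, z) tuples into one text token per rounded coordinate value."""
    tokens = []
    for x, y, z in coords:
        tokens.extend(f"{v:.{decimals}f}" for v in (x, y, z))
    return tokens


def build_sequence(smiles, coords):
    """Concatenate 2D and 3D tokens; the marker tokens are hypothetical."""
    return (
        ["<2D>"] + tokenize_smiles(smiles)
        + ["<3D>"] + tokenize_coords(coords)
        + ["<EOS>"]
    )


# Illustrative heavy-atom coordinates for ethanol (made-up values, in angstroms).
sequence = build_sequence("CCO", [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.3, 0.0)])
print(sequence)
```

In a layout like this, the 2D topology and the 3D geometry share one vocabulary and one sequence, which is what allows an ordinary next-token objective to cover both.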

Article information

Article type: Edge Article
Submitted: 09 Oct 2024
Accepted: 03 Dec 2024
First published: 04 Dec 2024
This article is Open Access

All publication charges for this article have been paid for by the Royal Society of Chemistry
Creative Commons BY-NC license

Chem. Sci., 2025, Advance Article

J. Wang, H. Luo, R. Qin, M. Wang, X. Wan, M. Fang, O. Zhang, Q. Gou, Q. Su, C. Shen, Z. You, L. Liu, C. Hsieh, T. Hou and Y. Kang, Chem. Sci., 2025, Advance Article, DOI: 10.1039/D4SC06864E

This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence. You can use material from this article in other publications, without requesting further permission from the RSC, provided that the correct acknowledgement is given and it is not used for commercial purposes.
