Mining patents with large language models elucidates the chemical function landscape

Clayton W. Kosonocky; Claus O. Wilke; Edward M. Marcotte; Andrew D. Ellington

doi:10.1039/D4DD00011K

Clayton W. Kosonocky,^a Claus O. Wilke,^b Edward M. Marcotte^ac and Andrew D. Ellington*^ac

Author affiliations

* Corresponding authors

^a Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78705, USA

^b Department of Integrative Biology, University of Texas at Austin, Austin, TX 78705, USA

^c Center for Systems and Synthetic Biology, University of Texas at Austin, Austin, TX 78705, USA

Abstract

The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule–function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.

Digital Discovery
