Issue 6, 2024

Mining patents with large language models elucidates the chemical function landscape

Abstract

The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule–function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.

Graphical abstract: Mining patents with large language models elucidates the chemical function landscape

Supplementary files

Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article.

View this article’s peer review history

Article information

Article type
Paper
Submitted
18 Jan 2024
Accepted
03 May 2024
First published
07 May 2024
This article is Open Access
Creative Commons BY license

Digital Discovery, 2024,3, 1150-1159

Mining patents with large language models elucidates the chemical function landscape

C. W. Kosonocky, C. O. Wilke, E. M. Marcotte and A. D. Ellington, Digital Discovery, 2024, 3, 1150 DOI: 10.1039/D4DD00011K

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.

Read more about how to correctly acknowledge RSC content.

Social activity

Spotlight

Advertisements