Clayton W. Kosonocky a, Claus O. Wilke b, Edward M. Marcotte ac and Andrew D. Ellington *ac
a Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78705, USA
b Department of Integrative Biology, University of Texas at Austin, Austin, TX 78705, USA
c Center for Systems and Synthetic Biology, University of Texas at Austin, Austin, TX 78705, USA
First published on 7th May 2024
The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule–function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
Humans have long been documenting chemicals and their effects, and it is reasonable to assume functional relationships are embedded in language itself. Text-based functional analysis has been paramount for our understanding of the genome through Gene Ontology terms.7 Despite its potential, text-based functional analysis for chemicals has been largely underexplored. This is in part due to the lack of high-quality chemical function datasets but is more fundamentally due to the high multi-functionality of molecules, which is less problematic for genes and proteins. High-quality chemical function datasets have been challenging to generate due to the sparsity and irregularity of functional information in chemical descriptions, patents, and literature. Recent efforts at creating such datasets tend to involve consolidation of existing curated descriptive datasets.8–12 Similarly, keyword-based function extraction partially solves the function extraction problem by confining its scope to singular predetermined functionality, but it fails at broadly extracting all relevant functions for a given molecule.13 Given their profound success in text summarization, Large Language Models (LLMs) may be ideal candidates to broadly extract functional information of molecules from patents and literature, a task that remains underexplored.14–16 This is especially promising for making use of the chemical patent literature, an abundant and highly specific source of implicit chemical knowledge that has been largely inaccessible due to excessive legal terminology.17,18 LLMs have been used in this way to help evaluate functional relevance of results from a machine learning-based chemical similarity search.19 This may allow for the creation of a large-scale dataset that effectively captures the text-based chemical function landscape.
We hypothesize that a sufficiently large chemical function dataset would contain a text-based chemical function landscape congruent with chemical structure space, effectively approximating the actual chemical function landscape. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners.20 This hypothesis is further based on the observation that function is reported frequently enough in patents and scientific articles for most functional relationships to be contained in the corpus of chemical literature.21 To evaluate this hypothesis, we set out to create a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule–function pairs, was created using an LLM- and embedding-based method to obtain functional labels for approximately 100 K molecules from their corresponding 188 K unique patents. The CheF dataset was found to be of high quality, demonstrating the effectiveness of LLMs for extracting functional information from chemical patents despite not being explicitly trained to do so. Using this dataset, we carry out a series of experiments suggesting that the CheF dataset contains a text-based functional landscape that approximates the actual chemical function landscape, as shown by its congruence with chemical structure space. We then demonstrate through several examples that this text-based functional landscape can be harnessed to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
The patent title, abstract, and first 3500 characters of the description were summarized into concise functional labels using ChatGPT (gpt-3.5-turbo as of July 15th, 2023, with no further fine-tuning), chosen for its low cost and high speed. The cost per molecule was $0.005. The first 3500 characters of the description were included because the start of the patent description typically contains relevant background, mechanistic information, and/or a summary of the claim. Responses from ChatGPT were converted into sets of labels and linked to their associated molecules. Summarizations were cleaned, split into individual words, converted to lowercase, and converted to singular if plural. Single-character labels were removed. The cleaned dataset contained 29854 unique labels for 99182 molecules. Fetching patent information and summarizing with ChatGPT, this method's bottleneck, took 6 s per molecule with 16 CPUs in parallel. This could be sped up to 3.9 s by summarizing per-patent rather than per-molecule to avoid redundant summarizations, and even further to 2.6 s by using only US and WO patents.
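The cleaning steps described above (split into words, lowercase, singularize, drop single-character labels) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `singularize` and `clean_labels` are hypothetical, and the plural-stripping rule is a deliberately naive assumption that a real pipeline would likely replace with a proper inflection library.

```python
import re
from typing import Set


def singularize(word: str) -> str:
    """Naive plural stripping (assumption: the source does not say how plurals were handled)."""
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def clean_labels(summary: str) -> Set[str]:
    """Turn one free-text LLM summary into a set of cleaned single-word labels."""
    # Split on anything that is not a letter, digit, or hyphen (keeps terms like "5-ht").
    tokens = re.split(r"[^A-Za-z0-9\-]+", summary.lower())
    labels = {singularize(tok.strip("-")) for tok in tokens}
    # Drop empty strings and single-character labels.
    return {lab for lab in labels if len(lab) > 1}


if __name__ == "__main__":
    print(sorted(clean_labels("Antiviral agents; HCV protease inhibitors, A")))
```

Returning a set per summary mirrors the description of responses being "converted into sets of labels", so repeated words within one summary collapse to a single label.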
To consolidate labels by semantic meaning, the vocabulary was embedded with OpenAI's text-embedding-ada-002 and clustered to group labels by embedding similarity. DBSCAN clustering was performed on the embeddings with a sweeping epsilon.25 The authors chose the epsilon that gave the minimum number of clusters without quality degradation (e.g., avoiding the merging of antiviral, antibacterial, and antifungal). The optimal epsilon was 0.34 for the dataset herein, consolidating the vocabulary from 29854 to 20030 labels. Representative labels for each cluster were created using GPT-4. The labels from a very large cluster of only IUPAC structural terms were removed to reduce non-generalizable labels. Despite this, some structural terms remained, which correspond either to receptor names (e.g., ATP, 5-HT) or to chemical moieties (e.g., aryl, azetidine). Labels appearing in <50 molecules were dropped to ensure sufficient predictive power. Single-character labels were then dropped. This resulted in a 99182-molecule dataset with 1522 unique functional labels, deemed the Chemical Function (CheF) dataset.
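The epsilon sweep described above can be illustrated with a small, self-contained sketch. Several things here are assumptions rather than details from the text: the pure-Python `dbscan` is a stand-in for whatever DBSCAN implementation was actually used, cosine distance and `min_samples=2` are illustrative choices, and the two-dimensional vectors are made-up stand-ins for real label embeddings. The sketch only reports the vocabulary size at each epsilon; the choice of cutoff was made by the authors through inspection of cluster quality.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def dbscan(points: List[Sequence[float]], eps: float, min_samples: int = 2) -> List[int]:
    """Minimal DBSCAN; returns one cluster id per point, with -1 marking noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if cosine_distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels: List = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, keep expanding
    return labels


def vocabulary_size(cluster_ids: List[int]) -> int:
    """Each cluster becomes one label; each noise point keeps its own label."""
    return len({c for c in cluster_ids if c != -1}) + cluster_ids.count(-1)


# Hypothetical 2-D "embeddings" for five labels (real embeddings are high-dimensional).
toy = {
    "antiviral": (1.0, 0.0),
    "anti-viral": (0.99, 0.05),
    "antibacterial": (0.8, 0.6),
    "anti-bacterial": (0.78, 0.62),
    "oled": (0.0, 1.0),
}

if __name__ == "__main__":
    vectors = list(toy.values())
    for eps in (0.01, 0.25, 0.5):
        print(eps, vocabulary_size(dbscan(vectors, eps)))
```

In this toy sweep, a small epsilon merges only spelling variants, while a larger one starts merging antiviral with antibacterial, which is the kind of quality degradation the cutoff is meant to avoid.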
For each molecule–associated patent in the CheF dataset, the patent title, abstract, and description were scraped from Google Patents and cleaned. ChatGPT (gpt-3.5-turbo) was used to generate 1–3 functional labels describing the patented molecule given its unstructured patent data (Fig. 1 and Table S3†). The LLM-assisted function extraction method's success was validated manually across 1738 labels generated from a random 200 CheF molecules. Of these labels, 99.6% had correct syntax and 99.8% were relevant to their respective patent. In the SureChEMBL database, molecules can be linked to patents in which they serve as intermediates to the final patented molecule. Because of this, 77.9% of the labels correctly describe the labeled molecule's function. However, if considering associations through synthesis, then 98.2% of the molecules are correctly described by their functional labels. This shows that the deviation from near-perfect accuracy is due to the molecule–patent associations rather than the ChatGPT-assisted functional extraction.
Fig. 1 Chemical function dataset creation. LLM extracts molecular functional information present in patents into concise labels; see Fig. S2† for an example. Chemical functional labels were then cleaned with algorithmic-, embedding-, and LLM-based methods.
The LLM-assisted method resulted in 104607 functional labels for the 100 K molecules (see Fig. S3† for the top terms). These were too many labels to yield any predictive power, so measures were taken to consolidate these labels into a concise vocabulary. The labels were cleaned, reducing the number of labels to 29854, and further consolidated by embedding each label with a language model (OpenAI's text-embedding-ada-002) to group grammatically dissimilar yet semantically similar labels together. The embeddings were clustered with DBSCAN using a cutoff that minimized the number of clusters without cluster quality deterioration (e.g., avoiding the grouping of antiviral, antibacterial, and antifungal) (Fig. S4†). Each cluster was summarized with GPT-4 to obtain a single representative cluster label.
The embedding-based clustering and summarization process was validated across the 500 largest clusters. Of these, 99.2% contained semantically common elements and 97.6% of the cluster summarizations were accurate and representative of their constituent labels. These labels were mapped back to the CheF dataset, resulting in 19616 labels. To ensure adequate predictive power, labels appearing in fewer than 50 molecules and labels with only a single character were dropped. The final CheF dataset consisted of 99182 molecules and their 1522 descriptive functional labels. A comparison to similar datasets is made in Table 1, outlining the unique scalability of the CheF dataset.
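The final filtering step (dropping labels that appear in fewer than 50 molecules and single-character labels) amounts to a frequency count over the molecule-to-label mapping. A minimal sketch, assuming the data is held as a dictionary from molecule identifier to a list of labels; the function name `filter_labels` and the decision to keep molecules whose label list becomes empty are illustrative assumptions, not details from the text.

```python
from collections import Counter
from typing import Dict, List


def filter_labels(mol_labels: Dict[str, List[str]], min_molecules: int = 50) -> Dict[str, List[str]]:
    """Keep only labels that occur in at least `min_molecules` molecules and have >1 character."""
    # Count each label once per molecule, even if it is listed more than once.
    counts = Counter(lab for labs in mol_labels.values() for lab in set(labs))
    keep = {lab for lab, n in counts.items() if n >= min_molecules and len(lab) > 1}
    # Assumption: molecules are retained even if none of their labels survive.
    return {mol: sorted(set(labs) & keep) for mol, labs in mol_labels.items()}


if __name__ == "__main__":
    toy = {
        "m1": ["antiviral", "hcv"],
        "m2": ["antiviral", "x"],
        "m3": ["x", "oled"],
    }
    # A threshold of 2 is used only so the toy data shows the effect.
    print(filter_labels(toy, min_molecules=2))
```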
Table 1 Comparison of CheF to existing chemical-text datasets ChEBI and ChemFOnt8,9 by current size (# molecules), maximum automated scale-up size (# molecules), text type, whether or not structure and function are separate in the text, and the data source used for dataset construction. Both ChEBI and ChemFOnt were built from existing datasets with additional manual curation and annotation, limiting potential automated scale-up size. In contrast, the method used to build CheF scales readily, allowing for a potential dataset size of 32M+ molecules.
Dataset | Current size (# molecules) | Max. automated scale-up size (# molecules) | Text type | Structure/function separate | Data source |
---|---|---|---|---|---|
ChEBI | 103 K | 103 K+ | Long text | No | DB agg./manual |
ChemFOnt | 342 K | 1M+ | Labels | Yes | DB agg./manual |
CheF (ours) | 100 K | 32M+ | Labels | Yes | LLM-sum. patents |
To evaluate this hypothesis, we embedded the CheF dataset in structure space by converting the molecules to daylight molecular fingerprints (binary vectors representing a molecule's substructures), visualized with t-distributed Stochastic Neighbor Embedding (t-SNE) (Fig. 2 and S5†).23 Then, to determine if the CheF functional labels clustered in this structural space, the maximum fingerprint Tanimoto similarity was computed between the fingerprint vectors of each molecule containing a given label; this approach provides a measure of structural similarity between molecules that have the same functional label.29 This value was compared to the maximum similarity computed from a random equal-sized set of molecules to determine significance. Remarkably, 1261 of the 1522 labels were found to cluster significantly in structural space (independent t-tests per label, false-discovery rate of 5%). To give an idea of the meaning of this correlation, inherent clustering was visualized for the labels ‘hcv’ (hepatitis C virus), ‘electroluminescence’, ‘serotonin’, and ‘5-HT’ (5-hydroxytryptamine, the chemical name for serotonin) (Fig. 2). For the label ‘electroluminescence’, there was one large cluster containing almost only highly conjugated molecules. For ‘hcv’, there were multiple distinct communities representing antivirals targeting different mechanisms of HCV replication (Fig. S6†). Clusters were observed for NS5A inhibitors, NS3 macrocyclic and peptidomimetic protease inhibitors, and nucleoside NS5B polymerase inhibitors. The observed clustering of functional labels in structure space provided evidence that the CheF dataset labels had accurately captured structure–function relationships, validating our initial hypothesis.
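The per-label clustering test described above rests on two pieces: the Tanimoto similarity between two binary fingerprints, and, for each molecule in a set, its maximum similarity to any other molecule in that set. The sketch below illustrates both under stated assumptions: fingerprints are represented as sets of on-bit indices, the names `max_similarities` and `label_vs_random` are hypothetical, and the statistical comparison itself (independent t-tests with false-discovery-rate control, as reported in the text) is left out.

```python
import random
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two binary fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_similarities(fps: List[FrozenSet[int]]) -> List[float]:
    """For each fingerprint, its highest similarity to any *other* fingerprint in the set."""
    return [
        max(tanimoto(fp, other) for j, other in enumerate(fps) if j != i)
        for i, fp in enumerate(fps)
    ]


def label_vs_random(
    fingerprints: Dict[str, FrozenSet[int]], labeled_ids: List[str], seed: int = 0
) -> Tuple[List[float], List[float]]:
    """Max similarities within the labeled set and within a random equal-sized control set."""
    rng = random.Random(seed)
    control_ids = rng.sample(sorted(fingerprints), len(labeled_ids))
    labeled = [fingerprints[m] for m in labeled_ids]
    control = [fingerprints[m] for m in control_ids]
    return max_similarities(labeled), max_similarities(control)


if __name__ == "__main__":
    fps = {
        "m1": frozenset({1, 2, 3}),
        "m2": frozenset({1, 2, 3, 4}),
        "m3": frozenset({9}),
        "m4": frozenset({7, 8}),
    }
    within, control = label_vs_random(fps, ["m1", "m2"])
    print(within, control)
```

If a label clusters in structure space, the within-label values should sit noticeably above the control values; the two lists returned here are what a significance test would then compare.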
Fig. 2 Text-based functional labels cluster in structural space. For each of the labels “hcv”, “electroluminescence”, “serotonin”, and “5-HT”, molecules in the CheF dataset were mapped by their molecular fingerprints and colored based on whether the selected label was present in their set of functional descriptors. The max fingerprint Tanimoto similarity was computed between the fingerprint vectors of each molecule containing a given label and was compared against the max fingerprint Tanimoto similarity from a random equal-sized set of molecules to determine significance relative to a random control. Many of the labels strongly cluster in structural space, demonstrating that CheF accurately captures structure–function relationships. See Fig. S5† for examples with more labels.
Fig. 3 Label co-occurrences reveal the text-based chemical function landscape. Node sizes correspond to number of connections, and edge sizes correspond to co-occurrence frequency in the CheF dataset. Modularity-based community detection was used to obtain 19 distinct communities. The communities broadly coincided with the semantic meaning of the contained labels, the largest 10 of which were summarized to representative categorical labels (Tables S4–S6†).
Modularity-based community detection isolates tightly interconnected groups within a graph, distinguishing them from the rest of the graph. This method was applied to the label co-occurrence graph, with the resulting clusters summarized with GPT-4 into representative labels for unbiased semantic categorization (Tables S4–S6†). The authors curated the summarized labels for validity and found them representative of the constituent labels; these were then further consolidated for succinct representation of the semantic categorization (Table S4†). This revealed a semantic structure in the co-occurrence graph, where distinct communities such as ‘Electronic, Photochemical, & Stability’ and ‘Antiviral & Cancer’ could be observed (Fig. 3). Within communities, the fine-grained semantic structure also appeared to be coherent. For example, in the local neighborhood around ‘hcv’, the labels ‘antiviral’, ‘ns’ (nonstructural), ‘hbv’ (hepatitis B virus), ‘hepatitis’, ‘replication’, and ‘protease’ were found, all of which are known to be semantically relevant to hepatitis C virus (Fig. 3). The graph of patent-derived molecular functions is a visual representation of the text-based chemical function landscape and represents a potentially valuable resource for linguistic evaluation of chemical function and ultimately drug discovery.
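The label co-occurrence graph underlying this analysis can be built by counting, for every pair of labels, how many molecules carry both. The sketch below shows only that counting step and a lookup of a label's strongest neighbors; the function names are hypothetical, and the community detection itself is not reproduced here (a graph library routine such as NetworkX's `greedy_modularity_communities` would be one possible choice, which is an assumption since the text does not name the algorithm beyond "modularity-based").

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def cooccurrence(mol_labels: Dict[str, List[str]]) -> Counter:
    """Edge weights: number of molecules in which two labels appear together."""
    edges: Counter = Counter()
    for labs in mol_labels.values():
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(set(labs)), 2):
            edges[(a, b)] += 1
    return edges


def strongest_neighbors(edges: Counter, label: str, top: int = 10) -> List[Tuple[str, int]]:
    """Labels that most frequently co-occur with `label`."""
    weights: Counter = Counter()
    for (a, b), w in edges.items():
        if a == label:
            weights[b] += w
        elif b == label:
            weights[a] += w
    return weights.most_common(top)


if __name__ == "__main__":
    toy = {
        "m1": ["hcv", "antiviral", "protease"],
        "m2": ["hcv", "antiviral"],
        "m3": ["oled", "electroluminescence"],
    }
    edges = cooccurrence(toy)
    print(strongest_neighbors(edges, "hcv"))
```

Even in this toy example the "hcv" neighborhood is disconnected from the electroluminescence labels, which is the kind of separation that community detection formalizes on the full graph.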
Fig. 4 Coherence of the text-based chemical function landscape in structure space. To assess the alignment of text-based functional relationships with structural relationships, for each of the labels “hcv”, “electroluminescence”, “serotonin”, and “5-HT”, the max fingerprint Tanimoto similarity from each molecule containing a given label to each molecule containing any of its 10 most frequently co-occurring labels (<1000 total abundance) was compared against the max fingerprint Tanimoto similarity to a random subset of molecules of the same size. See Fig. S5† for examples with more labels.
This model can thus be used to comprehensively annotate chemical function, even when existing annotations are fragmented or incomplete. As an example, for a known hepatitis C antiviral the model strongly predicted ‘antiviral’, ‘hcv’, ‘ns’ (nonstructural) (94%, 93%, 70% respectively) while predicting ‘protease’ and ‘polymerase’ with low confidence (0.02%, 0.00% respectively) (Fig. 5b). The low-confidence ‘protease’ and ‘polymerase’ predictions suggested that the likely target of this drug was the nonstructural NS5A protein, rather than the NS2/3 proteases or NS5B polymerase, a hypothesis that has been validated outside of patents in the scientific literature.30
The ability to comprehensively predict functional profiles allows for the discovery of new drugs. For example, the label ‘serotonin’ was used to query the test set predictions, and a ranked list of the 10 molecules most highly predicted for ‘serotonin’ was obtained (Fig. 5c). All ten of these were patented in relation to serotonin: 8 were serotonin receptor ligands (5-HT1, 5-HT2, 5-HT6) and 2 were serotonin reuptake inhibitors. Similarly, the synonymous label ‘5-HT’ was used as the query and the top 10 molecules were again obtained (Fig. 5d). Of these, seven were patented in relation to serotonin (5-HT1, 5-HT2, 5-HT6), four of which were also found in the aforementioned ‘serotonin’ search. The remaining three molecules were patented without reference to the serotonin receptor, but were instead patented for depressant, anti-anxiety, and memory dysfunction relieving effects, all of which have associations with serotonin and its receptor. The identification of known serotonin receptor ligands, together with the overlapping results across synonymous labels, provides an internal validation of the model. Additionally, these search results suggest experiments to test whether the “mispredicted” molecules bind to serotonin receptors or are otherwise synergistic with the function of serotonin, thereby demonstrating the practical utility of moving with facility between chemicals and their functions.
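Querying predictions by label, as in the ‘serotonin’ and ‘5-HT’ searches above, is essentially a ranking over the model's per-label scores. A minimal sketch, assuming predictions are stored as a dictionary from molecule identifier to a label-probability dictionary; the names `top_molecules` and `overlap`, the molecule identifiers, and the scores are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Predictions = Dict[str, Dict[str, float]]


def top_molecules(predictions: Predictions, label: str, k: int = 10) -> List[Tuple[str, float]]:
    """Rank molecules by predicted probability for one label and return the top k."""
    scored = [(mol, probs.get(label, 0.0)) for mol, probs in predictions.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def overlap(predictions: Predictions, label_a: str, label_b: str, k: int = 10) -> Set[str]:
    """Molecules appearing in the top-k lists of both labels (e.g. two synonyms)."""
    a = {mol for mol, _ in top_molecules(predictions, label_a, k)}
    b = {mol for mol, _ in top_molecules(predictions, label_b, k)}
    return a & b


if __name__ == "__main__":
    # Made-up scores for three placeholder molecules.
    preds = {
        "mol_a": {"serotonin": 0.9, "5-HT": 0.8},
        "mol_b": {"serotonin": 0.2, "5-HT": 0.7},
        "mol_c": {"serotonin": 0.6, "5-HT": 0.1},
    }
    print(top_molecules(preds, "serotonin", k=2))
    print(overlap(preds, "serotonin", "5-HT", k=2))
```

Checking the overlap between synonymous labels is the same consistency check the text describes, where four molecules were shared between the two top-10 lists.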
To examine the best model's capability in drug repurposing, functional labels were predicted for 3242 stage-4 FDA approved drugs (Fig. S7†).31 Of the 16 drugs most highly predicted for ‘hcv’, 15 were approved Hepatitis C Virus (HCV) antivirals. Many of the mispredictions in the top 50 were directly relevant to HCV treatment, including 8 antivirals and 8 polymerase inhibitors. The remaining mispredictions included 3 ACE inhibitors and 2 BTK inhibitors, both of which are peripherally associated with HCV through liver fibrosis mitigation and HCV reactivation, respectively.32,33 Beyond showing that the model recovers known HCV drugs, this example suggests that functional label-guided drug discovery may serve as an additional approach for antiviral repurposing, which could help mitigate future pandemics.
To leverage the chemical function landscape for drug discovery, several models were trained and benchmarked on the CheF dataset to predict functional labels from molecular fingerprints (Table S7†). The top-performing model was utilized for practical applications such as unveiling an undisclosed drug mechanism, identifying novel drug candidates, and mining FDA-approved drugs for repurposing and combination therapy uses. Since the CheF dataset is scalable to the entire 32M+ SureChEMBL database, we anticipate that many of these predictions will improve as the dataset grows.
The CheF dataset inherently exhibits a bias toward patented molecules. This implies sparse representation of chemicals with high utility but low patentability and allows for false functional relationships to arise from prophetic claims. Additionally, by restricting the dataset to chemicals with <10 patents, it neglects important well-studied molecules like Penicillin. However, we found the impact of this omission to be negligible (Table S8†). The inclusion of over-patented chemicals, like those in Table S2†, could be accomplished through supplementation from other data sources like PubChem, PubMed, or International Patent Classification categories (Table S1†). These over-patented molecules could also be included through keyword filtering or by only using the most common terms for each molecule. Increasing label quality and ignoring extraneous claims might be achieved through an LLM fine-tuned on high-quality examples or through the organization of functional labels into an ontology. While it is possible that some of the representative terms created with GPT-4 capture hierarchical relationships, it is not guaranteed from the method used herein. Further quality increases may result from integration of well-documented chemical–gene and chemical–disease relationships from PubChem into CheF. As the scope of the manuscript lies with using LLMs to mine functionality from text, we leave dataset merging and supplementation to future work.
The CheF dataset was created from patented molecules. This includes the active molecules responsible for the patent's existence, but also derivatives that may or may not be active. Models trained on the CheF dataset are then learning a coarse-grained map of the chemical function landscape rather than a fine-grained map with activity cliffs. As such, we foresee CheF-trained models being used to annotate broad functionality at a high-level, capturing general chemical trends, rather than providing precise guarantees of activity.
Consideration of ML chemistry dual-use often focuses on the identification of toxic chemicals and drugs of abuse. To test the dual-use potential of CheF, functional labels for the chemical weapons VX and mustard gas were predicted from our model and were found to contain no obvious indications of malicious properties. In contrast, drugs of abuse were more easily identifiable, as the development of neurological compounds remains a lucrative objective. 5-MeO-DMT, LSD, fentanyl, and morphine all had functional labels of their primary mechanism predicted with moderate confidence. However, these same labels were also predicted for benign molecules, indicating that it may be quite challenging to intentionally discover novel drugs of abuse using the methods contained herein.
The analysis herein suggests that a sufficiently large chemical function dataset contains a text-based function landscape that approximates the actual chemical function landscape. Further, we demonstrate one of the first examples of functional label-guided drug discovery, made possible utilizing state-of-the-art advances in machine learning. Models in this paradigm have the potential to automatically annotate chemical function, examine non-obvious features of drugs such as side effects, and down-select candidates for high-throughput screening. Moving between textual and physical spaces represents a promising paradigm for drug discovery in the age of machine learning.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00011k
This journal is © The Royal Society of Chemistry 2024