Ya Chen*ᵃ, Cara Rosenkranzᵇ, Steffen Hirteᵃᶜ and Johannes Kirchmairᵃ
ᵃ Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, 1090 Vienna, Austria. E-mail: ya.chen@univie.ac.at
ᵇ Center for Bioinformatics (ZBH), Universität Hamburg, 20146 Hamburg, Germany
ᶜ Vienna Doctoral School of Pharmaceutical, Nutritional and Sport Sciences (PhaNuSpo), University of Vienna, 1090 Vienna, Austria
First published on 16th June 2022
Covering: up to 2021
The structural core of most small-molecule drugs is formed by a ring system, often derived from natural products. However, despite the importance of natural product ring systems in bioactive small molecules, there is still a lack of a comprehensive overview and understanding of natural product ring systems and how their full potential can be harnessed in drug discovery and related fields. Herein, we present a comprehensive cheminformatic analysis of the structural and physicochemical properties of 38662 natural product ring systems, and the coverage of natural product ring systems by readily purchasable, synthetic compounds that are commonly explored in virtual screening and high-throughput screening. The analysis stands out by the use of comprehensive, curated data sets, the careful consideration of stereochemical information, and a robust analysis of the 3D molecular shape and electrostatic properties of ring systems. Among the key findings of this study are the facts that only about 2% of the ring systems observed in NPs are present in approved drugs but that approximately one in two NP ring systems are represented by ring systems with identical or related 3D shape and electrostatic properties in compounds that are typically used in (high-throughput) screening.
NPs are, on average, heavier and more hydrophobic than synthetic compounds explored in the context of drug discovery.4–6 They also feature a higher content of oxygen atoms and a lower content of nitrogen atoms.4,5 Most outstanding, however, is their enormous structural diversity and, in part, high molecular complexity.5–7 In particular the stereochemical properties of NPs can pose fundamental challenges to organic chemistry.
Due to the difficulties involved in the sourcing and synthesis of NPs, the availability of materials for experimental testing is limited.8 In a recent survey of more than 250000 known NPs we found that only approximately 10% are readily obtainable from commercial and non-commercial sources.9 Experimental high-throughput screening (HTS) therefore rarely is an option in NPs research. Instead, a strategy which has been applied very successfully in the search for novel, bioactive NPs is virtual screening.10 The power of virtual screening methods lies in their capacity to cherry-pick the few, most promising compounds for sourcing and testing, thereby enabling researchers to optimise the use of the limited experimental resources. Examples include the identification of influenza neuraminidase inhibitors with docking11 and shape-based approaches,12 the discovery of bioactive triterpenes from Ganoderma lucidum as farnesoid X receptor agonists with a pharmacophore-based approach,13 and the identification of a truxillic acid derivative as a selective activator of the peroxisome proliferator-activated receptor gamma with machine learning methods.14
Virtual screening can substantially reduce the time and costs involved in the identification of bioactive NPs. However, just about 5% of the approved small-molecule drugs are unmodified NPs3 and only a fraction of them are sourced directly from organisms. (Partial) synthesis hence remains an essential component of the research and production of NPs-based drugs. Consequently, the synthetic accessibility of NPs and their derivatives is a major concern.
An established strategy to address the issue of synthetic accessibility is the design of simplified NP derivatives that can be delivered by total synthesis. However, these derivatives often lack key structural features for compounds to exhibit high biological activity.15 Modern synthetic strategies aim at covering the biologically relevant NP space.15 For example, in biology-oriented synthesis (BIOS), focused compound libraries are generated that are inspired by known bioactive NPs.16 Likewise, the pharmacophore-directed retrosynthesis strategy considers the pharmacophore (or “pharmacophoric” elements) of NPs for the elaboration of retrosynthetic routes.17
The vast majority of small-molecule drugs (>90%) contain at least one ring system, regardless of whether they are of natural or synthetic origin.18 These ring systems form the structural core of most molecules and determine their shape and conformational flexibility, as well as the orientation of substituents.19 In consequence, they are often essential to biological activity.20 In contrast to molecular scaffolds (or frameworks), which are most commonly defined as the union of a molecule's ring systems and linkers,21 and which represent a substantial subset of atoms of molecules, ring systems present a more fine-grained concept of molecular design (in particular fragment-based design) and compound optimization.22–24 However, despite the importance of ring systems to bioactivity, there is still a lack of a comprehensive overview and understanding of NP ring systems and how their full potential can be harnessed to boost the discovery and design of new drugs. One reason for this knowledge gap is the fact that most reported studies of the physicochemical properties of NPs are focused on complete molecular structures6,25–27 or molecular scaffolds28–32 rather than ring systems.
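To make the distinction between scaffolds and ring systems concrete, the following minimal sketch splits a molecular graph into ring systems. It assumes that a ring system is a set of atoms connected through ring bonds, so that fused and spiro-linked rings are merged while linkers and substituents are dropped. This working definition, the function name `ring_systems` and the toy graph are illustrative assumptions and not necessarily the protocol used in this study.

```python
from collections import defaultdict


def ring_systems(bonds):
    """Group ring atoms into ring systems (assumed definition, see lead-in).

    bonds: iterable of (atom_index, atom_index) pairs describing a molecular graph.
    Returns a list of sorted atom-index lists, one per ring system.
    """
    bonds = [tuple(b) for b in bonds]
    adj = defaultdict(set)
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)

    def still_connected(a, b):
        # A bond is a ring bond if its two atoms stay connected without it.
        seen, stack = {a}, [a]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if {u, v} == {a, b} or v in seen:
                    continue
                if v == b:
                    return True
                seen.add(v)
                stack.append(v)
        return False

    # Keep only ring bonds; linkers and substituent bonds disappear here.
    ring_adj = defaultdict(set)
    for a, b in bonds:
        if still_connected(a, b):
            ring_adj[a].add(b)
            ring_adj[b].add(a)

    # Each connected component of the ring-bond graph is one ring system.
    systems, assigned = [], set()
    for start in sorted(ring_adj):
        if start in assigned:
            continue
        component, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in component:
                continue
            component.add(u)
            stack.extend(ring_adj[u] - component)
        assigned |= component
        systems.append(sorted(component))
    return systems


# Toy graph: two six-membered rings (atoms 0-5 and 6-11) joined by the
# linker bond 5-6, plus one substituent atom (12) on atom 0.
ring_a = [(i, (i + 1) % 6) for i in range(6)]
ring_b = [(6 + i, 6 + (i + 1) % 6) for i in range(6)]
biphenyl_like = ring_a + ring_b + [(5, 6), (0, 12)]
print(ring_systems(biphenyl_like))
```

In this toy case the scaffold would comprise both rings and the linker bond, whereas the sketch returns the two rings as separate ring systems.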
Among the few studies focusing on NP ring systems is the pioneering work published by Lee and Schneider in 2001.33 They analysed the ring systems present in a set of 10495 NPs and found that only 17% of these ring systems are represented in a comprehensive collection of drugs. Ertl and Schuffenhauer5 extended this type of analysis to a processed set of 113664 unique molecules extracted from the Dictionary of Natural Products (DNP).34 One of their key messages is that NP ring systems form a highly diverse, feature-rich pool of structural templates for library and compound design. Many of the NP ring systems are of moderate complexity and have benign properties that make them promising starting points for drug discovery.
The existing studies provide valuable insights into the properties of NP ring systems but they are clearly limited with respect to the coverage of the known NP space (which is expanding quickly), and they largely disregard some key molecular properties related to stereochemistry, 3D shape and electrostatics. With this work, we aim to overcome these limitations by building on comprehensive, curated compound libraries and emphasising 3D molecular properties. This allows us to develop a comprehensive and accurate picture of the diversity and physicochemical properties of NP ring systems, and to determine their coverage by synthetic compounds (SCs).
For the representation of the NP chemical space we referred to the Collection of Open Natural Products (COCONUT) database.36 With over 400000 listed compounds, the COCONUT database is the largest public resource of molecular information on NPs. Likewise, for the representation of the SC chemical space, we referred to the “in-stock subset” of the ZINC20 database,37 with more than 9 million readily obtainable compounds (which are typically used in virtual screening and HTS). The ZINC20 database is one of the largest resources of molecular information on purchasable compounds. Both data sets were processed in order to identify and remove any SCs contaminating the COCONUT database and any NPs included in the ZINC20 subset.
In the context of NPs research, stereochemical information is particularly important. Nevertheless, cheminformatics studies often disregard stereochemical information because it is incomplete and sometimes even wrong.38 However, the question of whether or not to consider stereochemical information can mean the need to decide between a substantial loss of molecular structures (by disregarding any molecular structures with incomplete annotations) and reduced accuracy. In this analysis, we follow a two-pronged approach, depending on the data situation and relevance of stereochemistry to the analysed properties (as indicated at the appropriate locations in the text):
• Analysis disregarding stereochemical information: this approach prioritises data quantity and comprehensiveness over accurate representation. It is primarily used for analysing properties that are not influenced by configurations (e.g. the number of heavy atoms in a ring system).
• Analysis considering stereochemical information: this approach prioritises the correct configuration of tetrahedral atoms over data quantity and comprehensiveness. It is used whenever the stereochemical information adds value and the data situation permits.
We explain the workings of these approaches by the example of different representations of usnic acid depicted in Fig. 1b: the available chemical information indicates that the molecules depicted on the left and in the middle are structurally distinct. For the molecule depicted on the right no conclusion can be drawn as to whether or not it is identical with the molecule depicted on the left or the molecule depicted in the middle (due to a lack of stereochemical annotation). The analysis considering stereochemical information exactly follows this logic whereas the analysis disregarding stereochemical information considers all three structures depicted in Fig. 1b as identical.
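The comparison logic of the two approaches can be sketched in a few lines. This is a minimal illustration under stated assumptions: each structure is reduced to a hypothetical `(constitution, configuration)` record, the configuration labels are placeholders, and the function `compare` is not taken from the authors' code.

```python
def compare(a, b, use_stereo):
    """Compare two structure records.

    Each record is a (constitution, configuration) tuple; configuration is
    None when the stereochemical annotation is missing.
    """
    if a[0] != b[0]:
        return "distinct"
    if not use_stereo:
        # Stereochemistry disregarded: same constitution means same structure.
        return "identical"
    if a[1] is None or b[1] is None:
        # Stereochemistry considered, but at least one annotation is missing.
        return "undetermined"
    return "identical" if a[1] == b[1] else "distinct"


# Hypothetical records mirroring the three depictions of usnic acid in Fig. 1b.
left = ("usnic acid", "configuration A")
middle = ("usnic acid", "configuration B")
right = ("usnic acid", None)

print(compare(left, middle, use_stereo=True))
print(compare(left, right, use_stereo=True))
print(compare(left, right, use_stereo=False))
```

With stereochemistry considered, the left and middle records are distinct and the right record cannot be assigned to either; with stereochemistry disregarded, all three collapse into one structure.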
Ring system sets | No. unique compounds | No. unique ring systems | No. compounds/no. ring systemsᵃ | No. singletonsᵇ | Fraction of singletonsᶜ | No. macrocycles | Fraction of macrocyclesᶜ | No. chiral ring systems | Fraction of chiral ring systemsᶜ
---|---|---|---|---|---|---|---|---|---
When considering stereochemical information | | | | | | | | |
Natural productsᵈ | 269 … | 38662 | 6.96 | 23 … | 0.60 | 7597 | 0.20 | 32 … | 0.83
Plants | 63658 | 12217 | 5.21 | 7772 | 0.64 | 896 | 0.07 | 10 … | 0.83
Bacteria | 15 … | 3094 | 5.06 | 1760 | 0.57 | 1184 | 0.38 | 1869 | 0.60
Fungi | 17 … | 4938 | 3.63 | 3112 | 0.63 | 525 | 0.11 | 3829 | 0.78
Marine | 35 … | 7204 | 4.91 | 4200 | 0.58 | 1821 | 0.25 | 5552 | 0.77
Synthetic compounds | 8 … … | 53229 | 165.14 | 26 … | 0.49 | 1636 | 0.03 | 33 … | 0.62
Approved drugs | 2238 | 602 | 3.72 | 357 | 0.59 | 52 | 0.09 | 186 | 0.31
When disregarding stereochemical information | | | | | | | | |
Natural productsᵈ | 246 … | 31003 | 7.95 | 16 … | 0.53 | 6619 | 0.21 | 24 … | 0.79
Plants | 54 … | 8418 | 6.43 | 4851 | 0.58 | 677 | 0.08 | 6388 | 0.76
Bacteria | 14 … | 2694 | 5.32 | 1408 | 0.52 | 1003 | 0.37 | 1471 | 0.55
Fungi | 16 … | 4126 | 3.94 | 2411 | 0.58 | 459 | 0.11 | 3016 | 0.73
Marine | 32 … | 6195 | 5.25 | 3431 | 0.55 | 1575 | 0.25 | 4493 | 0.73
Synthetic compounds | 6 … … | 30265 | 207.41 | 15 … | 0.51 | 1388 | 0.05 | 10 … | 0.33
Approved drugs | 2225 | 596 | 3.73 | 351 | 0.59 | 52 | 0.09 | 180 | 0.30

ᵃ Average number of compounds containing a specific ring system. ᵇ Number of ring systems represented by only a single compound. ᶜ Among all ring systems. ᵈ A substantial number of NPs cannot be assigned to a specific subset due to a lack of annotations of the origin of NPs.
With a total of 63658 compounds, the subset of plant NPs is approximately half the size of the subset of plant NPs included in the DNP (the most comprehensive, commercial database of its kind). The lower number of plant NPs results from the fact that some of the largest non-commercial databases relevant to plant NPs (such as the Universal Natural Products Database (UNPD)40 with more than 220k NPs, and the Traditional Chinese Medicine (TCM) Database@Taiwan41 with more than 60k TCM-related NPs) lack the annotations needed to identify their compounds as plant NPs. For this reason, the non-labelled NPs could not be considered in this specific part of the analysis.
The number of unique ring systems in NPs is 38662. Across the subset of NPs from plants (with 63658 compounds the largest NP subset) we count 12217 unique ring systems (see Table 1 for additional information). For comparison, the numbers of unique ring systems in SCs and approved drugs are 53229 and 602, respectively. This means that a unique ring system represents, on average, approximately 7 NPs, 165 SCs and 4 approved drugs. Considering the fact that the set of SCs is approximately 32-fold larger than the set of NPs, this corroborates the remarkable diversity of NPs.
When disregarding stereochemical information, the numbers of unique ring systems observed in NPs and SCs are reduced to 31003 (−20%) and 30265 (−43%), respectively (Table 1). The substantially larger decrease in SCs is related to the fact that NPs result primarily from stereoselective biochemical synthesis while synthetic compounds are products primarily of non-selective chemical reactions (hence, the levels of stereochemical annotations differ accordingly).
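The two percentages can be reproduced from the ring system counts quoted in the text. The short sketch below only illustrates the arithmetic; the helper name `relative_change` is hypothetical.

```python
def relative_change(with_stereo, without_stereo):
    """Percentage change in the ring-system count when stereochemistry is ignored."""
    return 100.0 * (without_stereo - with_stereo) / with_stereo


# Counts of unique ring systems as quoted in the text (Table 1).
np_change = relative_change(38662, 31003)
sc_change = relative_change(53229, 30265)
print(f"NPs: {np_change:.1f}%  SCs: {sc_change:.1f}%")
```

Rounded to whole numbers, the results correspond to the −20% and −43% reported above.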
From rank 7 on, the lists of ring systems start to differ, depending on whether or not stereochemical information is considered. For example, ranked no. 7, when disregarding stereochemical information, is the scaffold common to pentacyclic triterpenes (such as ursolic acid), representing 2.05% of all NPs. In contrast, when considering stereochemical information, the first representation of a pentacyclic triterpene is placed on rank 22 only (representing 0.63% of all NPs). This discrepancy stems from the fact that for this scaffold (counting seven tetrahedral atoms) a total of 105 stereoisomers are recorded among the NPs (ESI Fig. S2† shows 30 of these stereoisomers).
Among the 30 most common NP ring systems, 17 ring systems (when considering stereochemical information; 15 when disregarding this information) contain at least one oxygen atom and 6 (in either approach) contain at least one nitrogen atom. Furthermore, in both approaches, 11 out of the 30 top-ranked ring systems (∼37%) are aromatic and a further 5 ring systems contain at least one aromatic ring.
Because of the lower proportion of ring systems with chiral centres among SC ring systems compared to NP ring systems (Table 1 and Fig. 6d), the rankings of the most common ring systems are largely unaffected by the consideration or disregard of stereochemical information (Fig. 2b). For this reason we limit our discussion in the following paragraphs to the approach considering stereochemical information.
As for the NP ring systems, rank 1 of the SC ring systems is held by benzene, although at a much higher percentage of occurrence in molecules (74% vs. 29%). Again, the decline in the frequencies of the individual ring systems is steep (ESI Fig. S1b†). An abundance of nitrogen-containing ring systems (70% vs. 20% among the 30 top-ranked NP ring systems, Table 2), mostly of aromatic character (70% vs. 53% of the ring systems contain at least one aromatic ring), is apparent. Besides benzene, twelve of the most common ring systems in SCs are also common across the 30 top-ranked NP ring systems: tetrahydrofuran, cyclohexane, furan, cyclopentane, 1H-indole, pyridine, 1H-pyrroline, imidazole, 1,3-benzodioxole, naphthalene, pyrrolidine and piperidine. In contrast, sulphur-containing ring systems are observed only among the 30 top-ranked ring systems from SCs and not from NPs.
Features | NP ring systems, top-30ᵃ | NP ring systems, top-100ᵃ | NP ring systems, allᵃ | SC ring systems, top-30ᵃ | SC ring systems, top-100ᵃ | SC ring systems, allᵃ
---|---|---|---|---|---|---
Only carbon atoms | 0.23 (0.30) | 0.28 (0.29) | 0.08 (0.07) | 0.17 (0.17) | 0.12 (0.11) | 0.08 (0.05)
Only carbon and oxygen atoms | 0.80 (0.80) | 0.75 (0.75) | 0.65 (0.62) | 0.27 (0.23) | 0.22 (0.21) | 0.31 (0.20)
At least one oxygen atom | 0.57 (0.50) | 0.55 (0.52) | 0.84 (0.85) | 0.27 (0.27) | 0.46 (0.44) | 0.68 (0.66)
Only carbon and nitrogen atoms | 0.43 (0.50) | 0.43 (0.46) | 0.15 (0.14) | 0.60 (0.60) | 0.44 (0.46) | 0.25 (0.24)
At least one nitrogen atom | 0.20 (0.20) | 0.24 (0.24) | 0.34 (0.36) | 0.70 (0.73) | 0.74 (0.75) | 0.64 (0.75)
At least one sulfur atom | 0.00 (0.00) | 0.02 (0.02) | 0.05 (0.06) | 0.13 (0.17) | 0.17 (0.17) | 0.18 (0.23)
Aromatic atoms | 0.53 (0.53) | 0.49 (0.41) | 0.37 (0.41) | 0.70 (0.70) | 0.64 (0.65) | 0.49 (0.62)
Heteroaromatic atoms | 0.30 (0.30) | 0.28 (0.23) | 0.15 (0.17) | 0.60 (0.60) | 0.50 (0.53) | 0.28 (0.41)
At least one stereocentre (tetrahedral atom) | 0.07 (0.13) | 0.13 (0.30) | 0.83 (0.79) | 0.00 (0.00) | 0.00 (0.00) | 0.62 (0.33)
At least one non-aromatic C… | 0.17 (0.17) | 0.29 (0.38) | 0.64 (0.63) | 0.00 (0.03) | 0.13 (0.13) | 0.32 (0.26)

ᵃ Considering stereochemical information (disregarding stereochemical information).
For NPs and SCs alike, a high percentage of ring systems (between approximately 50% and 60%, depending on whether or not stereochemical information is considered; see Table 1) are singletons, meaning that they are recorded in a single compound only. Among the approved drugs, the percentage of singletons is close to that calculated for NPs (59%, with and without the consideration of stereochemical information). After removal of singletons we visualised and analysed every 500th ring system, as well as the 30 most distinct ring systems, based on k-means clustering (ESI Fig. S3 and S4†). These additional analyses revealed a higher proportion of macrocyclic structures among the NP ring systems than among the SC ring systems (ESI Fig. S3 and S4,† Table 1 and Section 2.1.5). The fused polycyclic steroid- and terpene-derived ring systems observed in the SC data set are likely related to the semi-synthetic origin of some of the compounds.
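The singleton statistics reported in Table 1 can be derived from a simple mapping of compounds to the ring systems they contain. The following is a minimal sketch with made-up identifiers; the function name, the dictionary keys and the toy data are assumptions for illustration, and the compounds-per-ring-system value is computed as the plain ratio of the two counts, as the column header of Table 1 suggests.

```python
from collections import defaultdict


def ring_system_statistics(compound_ring_systems):
    """compound_ring_systems maps a compound id to the ring systems it contains."""
    # Invert the mapping: ring system -> set of compounds containing it.
    compounds_by_system = defaultdict(set)
    for compound, systems in compound_ring_systems.items():
        for system in systems:
            compounds_by_system[system].add(compound)

    n_compounds = len(compound_ring_systems)
    n_systems = len(compounds_by_system)
    # A singleton is a ring system recorded in exactly one compound.
    n_singletons = sum(1 for members in compounds_by_system.values() if len(members) == 1)
    return {
        "n_compounds": n_compounds,
        "n_ring_systems": n_systems,
        "compounds_per_ring_system": n_compounds / n_systems if n_systems else 0.0,
        "n_singletons": n_singletons,
        "fraction_singletons": n_singletons / n_systems if n_systems else 0.0,
    }


# Toy data set (hypothetical compound ids and ring system names).
toy = {
    "c1": ["benzene"],
    "c2": ["benzene", "furan"],
    "c3": ["benzene"],
    "c4": ["indole"],
}
stats = ring_system_statistics(toy)
print(stats)
```

In the toy set, furan and indole are singletons, so two of the three ring systems occur in a single compound only.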
Overall, the set of most common SC ring systems is clearly of lower complexity and diversity than the respective set of NP ring systems. Additional statistics on the composition of NP and SC ring systems are reported in Table 2.
There are 271 ring systems that are represented in NPs from all three investigated kingdoms (plants, bacteria and fungi) and marine life (Fig. 3). Thirty-five of these ring systems are observed in at least 20 NPs of each of these subsets (ESI Fig. S5†). These 35 ring systems are structurally non-complex and include (i) benzene and common 5-membered and 6-membered aromatic and saturated heterocycles (e.g. tetrahydrofuran and furan, pyrrole and pyridine), (ii) bicyclic ring systems such as benzoquinone and indole, and (iii) the tricyclic ring systems anthraquinone and carbazoline. Overall, an accumulation of cyclic ketones is observed, which points to the importance of the polyketide pathway to secondary metabolism across species from different kingdoms. Furthermore, the presence of alkaloids highlights the relevance of amino acid metabolism. The proportions of ring systems exclusive to the individual kingdoms (and marine life) are high: 78.70% (6625) for plants, 55.35% (1491) for bacteria, 52.81% (2179) for fungi and 54.97% (3406) for marine species.
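Counting shared and exclusive ring systems amounts to basic set operations. The sketch below shows the idea on toy data; the function name `shared_and_exclusive` and the ring system labels are hypothetical and only serve as an illustration.

```python
def shared_and_exclusive(subsets):
    """subsets maps a subset name to the set of ring systems observed in it.

    Returns the ring systems present in every subset and, per subset,
    the ring systems found in no other subset.
    """
    shared = set.intersection(*subsets.values())
    exclusive = {}
    for name, systems in subsets.items():
        others = set().union(*(s for n, s in subsets.items() if n != name))
        exclusive[name] = systems - others
    return shared, exclusive


# Toy ring system sets (labels are placeholders, not real data).
toy_subsets = {
    "plants": {"benzene", "indole", "rs_A"},
    "bacteria": {"benzene", "indole", "rs_B"},
    "fungi": {"benzene", "rs_C"},
    "marine": {"benzene", "indole", "rs_D"},
}
shared, exclusive = shared_and_exclusive(toy_subsets)
print(sorted(shared))
for name, systems in exclusive.items():
    share = 100.0 * len(systems) / len(toy_subsets[name])
    print(f"{name}: {len(systems)} exclusive ({share:.2f}%)")
```

The exclusive proportion of a subset is the number of its exclusive ring systems divided by the total number of ring systems in that subset.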
The 30 most frequent ring systems observed exclusively in NPs are shown in Fig. 4a (stereochemistry considered). They represent 114 to 506 NPs each. A high degree of diversity is observed among these ring systems, although the ring systems ranked 9 and 20 differ only by their stereochemical configuration and those ranked 12 and 29 differ by only one atom.
Principal component analysis (PCA) of the 14 physicochemical properties confirms the superior chemical diversity of the ring systems present in NPs in comparison to those present in SCs (Fig. 5; see the ESI† Method section for the exact protocol). PC1 is dominated by molecular weight and the number of heavy atoms: larger ring systems are located towards the right of the plot, and most of them originate from NPs. PC2 primarily describes the polarity of the ring systems: more polar ring systems are located towards the upper part of the plot. The NP ring systems populate a wider chemical space than those derived from SCs. The area most densely populated with NP and SC ring systems alike is also the one that is of primary relevance to small-molecule drug discovery. It envelops all of the 30 most frequent ring systems observed in NPs.
The diversity of the compounds and the trends observed by PCA are reflected by the distributions of the individual physicochemical properties (Fig. 6): the ring systems extracted from NPs are more diverse than those originating from SCs with respect to their size (represented by their molecular weight (MW) and the number of heavy atoms (a_heavy); Figs. 6a and g), the number of rings (Fig. 6l), and the number of oxygen atoms (Fig. 6e). In contrast, SCs show a wider distribution of the number of nitrogen atoms (Fig. 6f) and the hydrogen bond donors (Fig. 6i).
Rings are more frequent in NPs than they are in SCs: on average, NPs contain 4.65 rings whereas SCs contain only 2.98 (Fig. 6l). Moreover, NP ring systems are, on average, larger (heavier) than SC ring systems (MW 302 Da vs. 193 Da, Fig. 6a; a_heavy 22 vs. 14, Fig. 6g) and more lipophilic (logP 3.26 vs. 1.57, Fig. 6b). NP ring systems are often also more complex with regard to their 3D molecular shape: they have a higher fraction of sp3 hybridised carbon atoms (0.57 vs. 0.39; Fig. 6c) and a higher number of bridgehead atoms (1.59 vs. 0.42; Fig. 6k). Related to these characteristics, NP ring systems are more likely to contain chiral tetrahedral atoms (76% in ring systems derived from plant NPs; 55% derived from bacterial NPs; 73% derived from fungi NPs; 73% derived from marine life NPs; Table 1; when disregarding stereochemical information) compared to both SCs (33%) and approved drugs (30%). Also, the average number of chiral centres in ring systems from NPs is 3.75 compared to 0.94 for ring systems from SCs (Fig. 6d).
NP ring systems further stand out by a higher average number of oxygen atoms (2.43 vs. 1.21; Fig. 6e), a lower average number of nitrogen atoms (0.87 vs. 1.74; Fig. 6f), and a lower degree of aromaticity (i.e. average number of aromatic ring atoms 4.42 vs. 6.03; Fig. 6j) compared to SC ring systems. With respect to the topological polar surface area (TPSA) and formal charges, the ring systems observed in NPs and SCs are comparable (Fig. 6m and n). Among the NP ring systems, 99.8% have a formal charge of 0 or 1. Among the SC ring systems, this percentage is 99.3%.
In general, the trends observed for the property distributions among ring systems are consistent with those observed for the complete molecules.4,5 One exception is lipophilicity, which is, on average, higher for NP than for SC ring systems (logP 3.26 vs. 1.57) but comparable for the complete molecules (logP 3.25 vs. 3.31; data from ref. 4).
Approximately one out of two NP ring systems (∼13500) is matched by a ring system of the SC data set with ET_combo scores of at least 1.60, a threshold above which ring systems can be typically considered as structurally closely related or “covered” (Fig. 7; data for alternative ET_combo score thresholds is provided in ESI Table S2†). This means that roughly half of the recorded NP ring systems are accessible to synthetic organic chemistry. It also means, however, that there is another half, in other words a large number of NP ring systems, that is clearly still underexplored as potentially relevant structural templates for drug discovery.
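The coverage figure follows from a simple thresholding step once the best-matching SC ring system has been identified for each NP ring system. The sketch below assumes that these best ET_combo scores are already available; the function name `coverage`, the ring system labels and the scores are made up for illustration and are not taken from the study.

```python
def coverage(best_scores, cutoff=1.60):
    """Fraction of NP ring systems whose best SC match reaches the cutoff.

    best_scores maps each NP ring system to the highest ET_combo score
    obtained against any ring system of the SC data set.
    """
    if not best_scores:
        return 0.0
    covered = sum(1 for score in best_scores.values() if score >= cutoff)
    return covered / len(best_scores)


# Hypothetical best-match scores for four NP ring systems.
toy_scores = {"rs1": 1.95, "rs2": 1.60, "rs3": 1.42, "rs4": 0.88}
print(coverage(toy_scores))        # coverage at the 1.60 cutoff
print(coverage(toy_scores, 1.40))  # coverage at a more permissive cutoff
```

Lowering the cutoff increases the apparent coverage, which is why results for alternative thresholds are reported alongside the main figure.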
In this comprehensive analysis of the ring systems present in NPs we show that about one in two NP ring systems (∼13500) are represented by ring systems with identical or related 3D shape and electrostatic properties in readily obtainable, synthetic compounds (which are typically used for virtual screening and HTS). At the same time, only about 2% of the 38662 unique ring systems observed in NPs (stereochemistry considered) are components of approved drugs, leaving a huge pool of potentially relevant ring systems yet to be exploited in small-molecule drug discovery. One particular area of interest is macrocycles, which are represented in significant numbers among NPs (7597 unique macrocycles, representing 20% of the total number of unique ring systems; stereochemistry considered) but not among SCs (1636 unique macrocycles, representing 3%).
While it will take time for synthetic and biotechnological approaches to advance the exploration of NP ring systems, the full wealth of information on the existing ring systems can be exploited already today, by rapidly advancing artificial intelligence (AI) technologies for compound design.7,50–53 These technologies utilise various types of biological, chemical and structural information to train models that can generate NP-inspired compounds which have a high likelihood of being synthetically accessible and active on the target(s) of interest. The most promising way forward certainly is the integration of these in silico approaches with the advanced experimental techniques that are at our disposal today. The synergy generated from this effort will boost NPs research and small-molecule drug discovery.
Footnote
† Electronic supplementary information (ESI) available: Details on the computational methods and how to access the source code; Table S1, reporting the full names of the data sources of the COCONUT database; Table S2, reporting the numbers and percentages of NP ring systems that are matched by a ring system in the SC data set at different cutoffs of the ET_combo score. Fig. S1, showing the occurrences (in percent) of the 30 most frequent ring systems in (a) NPs and (b) SCs when considering stereochemical information; Fig. S2, showing the 30 most frequent stereoisomers of the pentacyclic triterpene ranked no. 7 of the NP ring system set when disregarding stereochemical information; Fig. S3, showing every 500th (a) NP ring system and (b) SC ring system (stereochemical information considered; singletons omitted); Fig. S4, showing the 30 most diverse (a) NP ring systems and (b) SC ring systems (identified by a k-means clustering method implemented using scikit-learn and RDKit that takes Morgan2 fingerprints with a length of 1024 bits as input; singletons removed prior to clustering); Fig. S5, showing the 35 ring systems recorded for at least 20 times in each of the subsets of NPs from plants, bacteria, fungi and marine life. See https://doi.org/10.1039/d2np00001f
This journal is © The Royal Society of Chemistry 2022