Nadja B. Cech,*a Marnix H. Medema*b and Jon Clardy*c
aChemistry, University of North Carolina Greensboro, USA. E-mail: nadja_cech@uncg.edu
bBioinformatics, Wageningen University, The Netherlands. E-mail: marnix.medema@wur.nl
cBiological Chemistry and Molecular Pharmacology, Harvard Medical School, USA. E-mail: jon_clardy@hms.harvard.edu
First published on 4th November 2021
Systematic, large-scale studies at the genomic, metabolomic, and functional levels have transformed the natural product sciences. Improvements in technology and reductions in the cost of obtaining spectroscopic, chromatographic, and genomic data, coupled with the creation of readily accessible, curated, and functionally annotated data sets, have altered the practices of virtually all natural product research laboratories. Gone are the days when natural products researchers were expected to devote themselves exclusively to the isolation, purification, and structure elucidation of small molecules. We now also engage with big data in taxonomic, genomic, proteomic, and/or metabolomic collections, and use these data to generate and test hypotheses. While the oft-stated aim for the use of large-scale -omics data in the natural products sciences is to achieve a rapid increase in the rate of discovery of new drugs, this has not yet come to pass. At the same time, new technologies have provided unexpected opportunities for natural products chemists to ask and answer new and different questions. With this viewpoint, we discuss the evolution of big data as a part of natural products research and provide a few examples of how discoveries have been enabled by access to big data. We also draw attention to some of the limitations in our existing engagement with large datasets and consider what would be necessary to overcome them.
Another pivotal moment in the evolution of the natural products field came when the Hopwood lab at the John Innes Centre reported that the genes encoding the enzymes responsible for the production of a natural product by Streptomyces coelicolor were clustered on a stretch of DNA.4 This recognition led to our current ability to detect biosynthetic gene clusters (BGCs) in microbial (and other) genomes, and to parse these clusters to describe the molecules they produce. Most importantly, the development of tools to detect gene clusters opened up the possibility of interrogating pre-existing genomic data to probe for compounds of interest. The ability to harness genomics for natural products research was further fueled by the large-scale sequencing of genomes and later metagenomes, advances that were possible thanks to drastic reductions in DNA sequencing costs and increased capacity for computer storage. Technological innovations led to sociological adjustments as well. What used to be isolated natural products research efforts began to involve team and community contributions. Natural products scientists no longer devoted themselves exclusively to isolation and structure elucidation. They also undertook efforts to curate and maintain databases and to improve the tools for analyzing them.
Changes in the magnitude and type of data available to researchers in the natural products field are reflected in what we view as natural products research today. In search of relevant natural products, we routinely interrogate entire genomes or metagenomes and complex mixtures of proteins (proteomes) or small molecule metabolites (metabolomes). Increasingly, the data that support these projects do not reside entirely in a single laboratory but are shared in public and community-supported databases. The consequences of this shift, which we refer to here as the big data revolution in natural products, form the basis of this themed issue of Natural Product Reports.
Reading the articles included in this themed issue, it is possible to imagine a future for natural products that is increasingly collaborative, leveraging the collective intelligence, skillsets, perspectives, and, importantly, data of scientists around the world. In idealized natural products research projects of the future, organisms of interest would be selected not only on the basis of serendipity and accessibility, but also by comparing genetic, spectroscopic, or functional data from curated databases (Chevrette et al., DOI: 10.1039/D1NP00013F; Bauman et al., DOI: 10.1039/D1NP00032B; Chevrette and Handelsman, DOI: 10.1039/D1NP00044F). Computational algorithms trained on data from such databases would be used to predict the structures of the secondary metabolites produced by the organisms of interest (Caesar et al., DOI: 10.1039/D1NP00036E),5 the families to which these compounds belong (van Santen et al., DOI: 10.1039/D0NP00053A), and even their biological activity (Jeon et al., DOI: 10.1039/D1NP00016K). Larger scale, better integrated, and higher quality datasets of the gene sequences, protein sequences, and small molecule structures associated with living organisms (Bauman et al., DOI: 10.1039/D1NP00032B) would empower future research using artificial intelligence as a powerful new discovery tool (Jeon et al., DOI: 10.1039/D1NP00016K). An important caveat to these optimistic scenarios is the eventual necessity of actually producing the molecules whose existence and properties can be inferred, so that their structures and functions can be experimentally verified. Thus, it is critical that in our pursuit of new and exciting technologies, we do not lose sight of the need to train the younger generation of natural products scientists in the skills necessary to isolate natural products and solve their structures.
The reviews collected in this themed issue describe rapid progress that is being made on many fronts, all of which promise to contribute to a more integrated, collaborative, and efficient future for natural products research. These reviews also highlight many critical barriers that still stand in the way of fully leveraging big data for successful natural products research. Spectroscopic data, structural data, and genomic data are distributed across many databases, none of which are fully comprehensive. Many databases are not curated and may include erroneous information, such as incorrectly annotated gene clusters, incorrect structures, or errors in annotation of spectral data. Also, databases containing gene and protein sequences and chemical structures are, as yet, not optimally integrated with each other. Metadata are often lacking. Much of the data collected thus far is either proprietary or buried in the scientific literature in a format that is not easily searchable. Significant resources in terms of time and money are needed to address all of these issues, and the work required is, to quote van Santen et al. (DOI: 10.1039/D0NP00053A), “unglamorous.” Finally, the understandable desire on the part of many researchers to protect the intellectual property associated with their data hinders efforts to make those data more freely available.
What incentives exist to overcome these barriers? The big data revolution has not just changed how we do natural products research, it has also changed what we define as natural products research, creating opportunities to explore new questions and to interface in new ways with scientists in related fields. In the coming years, we expect that the impact of big data will continue to be felt across the community of scientists doing natural products research, and that creative solutions will be developed to address the most pressing hurdles currently hampering such progress. Here we discuss some of these specific hurdles and provide a few examples of how big data has been effectively leveraged despite them.
Our field is likely nearing a tipping point where researchers will begin to rely more on open-access databases and less on historical subscription-only compendia of natural product structures such as MarinLit, AntiBase, and the Dictionary of Natural Products. At the present time, however, the open access databases are still not comprehensive enough to entirely replace the subscription-based sources, and the existence of multiple different platforms for storing chemical structures, each with different coverage and linked to different types of information, creates a great deal of confusion among researchers. We are still some way off from the future vision of a single, comprehensive database or systematic cross-linking and integration of existing databases. There is no question, however, that the existence of such resources would be of tremendous benefit to the continued success of natural products research endeavors.
Variability in MS data across platforms also arises from differences in the types of clusters, fragments, and adducts produced by different electrospray source designs and configurations. This variability in ionization behavior makes it difficult to compare LC-MS datasets between laboratories.12 Furthermore, there is a tendency to overestimate the complexity of metabolomics datasets because each individual analyte gives rise to more than one signal.13 The complexity of MS datasets is further increased by interference from chemical species present as contaminants in the system, originating from the solvents, the column, the plumbing, and even the laboratory atmosphere.14,15 This complexity becomes particularly challenging in untargeted metabolomics experiments, where the goal is often to track or annotate all analytes present in each mixture. Thus, there is a need for effective approaches to reduce the complexity of such datasets, either by grouping the signals associated with single analytes16 and/or by removing irrelevant signals that arise from chemical interference.14
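The signal-grouping idea described above can be sketched in a few lines of code: features that co-elute and differ in m/z by a known adduct mass shift are tentatively attributed to a single analyte. This is a minimal illustration under simplifying assumptions, not any specific published algorithm; the feature values, tolerances, and adduct table below are hypothetical.

```python
# Common positive-mode mass differences relative to [M+H]+ (Da); illustrative subset
ADDUCT_DELTAS = {
    "[M+Na]+": 21.9819,   # Na replaces H
    "[M+K]+": 37.9559,    # K replaces H
    "[M+NH4]+": 17.0265,  # ammonium adduct
}

def group_features(features, rt_tol=0.1, mz_tol=0.005):
    """Group (m/z, retention time) features into putative single-analyte clusters."""
    groups = []
    for mz, rt in sorted(features, key=lambda f: f[0]):
        for group in groups:
            base_mz, base_rt = group[0]  # compare against the lowest-m/z member
            if abs(rt - base_rt) <= rt_tol and any(
                abs((mz - base_mz) - d) <= mz_tol for d in ADDUCT_DELTAS.values()
            ):
                group.append((mz, rt))
                break
        else:
            groups.append([(mz, rt)])
    return groups

# Three co-eluting signals from one hypothetical analyte (M+H, M+NH4, M+Na)
# plus one unrelated peak at a different retention time
features = [(301.141, 5.02), (318.167, 5.03), (323.123, 5.01), (450.200, 7.50)]
print(group_features(features))  # two groups: one analyte cluster, one singleton
```

Real feature-grouping tools also consider isotope patterns, in-source fragments, and correlated peak shapes across samples, but the underlying logic is similar.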
It is exciting to observe that an increasing number of researchers are now uploading LC-MS data files to accompany their papers in servers such as GNPS-MassIVE.17 However, poor annotation of data files, poor data quality (i.e. noise in the data or lack of appropriate blanks, QC samples, and/or replicates), uniqueness of the data to the platform on which they were collected, and a lack of associated metadata often limit the value of these data to researchers in other laboratories. The establishment and adoption of best practices for collecting, processing, and sharing metabolomics data for natural products would help to address some of these limitations.
Despite the associated challenges, we are beginning to see research projects that query publicly available mass spectrometry data across laboratories to answer scientifically interesting questions (see Jarmusch et al., DOI: 10.1039/D1NP00040C). As a recent example, Jarmusch et al. developed a tool (called ReDU)18 that enables comparison of the chemicals shared between and unique to groups of samples, and makes it possible to conduct repository-scale molecular networking. Using this tool, they profiled the distribution of 12-ketodeoxycholic acid, cholic acid, and rosuvastatin by mining more than five thousand data files of human fecal material collected across the life cycle. Their results provide insight into how the composition of the gut microbiome changes as humans age.
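Repository-scale molecular networking rests on a spectral similarity score computed between pairs of MS/MS spectra. Below is a minimal sketch of a plain cosine score; the peak lists are invented, and tools such as GNPS use a more sophisticated "modified cosine" that also allows peaks shifted by the precursor mass difference to match.

```python
import math

def cosine_score(spec_a, spec_b, mz_tol=0.02):
    """Greedy cosine similarity between two centroided MS/MS spectra,
    each given as a list of (m/z, intensity) peaks."""
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    used_b = set()
    score = 0.0
    # Match each peak in A to the best unused peak in B within m/z tolerance
    for mz_a, int_a in spec_a:
        best_j, best_prod = None, 0.0
        for j, (mz_b, int_b) in enumerate(spec_b):
            if j in used_b or abs(mz_a - mz_b) > mz_tol:
                continue
            prod = int_a * int_b
            if prod > best_prod:
                best_j, best_prod = j, prod
        if best_j is not None:
            used_b.add(best_j)
            score += best_prod
    return score / (norm_a * norm_b)

# A hypothetical spectrum scored against itself gives a perfect match
s1 = [(85.03, 40.0), (129.05, 100.0), (213.09, 20.0)]
print(round(cosine_score(s1, s1), 3))  # 1.0
```

In a molecular network, each spectrum becomes a node, and an edge is drawn between spectra whose score exceeds a chosen threshold, so that related compounds cluster into families even when neither is in a reference library.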
But even the primary data cannot always be trusted. For pragmatic reasons, volume is often preferred over quality, especially for targeted ‘screening’ approaches. As a result, the databases contain many highly fragmented draft genomes, and many genomes contain misassembled BGCs. These can easily create false impressions of biosynthetic diversity and lead to faulty hypotheses. Deposition of the raw data should therefore be more strongly encouraged (or even demanded), so that seemingly interesting variations can be reassessed by, e.g., repeating the assembly before investing in expensive and time-consuming experiments.
The sizes of omics datasets used as the basis for natural product discovery have increased by multiple orders of magnitude over the past decade. Whereas ten years ago, using 10–20 genome sequences as the basis for a natural product genome mining project was still revolutionary, many thousands are often used these days. A case in point is the effort by Warp Drive Bio to identify new rapamycin analogues in a collection of ∼135000 actinobacterial draft genomes.20 The rationale for this search stemmed from the fact that rapamycin and the related metabolite FK506 were known to bind two different targets. In both cases, binding is mediated by a conserved structural moiety that binds the FKBP12 protein, which then facilitates binding to the target through protein–protein interactions. The authors hypothesized that, within this class of polyketides, many additional biosynthetic pathways might have evolved to bind a range of other protein targets with the aid of FKBP12. They therefore performed low-coverage sequencing of thousands of actinobacterial genomes to scan for the presence of the lysine cyclodeaminase gene, which is involved in the biosynthesis of pipecolate, a key structural component of the FKBP12-binding portion of both FK506 and rapamycin. Although these genome assemblies were undoubtedly noisy and will not have contained many full-length BGCs for the production of rapamycin analogues, this approach allowed the effective prioritization of strains that might contain them. All strains with hits to this gene were subjected to complete genome sequencing and assembly to reveal the presence of BGCs potentially encoding the production of new rapamycin analogues. In the end, the team found five BGCs with novel architectures, and were able to identify a new natural product that targets human centrosomal protein 250 (CEP250), a protein that had been thought to be ‘undruggable’ due to its flat surface.
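The prioritization logic of such a targeted screen can be illustrated with a simple k-mer containment check: a draft assembly is flagged if it shares enough short subsequences with a marker gene, even when the gene is split across contigs. The sequences below are toy examples, and the actual Warp Drive Bio pipeline is not public, so this is only a conceptual sketch.

```python
def kmers(seq, k=8):
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def marker_containment(marker, contigs, k=8):
    """Fraction of the marker gene's k-mers found anywhere in the assembly.
    Fragmented assemblies can still score high if the gene is present."""
    marker_kmers = kmers(marker, k)
    assembly_kmers = set()
    for contig in contigs:
        assembly_kmers |= kmers(contig, k)
    return len(marker_kmers & assembly_kmers) / len(marker_kmers)

def prioritize(genomes, marker, threshold=0.5, k=8):
    """Return IDs of genomes whose assemblies likely contain the marker gene."""
    return [gid for gid, contigs in genomes.items()
            if marker_containment(marker, contigs, k) >= threshold]

# Toy marker gene and two toy 'assemblies'; in strain_A the marker is
# split across two contigs, strain_B lacks it entirely
marker = "ATGGCTAAATTCGGCCTGATCGATCCGAAAGTT"
genomes = {
    "strain_A": ["CCCATGGCTAAATTCGGCCTGA", "TGATCGATCCGAAAGTTACGT"],
    "strain_B": ["GGGGGGGGGGGGGGGGGGGGGG", "TTTTTTTTTTTTTTTTTTTTTT"],
}
print(prioritize(genomes, marker))  # ['strain_A']
```

Only the flagged strains would then be carried forward to full-coverage sequencing and assembly, concentrating the expensive work on the most promising candidates.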
Finding only one new BGC of this type in every ∼30000 genomes may be perceived as disappointing, but the study did show that targeted screening can be used to find needles in a big data haystack that may serve as useful starting points for drug discovery. In a similar manner, metagenomic screening efforts (discussed in depth by Robinson et al., DOI: 10.1039/D1NP00006C) have been used to unearth new calcium-dependent lipopeptide antibiotics.21,22
Untargeted approaches constitute another way in which genomic big data can be utilized, e.g. to chart extant natural product diversity and guide discovery efforts to the most promising taxa and BGCs, in order to avoid rediscovery and to target natural products with relevant activities (also discussed in Chevrette and Handelsman, DOI: 10.1039/D1NP00044F). For example, an algorithm called BiG-SLiCE was recently used to analyze the global biosynthetic diversity of ∼1.2 million BGCs across >200000 microbial genomes.19 Converting BGC sequences into vectors of numerical features made it computationally feasible to identify relationships between gene clusters at a global scale, and to rapidly assign any given query BGC to a gene cluster family containing its known and unknown relatives. Moreover, this technology also enabled a study to quantitatively assess natural product biosynthetic diversity across the tree of life,23 which suggests that only ∼3% of genomically encoded natural product classes have been discovered thus far, and that highly studied taxa like Streptomyces still harbor many yet-unknown natural products. Notably, this does not guarantee that mining the data for these unknown natural products will also yield many new drugs, such as urgently needed antibiotics, as these compound classes may already have been oversampled by genome-independent discovery approaches that, in previous decades, primarily screened culture extracts for biological activities of interest.24
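The core idea behind such vectorization can be sketched as follows: each BGC is encoded as a numerical vector (here, simple presence/absence of protein domains), after which family assignment reduces to cheap vector comparisons instead of all-vs-all sequence alignment. The domain names and similarity threshold below are illustrative; BiG-SLiCE itself uses richer domain-based features and a clustering algorithm designed to scale to millions of BGCs.

```python
import math

def vectorize(bgc_domains, vocabulary):
    """Binary presence/absence vector over a fixed domain vocabulary."""
    return [1.0 if d in bgc_domains else 0.0 for d in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def assign_families(bgcs, vocabulary, threshold=0.7):
    """Greedy centroid assignment: each BGC joins the first family whose
    representative vector is similar enough, else it founds a new family."""
    families = []  # list of (representative_vector, [bgc_ids])
    for bgc_id, domains in bgcs.items():
        vec = vectorize(domains, vocabulary)
        for rep, members in families:
            if cosine(vec, rep) >= threshold:
                members.append(bgc_id)
                break
        else:
            families.append((vec, [bgc_id]))
    return [members for _, members in families]

# Hypothetical domain content: two related polyketide clusters and one peptide cluster
vocab = ["PKS_KS", "PKS_AT", "PKS_KR", "Condensation", "AMP-binding", "Thioesterase"]
bgcs = {
    "bgc1": {"PKS_KS", "PKS_AT", "PKS_KR", "Thioesterase"},
    "bgc2": {"PKS_KS", "PKS_AT", "PKS_KR"},
    "bgc3": {"Condensation", "AMP-binding", "Thioesterase"},
}
print(assign_families(bgcs, vocab))  # [['bgc1', 'bgc2'], ['bgc3']]
```

Because each new query BGC only needs to be compared against a small number of family representatives, classification stays fast even as the reference collection grows.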
As we look forward to the advances that may be enabled by the big data revolution, it is tempting to discuss them in opposition to the technologies that characterized our past. Indeed, it is common to hear the leading researchers in our field dismiss projects that rely on bioassay-guided fractionation out of hand, speaking in disparaging terms about ‘grind and find’ science. We contend that it is worthwhile to view the foundational skills of natural products discovery – isolation and purification – not in opposition to big data approaches, but as complementary to them. Chemists skilled at isolation and structure elucidation will always be a valuable part of the natural products research team, because all predictions need to be validated by ground truth. Any use of machine learning algorithms to utilize large datasets depends strongly on reliable training data on chemical structures, enzyme functions and biological activities, which, regardless of exciting technological developments, still must be produced the hard way. Moreover, big data driven approaches are by nature hypothesis generating, and studies with purified material are critical to validate predictions about structure and activity. It is humbling to note that while this themed issue is filled with numerous tantalizing vignettes about what is currently possible thanks to big data or what may be possible in the future, we cannot yet point to a single drug that has been discovered using exclusively big data approaches. Perhaps this is simply because the contributions of big data to the field of natural products drug discovery are too young to pan out in concrete ways. It is also possible that finding a clinically useful new drug is too high a bar to set. If that is indeed the case, the question remains, what should be the litmus test for a truly successful natural products research project? 
If the goal is not drug discovery, we may need to rethink the narrative we use to sell the value of our research endeavors.
The optimists among us believe that by engaging with big data we are developing the tools today that will enable discovery of the drugs of tomorrow. It may also be true that these tools do not live up to our hopes for drug discovery, but that, by adopting them, we are shifting the focus of our field in new (and perhaps even more exciting) directions. Some would say that this is already happening. Regardless, it is obvious that big data approaches have irrevocably altered the landscape for natural products researchers, and that we will continue to engage with big data in the future. We expect that such engagements will deepen our understanding of life on our planet, and we hope that the fruits of these labors will increasingly be shared equitably to improve the quality of life for those who inhabit it.
This journal is © The Royal Society of Chemistry 2021