Hosein Mohimani and Pavel A. Pevzner*
Department of Computer Science and Engineering, University of California, San Diego, USA. E-mail: ppevzner@ucsd.edu
First published on 26th October 2015
Covering: 2000 to 2015
While recent breakthroughs in the discovery of peptide antibiotics and other Peptidic Natural Products (PNPs) raise a challenge for developing new algorithms for their analyses, the computational technologies for high-throughput PNP discovery are still lacking. We discuss the computational bottlenecks in analyzing PNPs and review recent advances in genome mining, peptidogenomics, and spectral networks that are now enabling the discovery of new PNPs via mass spectrometry. We further describe the connections between these advances and the new generation of software tools for PNP dereplication, de novo sequencing, and identification.
While the first computational methods for analyzing mass spectra of small molecules were developed in the 1960s,5–8 three decades before their proteomics counterparts,9,10 computational mass spectrometry of small molecules is often viewed as a more complex (and less mature!) field as compared to computational proteomics.11,12 See ref. 12–16 for recent reviews of computational approaches to analyzing small molecules. Depending on their building blocks, natural products are classified into a variety of chemical classes that include Peptidic Natural Products (PNPs), the focus of this review. Starting from penicillin, PNPs have an unparalleled track record in pharmacology: many antibiotics, antiviral and antitumor agents, immunosuppressors, and toxins are PNPs.
While recent breakthroughs in PNP discovery2,17,18 raised a challenge for developing new algorithms for dereplication, de novo sequencing and identification of PNPs, computational technologies for high-throughput PNP discovery are still lacking. The traditional process of PNP discovery is to elucidate the structure of the compound by chemical assays (such as nuclear magnetic resonance) and to associate the chemical compound to its biosynthetic gene cluster by genome manipulations. This process is time-intensive, laborious, and requires large amounts of highly purified material. Moreover, rather than discovering novel PNPs, it often rediscovers known PNPs, resulting in wasted efforts.
Recently, mass spectrometry (MS) has become a cheap, fast, and reliable complementary approach to traditional PNP discovery techniques.19,20 However, compared to traditional applications of MS in proteomics, the application of MS for PNP discovery faces additional computational challenges due to a higher complexity of the compounds and unusual fragmentation patterns. Some of these challenges are now addressed through genome mining, peptidogenomics, and spectral networks:
• Genome mining. The sequencing of many bacterial and fungal genomes in the last decade opened an era of genome mining for PNP discovery. Genome mining refers to using information about the biosynthetic genes (responsible for synthesizing a PNP) to infer information about the PNP itself. The discovery of coelichelin in Streptomyces coelicolor was one of the first successes of genome mining21,22 that was followed by the characterization of many PNPs from sequenced genomes.
• Peptidogenomics. Given a mass spectrum and a peptide database, peptide identification refers to finding a peptide in the database (or its variant) that generated the given spectrum. While peptide databases in traditional proteomics consist of known peptides, peptide databases in peptidogenomics are often dominated by putative peptides derived via genome mining. Since many PNPs are not directly encoded in genomes, genome mining often fails to generate a database of putative PNPs that contains the exact amino acid sequence of a PNP corresponding to a given spectrum. Instead, it produces a database containing an error-prone template that makes matching spectra against such a template difficult. Therefore, popular proteomics tools such as Sequest9 and Mascot10 fail to identify PNPs. Also, identification of spectra derived from PNPs is more difficult than traditional peptide identification in proteomics because many PNPs are non-linear peptides with extensive modifications that generate complex spectra (the standard proteomics tools fail to identify non-linear peptides).
• Spectral networks. Bandeira et al.23 introduced the concept of spectral networks (also known as molecular networks18) that reveal spectra of related compounds (without knowing what these compounds are) using spectral alignment algorithms.24,25 Nodes in the spectral networks correspond to spectra while edges connect spectral pairs, i.e., pairs of spectra that are generated from related peptides (e.g., peptides differing by a single mutation or a modification). Spectral networks enable the discovery of novel variants of known PNPs as well as novel PNP families. Thus, since most PNPs form families of related peptides,18 spectral networks are ideally suited for analyzing PNPs.
PNPs are produced by two types of biosynthetic machineries: Non-Ribosomal Peptide synthetase (NRP synthetase)26,27 and Ribosomally synthesized and Posttranslationally modified Peptide synthetase (RiPP synthetase).28,29 NRP and RiPP synthetases produce Non-Ribosomal Peptides (NRPs) and Ribosomally synthesized and Posttranslationally modified Peptides (RiPPs), respectively. NRPs are widely distributed and biomedically important natural products that are not directly inscribed in genomes but instead are encoded by NRP synthetases using the non-ribosomal code.30 In addition to standard amino acids, NRPs often include non-proteinogenic amino acids such as ornithine. Known NRPs include hundreds of non-proteinogenic building blocks and some NRPs like kutznerides31 are built entirely from non-proteinogenic amino acids. Since the non-ribosomal code remains poorly understood, an accurate prediction of PNPs from NRP synthetases remains challenging.
While RiPPs are encoded in the genome, the genes encoding RiPPs are often short, making it difficult to annotate them (short genes often evade gene prediction methods32,33). Moreover, RiPPs often have many unusual Post-Translational Modifications (PTMs), making it difficult to identify them via MS. Heavily modified peptides with more than two blind modifications often evade identification algorithms such as InsPecT25 and MODa34 that are designed for the discovery of unexpected PTMs.
Analysis of over 1000 bacterial genomes from the Joint Genome Institute (JGI) database revealed that 71% of them harbor at least one RiPP protein family (Pfam) domain and 69% harbor at least one NRP synthetase Pfam domain.17 Recent analysis of 830 Actinobacteria genomes revealed that Actinobacteria encode thousands of potential drug leads.35 These and other studies18 suggest that we have seen only the tip of the iceberg with respect to PNP discovery and raise the challenge of developing new methods for PNP discovery.
Understanding how PNP biosynthetic machineries work is a prerequisite to genome mining and peptidogenomics, which involve two steps: predicting the candidate gene clusters responsible for the synthesis of a PNP and connecting them to their chemical products by MS. However, connecting biosynthetic gene clusters to their products is not a trivial task since the rules defining how a gene cluster specifies its products remain poorly understood. For example, the existing tools for predicting NRPs from NRP synthetases remain error-prone. The transition from a gene cluster to its product becomes particularly difficult in the case of modifications involved in the maturation of PNPs. For example, the gene cluster for coelichelin (NRP synthetase) was elucidated in 2000,21 but coelichelin itself (NRP) was sequenced only in 2005.22
Below we review recent advances in genome mining, peptidogenomics, and spectral networks (Section 2) and further describe PNP dereplication (Section 3), PNP sequencing (Section 4), and PNP identification (Section 5). We note that, as opposed to dereplication (that reveals known PNPs or their variants), PNP sequencing and identification may reveal previously unknown PNPs. Fig. 1 illustrates various approaches to PNP discovery.
Fig. 2 (a) Predicting NRPs based on NRP synthetase analysis using tools such as NRPSpredictor2 (ref. 39) and antiSMASH.40 The following domains are shown: A-domains (red), PCP-domains (green), C-domains (blue), methylation domains (yellow), and thioester domains (purple). Note that different modules of the same NRP synthetase can appear in different frames. (b) Extracting signature sequences (non-ribosomal code) from A-domains (only short segments of the A-domains are shown). Various A-domains have conserved residues (shown in red) that enable their accurate multiple alignment using profile HMMs. The non-ribosomal code postulates that certain amino acids in the resulting multiple alignment (shown in purple) define a single amino acid in the NRP loaded by this domain. The three A-domains shown here define 8 amino acid signatures LTKVGHIG, VGEIGSID, and WMFAAVL corresponding to the amino acids Asp, Orn, and Val, respectively. The 8 amino acid signatures shown here represent a simplified representation of the non-ribosomal code, e.g., NRPSpredictor2 uses longer signatures to predict amino acids for each A-domain.
NRP synthetases are formed by an array of distinct modular sections, each of which is responsible for the incorporation or modification of a single amino acid in the final NRP. A minimum of three domains is required for each NRP synthetase module: the adenylation domain (A-domain), the peptidyl carrier domain (PCP-domain), and the condensation domain (C-domain). The A-domain is responsible for picking the specific amino acid that will be incorporated into the NRP. Hundreds of different A-domain specificities have been classified, each one recruiting a specific amino acid. This allows us to determine the sequence of the putative NRP by looking at the order of A-domains along the assembly line and assigning a specific amino acid to each A-domain using the non-ribosomal code. However, since the non-ribosomal code is still poorly understood, the tools for defining specificities of A-domains remain error-prone. These tools often use profile Hidden Markov Models (HMMs) to align conserved amino acids within each A-domain (red amino acids in Fig. 2(b)) against previously analyzed A-domains. The constructed alignment reveals variable amino acids within A-domains (purple amino acids in Fig. 2(b)) that define the non-ribosomal code. The genome mining tools further use various machine learning techniques to derive the amino acid in the NRP defined by the non-ribosomal code.
RiPPs are classified into more than 20 classes (such as lantibiotics, thiopeptides, cyanobactins, lasso peptides, and many others) based on structural and biosynthetic commonality.49 Various software tools for RiPP genome mining have been reviewed in ref. 50. BAGEL, a genome mining tool for bacteriocins, revealed 150 putative lanthipeptide gene clusters.51,52 ThioFinder, a genome mining tool for thiopeptides, predicted 53 novel thiopeptide-producing gene clusters.53 Recent genome mining studies predicted 79 lasso peptides54 and 27 cyanobactin-producing Anabaena strains.55 The development of RiPP genome mining tools is tied to the construction of databases of known RiPPs, such as Bactibase, a database of 177 bacteriocins,56 or a bacteriocin database51,52 consisting of 483 bacteriocins (236 class I, 160 class II and 93 class III as of August 2015). Other examples include Thiobase, a database of 39 thiopeptides,53 and MIBiG, a natural product structure and biosynthetic gene cluster repository with over 169 RiPPs from different classes.42 Availability of these databases for diverse RiPP classes speeds up the development of novel machine learning techniques aimed at genome mining for RiPPs.50
AntiSMASH is one of the most popular genome mining tools for analyzing both NRPs and RiPPs as well as polyketides. The antiSMASH pipeline includes the following steps: (i) genes are extracted or predicted from the genome using Glimmer3,32 (ii) biosynthetic gene clusters are identified using profile HMMs, (iii) biosynthetic gene clusters are annotated, and (iv) the core chemical structures of natural products are predicted based on the annotated gene clusters. Optionally, comparative analysis of the biosynthetic gene clusters can be done using ClusterBlast.40
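The lookup step described above can be illustrated with a minimal sketch that assigns an amino acid to each A-domain signature by finding the closest known signature under Hamming distance. The two reference signatures are the 8-residue ones listed in the caption of Fig. 2. The function names, the nearest-neighbor rule, and the mismatch threshold are assumptions made for illustration only; tools such as NRPSpredictor2 use longer signatures and machine learning rather than this simple rule.

```python
from typing import Optional

# Signature -> amino acid pairs taken from the Fig. 2 caption (toy reference set).
KNOWN_SIGNATURES = {
    "LTKVGHIG": "Asp",
    "VGEIGSID": "Orn",
}


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("signatures must have equal length")
    return sum(x != y for x, y in zip(a, b))


def predict_residue(signature: str, max_mismatches: int = 2) -> Optional[str]:
    """Return the amino acid of the closest known signature.

    Returns None when even the closest signature differs in more than
    max_mismatches positions (a hypothetical cutoff, not a published one).
    """
    best = min(KNOWN_SIGNATURES, key=lambda known: hamming(signature, known))
    if hamming(signature, best) > max_mismatches:
        return None
    return KNOWN_SIGNATURES[best]


def predict_nrp(signatures: list[str]) -> list[Optional[str]]:
    """Predict one residue per A-domain, in assembly-line order."""
    return [predict_residue(s) for s in signatures]


if __name__ == "__main__":
    # The second signature carries one mismatch; the third matches nothing.
    print(predict_nrp(["LTKVGHIG", "VGEIGSIE", "AAAAAAAA"]))
```

Returning None for distant signatures mirrors the point made in the text: when the code is poorly understood for a given A-domain, it is safer to leave the position unresolved than to force a prediction.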
Kersten et al., 2011 (ref. 17) discovered many novel PNPs using a manual peptidogenomic approach for connecting PNPs to their biosynthetic genes and matching them against mass spectra. However, the manual peptidogenomic approach to PNP discovery, while useful,57 is somewhat limited in analyzing large spectral datasets (such as LC-MS/MS datasets from bacterial extracts) and complex patterns of modifications. Moreover, this approach relies on identifying long peptide sequence tags (4–5 amino acids) to reduce the search space.48 Such long tags are often not available for multicyclic peptides such as lanthipeptides or for NRPs with non-standard amino acids. Also, since the manual approach does not provide estimates of statistical significance (a pre-requisite for analyzing large spectral datasets) an automated peptidogenomics software tool is needed.
Peptidogenomics is based on the comparison of experimental spectra with the theoretical spectrum of a PNP. Various bond disconnection algorithms58–64 generate a list of bonds between atoms in a compound (excluding hydrogen) and assign them a breakage score based on the likelihood of each bond being disconnected. The theoretical spectrum is constructed from masses and breakage scores of all substructures resulting from bond disconnections. Tools such as MetFrag59 attempt to explain the peaks in the experimental spectrum using the likely substructures formed by disconnecting some bonds. Alternative machine learning approaches use large collections of MS/MS spectra for learning the rules governing the MS/MS fragmentation process.65–67 Other alternatives to bond disconnection algorithms have also been suggested.65–68
Theoretical spectra of PNPs are formed by disconnecting only amide bonds (rather than all bonds)69,70 (see Fig. 3). Since the number of fragmented substructures grows quadratically with the PNP length (under the assumption that at most two amide bonds are disconnected), theoretical spectra of PNPs contain a large number of masses. This makes them difficult to analyze, since only a fraction of these masses have counterparts in the experimental spectra. In spite of this complication, some studies used general metabolite dereplication tools to successfully dereplicate PNPs.35
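The quadratic growth mentioned above can be seen in a minimal sketch for a cyclic peptide, where cutting two amide bonds releases a contiguous arc of residues. The function name is hypothetical, and the sketch deliberately ignores ion types, charge states, and neutral losses, so it returns plain sums of residue masses rather than realistic fragment m/z values.

```python
def cyclic_theoretical_spectrum(residue_masses: list[float]) -> list[float]:
    """Masses of all contiguous arcs of a cyclic peptide.

    Each arc corresponds to disconnecting two amide bonds. For n residues
    there are n start positions and n - 1 arc lengths, i.e. n * (n - 1)
    fragments in total.
    """
    n = len(residue_masses)
    doubled = residue_masses + residue_masses  # unroll the cycle once
    masses = []
    for start in range(n):
        total = 0.0
        for length in range(1, n):
            total += doubled[start + length - 1]
            masses.append(total)
    return sorted(masses)


if __name__ == "__main__":
    for n in (5, 10, 20):
        # Unit masses are used only to count fragments.
        print(n, len(cyclic_theoretical_spectrum([1.0] * n)))
```

For a 10-residue cyclic peptide this already yields 90 theoretical masses, which makes it clear why only a fraction of them can be expected to appear in an experimental spectrum.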
A Peptide-Spectrum Match (PSM) is a pair of a peptide and a spectrum with the same precursor mass (up to an error δ). In the context of PNP discovery, a PSM score is often defined as the number of peaks shared between a theoretical spectrum and an experimental spectrum. Given a spectrum, a peptide that forms a PSM with the highest score against this spectrum (among all peptides in a peptide database) is reported as a potential annotation of the spectrum.
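The shared peak count and the database search described above can be sketched as follows. The function names, the tuple layout of the database, and the default tolerances are assumptions chosen for illustration; real tools use more elaborate scoring functions.

```python
from bisect import bisect_left


def shared_peak_count(theoretical, experimental, tol=0.05):
    """Count theoretical masses that have an experimental peak within tol."""
    peaks = sorted(experimental)
    count = 0
    for mass in theoretical:
        i = bisect_left(peaks, mass - tol)  # first peak >= mass - tol
        if i < len(peaks) and peaks[i] <= mass + tol:
            count += 1
    return count


def best_psm(spectrum_peaks, precursor_mass, database, delta=0.05, tol=0.05):
    """Return (peptide_name, score) of the highest-scoring PSM, or None.

    database is a list of (name, peptide_mass, theoretical_masses) tuples.
    Only peptides whose mass matches the precursor mass up to delta
    form a PSM with the spectrum.
    """
    best = None
    for name, mass, theoretical in database:
        if abs(mass - precursor_mass) > delta:
            continue
        score = shared_peak_count(theoretical, spectrum_peaks, tol)
        if best is None or score > best[1]:
            best = (name, score)
    return best


if __name__ == "__main__":
    # Hypothetical toy database and spectrum.
    db = [
        ("A", 500.0, [100.0, 200.0, 300.0]),
        ("B", 500.0, [100.0, 250.0, 350.0]),
        ("C", 600.0, [100.0, 200.0, 300.0]),  # wrong precursor mass
    ]
    print(best_psm([100.01, 200.0, 299.99, 410.0], 500.0, db))
```

The sketch reports only the top-scoring peptide; as the following paragraphs explain, a raw score of this kind says little about statistical significance.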
It is well known in the context of traditional proteomics that PSM scores often poorly correlate with statistical significance of PSMs such as p-values.71 This observation is greatly amplified for non-linear peptides since scoring PSMs formed by non-linear peptides is currently more primitive than scoring PSMs formed by linear peptides due to the lack of a large learning sample of PSMs formed by non-linear peptides.
To address this challenge, Ibrahim et al., 2013 (ref. 70) proposed additional statistical measures to distinguish between correct and erroneous PSMs formed by PNPs in their iSNAP approach. Mohimani et al.,72 developed the MS-DPR algorithm for computing p-values of PSMs formed by arbitrary PNPs. MS-DPR addresses the problem of deciding whether a given spectrum was generated by a linear, cyclic, or branch cyclic peptide since it enables evaluation of statistical significance of peptides with diverse structures72 (see Fig. 4).
Fig. 4 Deciding whether a peptide that produced a spectrum is linear, cyclic or branch cyclic. Given a spectrum, MS-DPR72 considers various structure assumptions for a peptide that generated the spectrum (e.g. linear, or cyclic, or branch cyclic), and derives a p-value of PSMs resulting from each such assumption. For each structure, MS-DPR explores many putative amino acid sequences (shown by different colors) to estimate the p-value. If one of the structures results in a small p-value (e.g. a linear structure with a p-value of 0.0001 shown in red), that structure is accepted as the most likely structure for a given spectrum. Note that even though the linear peptide in this example has the lowest score, it is the most statistically significant among the three structures. The figure is reproduced from ref. 72 by permission from ACS publications.
Given a set of peptides P1, … Pm, their peptide network is a graph with nodes P1, … Pm, and edges connecting two peptides if they differ by a single amino acid modification. Fig. 5(a) shows the peptide network for nine variants of tyrocidine, a family of NRPs from Bacillus brevis.73 For example, peptide 1 (tyrocidine B1) in this network (red node) is connected to four peptides differing from tyrocidine B1 by a single modification: tyrocidine A1 (peptide 2), tyrocidine B (peptide 5), tyrocidine C1 (peptide 8), and a previously unreported peptide with a mass of 1338.7 (peptide 9). However, it is not connected to peptides 3, 4, 6 and 7 since they differ from peptide 1 by multiple modifications. Six of these nine tyrocidines (1, 2, 3, 5, 7, 8) are contained in the database of putative NRPs generated by NRPSpredictor2 (without modifications) and three more differ from these variants by one or two modifications/mutations.
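The definition of a peptide network can be made concrete with a minimal sketch. It assumes that all peptides are given as equal-length residue tuples written in the same orientation, so that "differ by a single amino acid modification" reduces to a mismatch at exactly one position. The peptide names and sequences below are hypothetical toy data, not the tyrocidines of Fig. 5.

```python
from itertools import combinations


def differ_by_one(p, q):
    """True if two equal-length residue tuples differ at exactly one position."""
    return len(p) == len(q) and sum(a != b for a, b in zip(p, q)) == 1


def peptide_network(peptides):
    """Build the edge set of the peptide network.

    peptides maps a peptide name to a tuple of residue names.
    Each edge is returned as a frozenset of two peptide names.
    """
    return {
        frozenset((u, v))
        for u, v in combinations(peptides, 2)
        if differ_by_one(peptides[u], peptides[v])
    }


if __name__ == "__main__":
    toy = {
        "P1": ("Phe", "Pro", "Phe", "Val"),
        "P2": ("Phe", "Pro", "Trp", "Val"),  # one substitution from P1
        "P3": ("Trp", "Pro", "Trp", "Val"),  # one from P2, two from P1
    }
    for edge in sorted(sorted(e) for e in peptide_network(toy)):
        print(edge)
```

In this toy example P1 and P3 are not connected directly because they differ at two positions, which is the same reason peptide 1 is not connected to peptides 3, 4, 6 and 7 in the tyrocidine network.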
Fig. 5 The peptide network (a) and the spectral network (b) of tyrocidines.74 The numbers within nodes represent precursor masses. Edges in the peptide network connect two peptides if they differ by a single amino acid modification. Shared edges between peptide and spectral networks are shown by thick lines. For example, peptide 1 (tyrocidine B1) in this network (red node) is connected to four peptides differing from tyrocidine B1 by a single modification: tyrocidine A1 (peptide 2), tyrocidine B (peptide 5), tyrocidine C1 (peptide 8), and a previously unreported peptide with a mass of 1338.7 (peptide 9). However, it is not connected to peptides 3, 4, 6 and 7 since they differ from peptide 1 by multiple modifications. In part (c), annotation of each node in the spectral network is shown. The spectral network revealed two novel tyrocidine variants at masses 1294.7 and 1338.7. The figure is reproduced from ref. 74 by permission from ACS publications.
In reality, we are not given peptides P1, … Pm but only their spectra S1, … Sm. Nevertheless, one can approximate the peptide network by constructing the spectral network on nodes S1, … Sm where spectra Si and Sj are connected by an edge if they can be aligned against each other using a spectral alignment.23,25,75 Fig. 5 shows the peptide and spectral networks of nine tyrocidines and illustrates that the spectral network captures all edges of the peptide network. While the peptide and spectral networks in Fig. 5 are not identical, their shared edges usually allow one to interpret the peptides corresponding to the nodes of the spectral network using the spectral network dereplication algorithm.76 The algorithm starts from a node with a known annotation in the spectral network, and propagates annotations from known to unknown peptides through the edges of the network.
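The propagation step can be sketched as a breadth-first traversal of the spectral network. This is a simplified illustration, not the algorithm of ref. 76: it only labels each reachable spectrum with the name of the known peptide it was reached from and the accumulated precursor mass shift, whereas the actual method also has to place the modification on the peptide. All names and masses below are hypothetical.

```python
from collections import deque


def propagate_annotations(masses, edges, known):
    """Propagate annotations from known to unknown nodes.

    masses: node -> precursor mass
    edges:  iterable of (u, v) spectral pairs
    known:  node -> peptide name for already annotated spectra
    Returns node -> (source peptide name, mass shift relative to the source).
    Nodes that are not connected to any known node stay unannotated.
    """
    adjacency = {node: [] for node in masses}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    annotation = {node: (name, 0.0) for node, name in known.items()}
    queue = deque(known)
    while queue:
        u = queue.popleft()
        name, shift = annotation[u]
        for v in adjacency[u]:
            if v not in annotation:
                # Accumulate the precursor mass difference along the path.
                annotation[v] = (name, round(shift + masses[v] - masses[u], 4))
                queue.append(v)
    return annotation


if __name__ == "__main__":
    masses = {"a": 1000.0, "b": 1014.0, "c": 1030.0, "d": 500.0}
    edges = [("a", "b"), ("b", "c")]
    print(propagate_annotations(masses, edges, {"a": "known peptide"}))
```

Node "d" in the toy example has no edges, so it receives no annotation; this corresponds to a connected component that contains no known compound and therefore cannot be dereplicated.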
Ng et al.,69 proposed the first method for the dereplication of cyclic PNPs. Ibrahim et al.,70 proposed an alternative dereplication approach, iSNAP, that is not limited to cyclic NRPs but extends to branch cyclic and linear peptides.
iSNAP analyzes each spectrum using the following steps: (i) identify all amide bonds for each NRP in the chemical database, (ii) generate a theoretical spectrum for each NRP by cleaving at most two amide bonds at a time, (iii) generate PSMs formed by the experimental spectrum and all NRPs in the database whose mass matches the precursor mass of the spectrum, and (iv) score resulting PSMs, estimate their statistical significance, and report statistically significant PSMs.
While dereplication via a spectral library search is more accurate than dereplication via a search in a chemical database, the spectral libraries still contain only a fraction of PNPs present in chemical databases, e.g., as of August 2015, only 81 out of 1607 annotated spectra in GNPS Molecular Networking dataset4 represented PNPs. Therefore, the application of spectral libraries for PNP dereplication remains limited. For example, Milman and Zhurkovich96 described the dereplication of toxic NRPs based on a small spectral library consisting of only 263 spectra.
Watrous et al.18 constructed spectral networks of various bacterial extracts and dereplicated many PNPs using a manual analysis of connected components in these networks. Various studies reported success in utilizing spectral networks for the discovery of natural products.4,97–108
For example, Mohimani et al.97 discovered a lanthipeptide informatipeptin, a doubly charged ion with m/z 1065.5, using the RiPPquest algorithm. This PNP belonged to a connected component of the spectral network and was connected with three doubly charged ions with m/z of 929.2, 957.5, and 1015.1. Comparing the mass shifts between these ions and informatipeptin provided a hint that these peptides are N-terminal derivatives of informatipeptin. While the three resulting PSMs had borderline statistical significance and RiPPquest did not report them as significant discoveries, the fact that they clustered with informatipeptin in the spectral network provided evidence that they are indeed N-terminal derivatives of informatipeptin (Fig. 6).
Fig. 6 Spectral network analysis leads to variable dereplication of RiPP informatipeptin (shown in orange) into 3 variant PNPs.97 Ser → Dha and Thr → Dhb conversions in this lanthipeptide are shown in green and red, respectively. The figure is reproduced from ref. 97 by permission from ACS publications.
Allmer et al.111 recently reviewed various approaches to de novo sequencing of linear peptides. However, while dozens of tools for de novo sequencing of linear peptides have been proposed,112–115 techniques for de novo sequencing of non-linear peptides are still in the early stages of development. Ng et al.69 proposed the first algorithm for the sequencing of cyclic peptides that however works only for very well-fragmented spectra. Novak et al.116 recently developed CycloBranch that takes advantage of high resolution mass spectrometry to improve the accuracy of de novo sequencing of cyclic, branched, and branch cyclic peptides.
Mohimani et al.76 developed a multiplex de novo peptide sequencing algorithm for the case when spectra of multiple related peptides are available. Multiplex peptide sequencing starts from constructing the spectral network and identifying clusters of related compounds (connected components in the spectral network). It further attempts to sequence all compounds in each connected component (see Fig. 7). As opposed to PNP dereplication via spectral networks (when at least one spectrum in the connected component represents a known compound), de novo PNP sequencing works even when all nodes in the connected component represent unknown compounds. The advantage of spectral networks for PNP sequencing is that finding PNPs that simultaneously explain all spectra in a connected component of a spectral network results in a more accurate approach than sequencing each individual spectrum. When tandem mass spectrometry (MS2) fails to sequence a PNP, one can attempt multistage (MSn) mass spectrometry and apply the multistage de novo peptide sequencing approach109,117 (see Fig. 8).
Fig. 7 The spectral network dereplication algorithm from ref. 76 attempts to de novo sequence all spectra in a spectral network in a coordinated fashion. It starts from a putative interpretation of one of the spectra (bottom left node) and propagates this interpretation to other nodes using red edges. The propagation typically fails if the initial putative interpretation is incorrect and succeeds if it is correct. Thus, the propagation process allows one to reject the incorrect initial interpretations. The spectral network dereplication algorithm generates many putative interpretations of the spectrum and propagates them through the spectral network in an attempt to decide which one is correct.
Fig. 8 Illustration of an algorithm for peptide sequencing by multistage mass spectrometry. For each candidate peptide, the multistage peptide sequencing algorithm scores how well each subpeptide is explained by the MSn data.109 The figure is reproduced from ref. 109 by permission from Wiley publications.
Even when the core RiPP sequence is known and the types of PTMs in a RiPP can be predicted, multiple possible PTM sites typically result in thousands of structures that are difficult to analyze. Due to this complication, computational approaches for RiPP identification could not keep pace with the rapid progress in RiPP discovery in recent years. Cycloquest,119 a tool for RiPP identification, is limited to cyclic peptides with very few modifications. Also, since Cycloquest does not take advantage of genome mining, it is unable to identify poorly fragmented peptides (e.g., lanthipeptides).
Genome mining is crucial for the success of RiPP identification efforts. The statistical significance (E-values) of the found PSMs deteriorates with an increase in the size of the protein database. Thus, one way to make PSMs formed by poorly fragmented spectra statistically significant is to reduce the effective size of the protein database. Fortunately, most RiPPs appear in small windows of ≈20000 nucleotides around biosynthetic gene clusters, and these clusters can be identified by searching for conserved biosynthetic enzymes. Thus, limiting the search space to this small region of the genome has the potential to reduce the E-values of found PSMs by orders of magnitude thus separating them from false PSMs.
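The effect of restricting the search space can be illustrated with a short calculation. It uses the common approximation that the E-value of a PSM is its p-value multiplied by the number of candidates searched, and it assumes that the number of candidate peptides is proportional to the number of nucleotides searched. The genome length and the p-value are hypothetical values chosen for illustration; only the 20000-nucleotide window comes from the text.

```python
def e_value(p_value: float, num_candidates: int) -> float:
    """Expected number of random candidates scoring at least as well as the PSM."""
    return p_value * num_candidates


GENOME_NT = 8_000_000  # hypothetical genome length (assumption)
WINDOW_NT = 20_000     # window around a biosynthetic gene cluster (from the text)
P_VALUE = 1e-6         # hypothetical p-value of a PSM (assumption)

if __name__ == "__main__":
    whole_genome = e_value(P_VALUE, GENOME_NT)
    window_only = e_value(P_VALUE, WINDOW_NT)
    print(f"whole genome: {whole_genome:.3g}")
    print(f"window only:  {window_only:.3g}")
    print(f"fold reduction: {whole_genome / window_only:.0f}")
```

Under these assumptions the same PSM goes from an E-value far above any usual significance threshold to one that is 400 times smaller, which is the kind of gain the windowed search aims for.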
RiPPquest97 is a RiPP database search tool that addresses these complications and uses a more involved pipeline than peptide identification tools in traditional proteomics (compare Fig. 9(a) with 9(b)). While RiPPquest is currently limited to lanthipeptide analysis, it can be extended to other RiPP classes as soon as (i) it implements a genome mining rationale for a specific RiPP class, and (ii) it implements a biosynthetic rationale for transforming core peptides into mature peptides for a specific RiPP class.
Fig. 9 (a) Standard proteomic database search tools (e.g., Sequest9) are based on digesting proteins by an enzyme, and collecting tandem spectra of the resulting peptides. Each spectrum is matched against theoretical spectra of all peptides in a protein database and PSMs with the highest scores/lowest p-values are reported. (b) RiPPquest97 includes the following steps: (i) identifying RiPP synthetases in the genome, (ii) extracting candidate ORFs in a window around the gene cluster, (iii) adding proper modifications, (iv) matching spectra against the database of putative RiPPs and computing p-values of resulting PSMs, and (v) refining and enlarging the set of identified RiPPs using spectral networks. The figure is reproduced from ref. 97 by permission from ACS publications.
Zhang et al.120 recently developed the Hypothetical Structure Enumeration and Evaluation (HSEE) algorithm for RiPP identification and applied it for the identification of the lanthipeptide prochlorosin. HSEE is based on matching spectra against a collection of hypothetical structures predicted based on the biosynthetic gene cluster. HSEE generates a theoretical spectrum for each hypothetical structure and scores structures based on the shared peak count between the theoretical and experimental spectrum. The structure with the highest score is reported as a putative interpretation of an experimental spectrum.
We illustrate the PNP identification pipeline using RiPPquest97 that includes the following steps: (i) identifying RiPP synthetases in the genome, (ii) extracting candidate open reading frames (ORFs) in a window around the gene cluster, (iii) adding proper modifications, (iv) matching spectra against the database of putative RiPPs and computing p-values of resulting PSMs, and (v) refining and enlarging the set of identified RiPPs using spectral networks (Fig. 9(b)). Below is a brief description of the RiPPquest pipeline:
RiPPquest uses genome mining tools such as BAGEL,51,52 ThioFinder53, and antiSMASH40–42,121 for the identification of RiPP gene clusters. Limiting the search to small windows centered at RiPP gene clusters decreases the search space by two orders of magnitude as compared to the entire Streptomyces genome. Candidate core peptides are extracted from short ORFs found in these windows, and transformed to mature peptides according to the biosynthetic enzymes in the gene cluster. In the case of lanthipeptides, the most essential modifications are the dehydration of serine and threonine, and the formation of lanthionine and methyl-lanthionine bridges.49
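The transformation from a core peptide to candidate mature peptides can be sketched for the dehydration step. Each dehydration of serine or threonine removes one water molecule, so a core peptide with k Ser/Thr residues gives k + 1 candidate masses. The function names and the example core sequence are hypothetical; the sketch enumerates only the number of dehydrations, not their positions, and it does not model lanthionine bridges or any other modification.

```python
WATER = 18.010565  # monoisotopic mass of H2O

# Monoisotopic residue masses of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def peptide_mass(sequence: str) -> float:
    """Neutral monoisotopic mass of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def dehydration_variants(core: str) -> dict[int, float]:
    """Map the number of dehydrated Ser/Thr residues to the resulting mass."""
    sites = sum(core.count(aa) for aa in "ST")
    base = peptide_mass(core)
    return {k: base - k * WATER for k in range(sites + 1)}


if __name__ == "__main__":
    # Hypothetical core peptide used only to show the enumeration.
    for k, mass in dehydration_variants("GSTACS").items():
        print(k, round(mass, 4))
```

Enumerating the positions of the dehydrations as well would multiply the number of candidate structures, which is one reason the number of putative mature peptides grows so quickly.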
Most classes of RiPPs form families of related peptides, making spectral networks helpful in RiPP analysis.18 In particular, spectral networks revealed related lanthipeptides with stepwise N-terminal leader processing and different dehydration numbers97 (see Fig. 6).
NRPquest74 uses a genome sequence and a spectral dataset as an input and includes the following steps: (i) identifying NRP synthetases in the genome, (ii) using the non-ribosomal code to construct a database of putative NRPs generated by each NRP synthetase, (iii) matching spectra against the database of putative NRPs and computing p-values of resulting PSMs, and (iv) refining and enlarging the set of identified NRPs using spectral networks (Fig. 10).
Fig. 10 NRPquest74 pipeline includes the following steps: (i) identifying NRP synthetases in the genome, (ii) using the non-ribosomal code to construct a database of putative NRPs generated by each NRP synthetase, (iii) matching spectra against the database of putative NRPs and computing p-values of resulting PSMs, and (iv) refining and enlarging the set of identified NRPs using spectral networks. The figure is reproduced from ref. 74 by permission from ACS publications.
NRPquest uses NRPSpredictor2 (ref. 39) and antiSMASH40–42 to identify NRP synthetases in the genome, and to predict the set of all possible amino acids generated by each A-domain. NRPquest further searches the genome for a methylation domain, epimerization domain, and side chain bond formation domain, and accounts for the corresponding modifications in the database of putative NRPs. Each spectrum is matched against each putative peptide in the database of putative NRPs using a brute force algorithm that allows for up to two blind modifications. These blind modifications account for potentially inaccurate predictions of specificities of A-domains that are particularly common for non-proteinogenic amino acids and modifications. Similar to RiPPquest, NRPquest scores the PSMs using the scoring function from ref. 114 and computes p-values using the MS-DPR algorithm.72
NRPquest constructs a spectral network23 to refine and enlarge the set of identified PSMs. After constructing the spectral network, its connected components are extracted and the spectral network dereplication algorithm from Mohimani et al., 2011 (ref. 76), is used for the identification of all peptides represented by spectra forming this connected component. Moreover, the peptide propagation through the spectral network23,69 allows one to identify peptides with many modifications that NRPquest missed during blind searches of individual spectra (since blind searches with more than two modifications are prohibitively time-consuming).
We described three approaches to PNP discovery with their own merits and limitations: PNP dereplication, PNP sequencing and PNP identification. For example, while PNP dereplication requires a chemical structure database to be available and can only identify known compounds and their variants, PNP identification requires the genome sequence to be available and can identify new compounds. The fragmentation quality of spectra required for the success of these approaches is vastly different, as they perform searches in vastly different computational spaces.
While the search space for PNP dereplication is usually small, the search space for PNP sequencing is very large since it includes all peptides with a given mass. PNP identification, for both NRPs and RiPPs, has a search space that typically includes under a million putative peptides, standing in between PNP dereplication and PNP sequencing with respect to the search space. Thus, while PNP sequencing can succeed only with extremely high-quality spectra, PNP identification can succeed with medium-quality spectra, and PNP dereplication can succeed even with poorly fragmented spectra. Since PNP sequencing using a single tandem mass spectrum rarely succeeds, researchers have tried to utilize information from multiple spectra/multistage MS to overcome this limitation.23,76,109,117,122
This journal is © The Royal Society of Chemistry 2016