Qi Wu (ab), Yichu Shan (a), Yanyan Qu (ab), Hao Jiang (ab), Huiming Yuan (a), Jianxi Liu (ab), Shen Zhang (ab), Zhen Liang (a), Lihua Zhang* (a) and Yukui Zhang (a)
a National Chromatographic Research and Analysis Center, Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, China. E-mail: lihuazhang@dicp.ac.cn; Fax: +86 411 8437 9720; Tel: +86 411 8437 9720
b University of Chinese Academy of Sciences, Beijing, China
First published on 17th October 2013
Proteome-scale absolute quantification is fundamental for the quantitative understanding of an organism. The unsatisfactory accuracy of protein abundance estimation by current algorithms has been partially improved by the Absolute Protein EXpression profiling (APEX) algorithm, which incorporates the prior expectations of peptides' appearances into the calculation of protein abundances. However, the abundance feature (AF) in APEX is the spectral count (SC), an AF that suffers from a narrow dynamic range and, thus, unsatisfactory accuracy. Therefore, we adopted another tandem mass spectrometric (MS/MS) level AF called Summed MS/MS Total ion current (SMT), which cumulates the MS/MS fragment intensities rather than simply counting the MS/MS spectra, to overcome this deficiency. In a series of tests on the Universal Proteomics Standard sample with a dynamic range of 5 orders of magnitude (UPS2), the combination of APEX and SMT (abbreviated as APEX-SMT) improved the accuracy of absolute quantification, reducing the average relative deviation by ∼55–85% compared to that of APEX. The algorithm can also be used for relative quantification. When applied to the relative quantification of a publicly available benchmark dataset, APEX-SMT provided accuracy comparable to that of APEX. All these results suggest that APEX-SMT is a promising alternative to APEX for proteome quantification.
Although mass spectrometry (MS)-based absolute quantification of proteins has been achieved by spiking in known amounts of isotopically labeled peptides or proteins (e.g., Absolute Quantification of proteins, AQUA;3 Selected Reaction Monitoring or Multiple Reaction Monitoring assay, SRM/MRM;4 Protein Standard Absolute Quantification, PSAQ5), proteome-scale absolute quantification remains the preserve of label-free methods, whose development mainly focuses on new algorithms that meet the need for accurate and reproducible quantification. Many abundance features (AFs) in MS detection have been successfully utilized for the estimation of absolute quantities of all identified proteins in a sample,6 such as the precursor intensity (top 3 most abundant peptides,7 intensity-Based Absolute Quantification, iBAQ8), peptide coverage (Exponentially Modified Protein Abundance Index, emPAI9), tandem mass spectrometric (MS/MS) spectral count (SC) (Normalized Spectral Abundance Factor, NSAF10) and the more recently accepted MS/MS fragment intensity (Spectral Index, SI;11 Summed MS/MS Total ion current, SMT12–14).
It is worth noting that the detectabilities and intensities of peptides on a given liquid chromatography (LC)-MS platform can vary dramatically owing to the different physicochemical properties of peptides. In other words, different proteins of equal amount can yield very different MS AFs.15 Taking this phenomenon into consideration might improve the quantification accuracy. Lu et al. applied machine learning methods to reduce this discrepancy and developed an SC-based algorithm, named “Absolute Protein EXpression profiling” (APEX),16 which correlates well with other non-MS measurements, such as western blotting, two-dimensional gels, flow cytometry, and mRNA abundance validation, for both yeast and Escherichia coli proteomes. Unlike the information needed for precursor-based quantification, the information for MS/MS-based quantification is readily available. Therefore, algorithms based on SC have been widely adopted in the label-free quantification community.17 However, they suffer from narrow dynamic ranges, which can seriously jeopardize the quantification accuracy.14
Recent studies,12–14,18 including our previous work, have found that summations of MS/MS fragment intensities are promising alternatives for broadening the quantitative dynamic range. Two major forms have been developed, abbreviated as SMT12–14 and SI.11 The former cumulates all fragment intensities of all MS/MS spectra assigned to a protein, while the latter cumulates only the intensities of matched fragments in all MS/MS spectra assigned to a protein. To the best of our knowledge, the combination of APEX and SMT (or SI) has not been reported yet. This combination has the potential to broaden the dynamic range of absolute quantification while preserving the merit of machine learning algorithms.
Herein, SC is replaced with SMT and SI in the APEX formula, resulting in two modified versions of the APEX algorithm, named APEX-SMT and APEX-SI. Through a series of tests using the Universal Proteomics Standard sample with a known dynamic range (UPS2), we find that, compared to APEX and APEX-SI, APEX-SMT yields a wider dynamic range and better accuracy. At the same time, the benefit of machine learning methods is preserved. Moreover, we find that any existing high-quality dataset acquired on a given instrumental setup can be utilized for training. Furthermore, by testing the algorithms on a publicly available benchmark dataset,19 we find that APEX-SMT offers accuracy comparable to that of APEX in relative quantification as well.
(1) |
The calculations of SC, SI and SMT are shown in formulae (2)–(4). They are similar to our previous strategy,12 except that the SC, SI and SMT of shared peptides are divided proportionally among the corresponding proteins according to the weight value reported by ProteinProphet, to avoid counting shared peptides multiple times.20
(2) |
(3) |
(4) |
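A minimal sketch of how the three abundance features might be accumulated per protein, following only the verbal definitions given in the text: SC counts spectra, SMT sums all fragment intensities of a spectrum, SI sums only the matched fragment intensities, and each contribution is scaled by the ProteinProphet weight of the peptide for that protein. The `Spectrum` layout and all names are hypothetical; this is not the authors' Java implementation, and it assumes that the weights of a shared peptide sum to 1.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Spectrum:
    """One identified MS/MS spectrum (hypothetical layout for illustration)."""
    total_ion_current: float   # sum of all fragment intensities in the spectrum
    matched_intensity: float   # sum of intensities of fragments matched to the peptide
    weights: dict              # protein id -> ProteinProphet weight of the peptide


def abundance_features(spectra):
    """Return {protein: {"SC": ..., "SI": ..., "SMT": ...}}.

    A shared peptide contributes to each protein in proportion to its weight,
    so one spectrum is never counted in full for more than one protein.
    """
    features = defaultdict(lambda: {"SC": 0.0, "SI": 0.0, "SMT": 0.0})
    for spectrum in spectra:
        for protein, weight in spectrum.weights.items():
            features[protein]["SC"] += weight
            features[protein]["SI"] += weight * spectrum.matched_intensity
            features[protein]["SMT"] += weight * spectrum.total_ion_current
    return dict(features)


# Toy input: one unique spectrum for P1 and one spectrum shared equally by P1 and P2.
spectra = [
    Spectrum(1000.0, 600.0, {"P1": 1.0}),
    Spectrum(500.0, 200.0, {"P1": 0.5, "P2": 0.5}),
]
features = abundance_features(spectra)
```

In this toy input the shared spectrum adds half a count and half of its intensity to each protein, which is the behaviour the weighting is meant to achieve.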
Replacing SC in the APEX formula with SI or SMT generates two modified versions of APEX, named APEX-SI and APEX-SMT, as defined in formulae (5) and (6):
(5) |
(6) |
Furthermore, training-free algorithms (normalized AF, where AF refers to SC, SI or SMT) were used for comparison. They take the form of formula (7):
(7) |
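The formulae themselves are not reproduced in this text, so the following is only a sketch under stated assumptions. It assumes the general form of the original APEX algorithm (ref. 16), in which each protein's abundance feature is multiplied by its identification probability, divided by the expected number of observed peptides O_i predicted by the trained classifier, normalized over all proteins and scaled by a constant C. The training-free variant is assumed to be the same normalization without the probability and O_i terms. All function and variable names are illustrative.

```python
def apex_af(af, prob, expected, total=1.0):
    """APEX-style abundance with a generic abundance feature (SC, SI or SMT).

    af       : protein -> abundance feature value
    prob     : protein -> identification probability (assumed term)
    expected : protein -> O_i, expected number of observed peptides
               predicted by the trained classifier (assumed term)
    total    : scaling constant C, e.g. the total amount in the sample
    """
    corrected = {k: prob[k] * af[k] / expected[k] for k in af}
    norm = sum(corrected.values())
    return {k: total * v / norm for k, v in corrected.items()}


def normalized_af(af, total=1.0):
    """Training-free comparison: the same normalization without the correction."""
    norm = sum(af.values())
    return {k: total * v / norm for k, v in af.items()}


# Two proteins of equal amount; P1 is expected to yield three times more peptides.
smt = {"P1": 300.0, "P2": 100.0}
prob = {"P1": 1.0, "P2": 1.0}
expected = {"P1": 6.0, "P2": 2.0}

trained = apex_af(smt, prob, expected)
untrained = normalized_af(smt)
```

With these toy numbers the corrected estimate assigns equal shares to both proteins, whereas the training-free estimate assigns three quarters to P1. This is the discrepancy between proteins of equal amount that the machine learning step is meant to reduce.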
Fig. 1 Pipeline of APEX, APEX-SI and APEX-SMT implementation. Four primary processes are depicted in rounded rectangles, while files and software are depicted in rectangles. The files generated during the processes are shown in italics, and the software required is shown in the rectangle to the right of the corresponding process. All processes except database search and post-processing are wrapped in a Java code package consisting of some source code from ref. 21 and some written in-house.
RAW files, whether generated in our own lab or downloaded from Proteome Commons, were first converted to the mzML format using msconvert.exe, then to the mgf format using mzxml2search.exe with default parameters, except that the mass precision was set to 5 and the intensity precision to 1. Msconvert.exe and mzxml2search.exe are both components of TPP. The database searches were performed with an in-house installation of Mascot25 (version 2.3.02). All databases used are in the target-reverse form; the reversed sequences are included for FDR calculations.
The databases used are as follows: the yeast sequences in the Swissprot repository (downloaded from http://www.uniprot.org on August 31st, 2012), concatenated with their reversed forms (7786 target sequences and 7786 reverse sequences, named as SP_yeast_20120831_plus_RE.fasta), are used for searching yeast data in the AQE dataset and samples A and B in the Orbitrap@O86 dataset; the rat sequences in the International Protein Index (IPI) database (version 3.87), concatenated with their reversed forms (39931 target sequences and 39931 reverse sequences, named as ipi.RAT.v3.87_plus_RE.fasta), are used for searching rat data in the AQE dataset; the 48 UPS2 sequences, concatenated with their own reversed forms and the reversed forms of yeast sequences (48 target sequences and 7834 reverse sequences, named as UPS2_plus_RE_plus_yeastREonly.fasta), are used for searching UPS2 data in the AQE dataset; the 48 UPS1 sequences (notably, UPS1 and UPS2 have 46 identical sequences, while each has 2 unique sequences), concatenated with the yeast sequences and the reversed forms of both UPS1 and yeast sequences (7834 target sequences and 7834 reverse sequences, named as UPS1_plus_yeast_20120831_plus_RE.fasta), are used for searching samples D and E in the Orbitrap@O86 dataset.
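A minimal sketch of the target-reverse construction described above: every target sequence is followed by its reversed form, and the reversed hits are later used to estimate the FDR. The `REV_` prefix, the example sequences and the simple decoy/target ratio are assumptions for illustration; the text does not state which FDR formula was used.

```python
def add_reversed(entries, prefix="REV_"):
    """Append one reversed (decoy) sequence per target sequence."""
    decoys = [(prefix + name, seq[::-1]) for name, seq in entries]
    return entries + decoys


def estimate_fdr(hit_names, prefix="REV_"):
    """Decoy hits divided by target hits among the accepted identifications."""
    decoy = sum(1 for name in hit_names if name.startswith(prefix))
    target = len(hit_names) - decoy
    return decoy / target if target else 0.0


# Hypothetical two-protein database and a hypothetical list of accepted hits.
database = add_reversed([("YEAST_A", "MKTAYIAK"), ("YEAST_B", "MSLQR")])
fdr = estimate_fdr(["YEAST_A", "YEAST_B", "YEAST_A", "REV_YEAST_B"])
```

The same construction yields the sequence counts quoted in the text, for example 7786 target plus 7786 reverse sequences for the yeast database.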
The database search parameters depend on the sample preparation procedures and the resolution of the mass spectrometer in use. The search parameters for LTQ XL data are as follows: precursor and fragment mass tolerances are 2 Da and 1 Da, respectively; only 2+, 3+ and 4+ tryptic peptides with at most two missed cleavage sites are considered; carbamidomethylation of cysteine is set as a fixed modification, while oxidation of methionine is set as a variable modification. The search parameters for Orbitrap Velos data are the same as those for LTQ XL, except for the following: precursor and fragment mass tolerances are 10 ppm and 0.5 Da, respectively; observed precursor masses are permitted to differ from the peptide monoisotopic mass by 1 or 2 Da in case a non-monoisotopic peak was chosen for fragmentation.
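The two precursor tolerance regimes described above can be illustrated with a small check. This is a sketch, not the search engine's actual logic: the function name is hypothetical, and the isotope spacing of 1.00335 Da is an assumed value standing in for the "1 or 2 Da" offset mentioned in the text.

```python
ISOTOPE_SPACING = 1.00335  # assumed spacing between isotope peaks, in Da


def within_tolerance(observed, theoretical, tol, unit="ppm", max_isotope_error=0):
    """Check whether an observed precursor mass matches a theoretical mass.

    unit              : "ppm" (relative) or "Da" (absolute)
    max_isotope_error : number of isotope offsets allowed (0, 1 or 2)
    """
    limit = tol * theoretical / 1e6 if unit == "ppm" else tol
    for k in range(max_isotope_error + 1):
        delta = observed - k * ISOTOPE_SPACING - theoretical
        if abs(delta) <= limit:
            return True
    return False


# LTQ XL style: a wide absolute window.
low_res_match = within_tolerance(1001.5, 1000.0, 2, unit="Da")
# Orbitrap Velos style: a narrow ppm window, with and without isotope offsets.
strict_match = within_tolerance(1001.0, 1000.0, 10, unit="ppm")
isotope_match = within_tolerance(1001.0, 1000.0, 10, unit="ppm", max_isotope_error=2)
```

The example shows why the isotope allowance matters at high resolution: a precursor picked one isotope peak above the monoisotopic peak fails a plain 10 ppm window but passes once the offset is permitted.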
Dataset | | AQE dataset | AQE dataset | AQE dataset | AQE dataset | Orbitrap@O86 dataset | Empirical thresholds or (previously reported values) from ref. 16 and 26 |
---|---|---|---|---|---|---|---|
Instrument | | LTQ XL | LTQ XL | Orbitrap Velos | Orbitrap Velos | Orbitrap XL | |
Data source for training | | Yeast | Rat | Yeast | Rat | Yeast | |
Parameters used for training | Protein probability | Equals 1 | Equals 1 | Equals 1 | Equals 1 | Equals 1 | Equals 1 |
Parameters used for training | SC threshold | ≥50 | ≥50 | ≥70 | ≥70 | ≥30 | Depends on the system setup |
Parameters used for training | Number of proteins entering the training process | 105 | 64 | 105 | 92 | 74 | ∼30–150 (89) |
Parameters used for training | Fraction of observed peptides | 0.13 | 0.08 | 0.13 | 0.10 | 0.06 | Bigger the better (0.09) |
Parameters used for training | Machine learning algorithm in use | A cost-sensitive classifier based upon bagging with a random forest | A cost-sensitive classifier based upon bagging with a random forest | A cost-sensitive classifier based upon bagging with a random forest | A cost-sensitive classifier based upon bagging with a random forest | A cost-sensitive classifier based upon bagging with a random forest | A cost-sensitive classifier based upon bagging with a random forest |
Estimators generated during the 10-fold cross-validation | Percentage of correctly classified instances | 92.3% | 94.3% | 91.5% | 92.4% | 95.7% | Bigger the better (86%)a |
Estimators generated during the 10-fold cross-validation | F-measure of the observed peptides | 0.713 | 0.672 | 0.691 | 0.654 | 0.712 | >0.5, bigger the better (0.61) |
Estimators generated during the 10-fold cross-validation | Recall of the observed peptides | 0.758 | 0.748 | 0.731 | 0.731 | 0.840 | Bigger the better (0.63) |
Classifier settings used for the five training sets: weka.classifiers.meta.CostSensitiveClassifier -cost-matrix “[cost matrix]” -S 1 -W weka.classifiers.meta.Bagging -- -P 100 -S 1 -I 10 -W weka.classifiers.trees.RandomForest -- -I 10 -K 5 -S 1
a This value is from ref. 16, since it was not reported in ref. 26; all other thresholds or reported values are from ref. 26.
Algorithm | LTQ XL (22 proteins): training by yeast | LTQ XL (22 proteins): training by rat | LTQ XL (22 proteins): normalized AF | Orbitrap Velos (30 proteins): training by yeast | Orbitrap Velos (30 proteins): training by rat | Orbitrap Velos (30 proteins): normalized AF |
---|---|---|---|---|---|---|
APEX (normalized SC) | 18.5 | 18.6 | 18.7 | 24.8 | 24.8 | 25.1 |
APEX-SI (normalized SI) | 42.8 | 42.3 | 46.3 | 51.4 | 51.9 | 59.0 |
APEX-SMT (normalized SMT) | 25.3 | 25.3 | 25.9 | 30.3 | 30.3 | 30.1 |
(8) |
(9) |
Fig. 2 Relationships between injected amounts and calculated quantities for the AQE dataset. (A–C), LTQ XL, yeast-training; (D–F), LTQ XL, rat-training; (G–I), LTQ XL, training free; (J–L), Orbitrap Velos, yeast-training; (M–O), Orbitrap Velos, rat-training; (P–R), Orbitrap Velos, training free. After logarithmic transformation, the average of the final calculated quantities of the proteins at each molar level is plotted as a black dot, and the standard deviation (s.d.) is plotted as an error bar. In some cases, the s.d. is larger than the average value, so that log10(average − s.d.) cannot be calculated; in these cases the lower limit of the error bar is extended indefinitely in the negative y direction. VAR and UNI, as described in formulae (8) and (9), together with the linear fitting equation, are shown in the lower right corner of each panel. VAR represents the total deviation regardless of underestimation or overestimation, while UNI represents the uniformity of the deviation; for both, the closer to 0 the better. As for the linear fitting equation, a slope and R² closer to 1 and an intercept closer to 0 indicate better accuracy.
VAR, UNI and the linear fitting equation are shown in the lower right corner of each panel. VAR represents the total deviation, regardless of overestimation or underestimation, and is always positive. UNI represents the uniformity of the deviation, so it can be either positive or negative: a positive UNI indicates overall overestimation, while a negative one indicates overall underestimation. For both, the closer to 0 the better. As pointed out by Spinelli et al.,18 the linearity (Pearson's R²) of the linear fit alone cannot reflect the accuracy comprehensively; the slope and intercept of the linear fitting equation matter as well. A slope closer to 1, an intercept closer to 0 and an R² closer to 1 indicate better accuracy.
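The role of the slope, intercept and R² can be illustrated with an ordinary least-squares fit on log10-transformed values. This sketch assumes that the fit regresses log10(calculated) on log10(injected); the function name and the toy quantities are hypothetical. VAR and UNI are not included, because formulae (8) and (9) are not reproduced in this text.

```python
import math


def log_log_fit(injected, calculated):
    """Least-squares fit of log10(calculated) against log10(injected).

    Returns (slope, intercept, r_squared).
    """
    x = [math.log10(v) for v in injected]
    y = [math.log10(v) for v in calculated]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


injected = [0.5, 5.0, 50.0, 500.0]                    # fmol, four molar levels
ideal = list(injected)                                # perfect quantification
compressed = [10 * math.sqrt(v) for v in injected]    # dynamic range halved in log space

ideal_fit = log_log_fit(injected, ideal)
compressed_fit = log_log_fit(injected, compressed)
```

The compressed series is perfectly linear (R² of 1) yet has a slope of 0.5 and an intercept of 1, which is why linearity alone cannot reveal a dynamic range that is too narrow.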
From Fig. 2A–F and J–O, we can see a sharp reduction of VAR and UNI for APEX-SMT compared to those of the corresponding APEX and APEX-SI on both instruments. Also, the slope of APEX-SMT is closer to 1, and its intercept closer to 0, than those of APEX and APEX-SI. Even the linearity (R²) of APEX-SMT is better than that of APEX and APEX-SI on LTQ XL (on Orbitrap Velos it is slightly worse, owing to the obvious overestimation at the 500 amol molar level). The same trend is observed for the normalized AF algorithms (Fig. 2G–I, P–R). All these results suggest that APEX-SMT outperforms the corresponding APEX or APEX-SI in accuracy on both instruments, considering all the estimators discussed above. The major problem of APEX and APEX-SI is the evident overestimation. In other words, the dynamic ranges of APEX and APEX-SI are too narrow to truly reflect protein concentrations spanning 3 orders of magnitude. In contrast, the true dynamic range of the sample is approximately reflected by APEX-SMT. The comparison among the three algorithms can be summarized as follows. APEX-SMT is superior to APEX-SI in every aspect, including better reproducibility among injections and a much better ability to reflect the protein concentrations. The only drawback of APEX-SMT is its slightly worse reproducibility than APEX, which is unavoidable owing to the slightly worse reproducibility of SMT itself.
The normalized AF algorithms slightly outperform the corresponding training algorithms on Orbitrap Velos (Fig. 2J–R). However, all estimators are based on the average calculated quantity at each molar level; the deviation among individual proteins at each molar level remains untested. Therefore, we calculated the largest fold change among the individual calculated quantities at every molar level (the largest calculated quantity divided by the smallest) and summarized the results in Table 3, which validates the effect of the machine learning algorithms. The fold changes are significantly reduced after training at every molar level on both instruments (with 2 exceptions at 500 amol on Orbitrap). The fold changes can be up to ∼40 on LTQ and ∼110 on Orbitrap without training, while after training they are reduced to within 10 on LTQ and ∼20 on Orbitrap. Although the average calculated quantities correlate well with the injected amounts using either APEX-SMT or normalized SMT, the latter cannot guarantee the quantification accuracy of any individual protein. This confirms our hypothesis that the dynamic ranges are intrinsic to the AFs themselves, while the machine learning algorithm does its job in reducing the discrepancy among the calculated quantities of different proteins of equal amount. In other words, the combination of APEX and SMT does preserve the advantages of both.
Molar level | LTQ XL, trained by yeast: APEX | LTQ XL, trained by yeast: APEX-SI | LTQ XL, trained by yeast: APEX-SMT | LTQ XL, trained by rat: APEX | LTQ XL, trained by rat: APEX-SI | LTQ XL, trained by rat: APEX-SMT | LTQ XL, training free: Normalized SC | LTQ XL, training free: Normalized SI | LTQ XL, training free: Normalized SMT | Orbitrap Velos, trained by yeast: APEX | Orbitrap Velos, trained by yeast: APEX-SI | Orbitrap Velos, trained by yeast: APEX-SMT | Orbitrap Velos, trained by rat: APEX | Orbitrap Velos, trained by rat: APEX-SI | Orbitrap Velos, trained by rat: APEX-SMT | Orbitrap Velos, training free: Normalized SC | Orbitrap Velos, training free: Normalized SI | Orbitrap Velos, training free: Normalized SMT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
500 fmol | 2.2 | 3.3 | 7.5 | 2.2 | 2.5 | 8.6 | 15.3 | 15.0 | 38.9 | 4.1 | 4.6 | 18.2 | 4.2 | 4.2 | 18.6 | 24.1 | 27.8 | 106.3 |
50 fmol | 2.7 | 4.8 | 3.7 | 3.1 | 4.2 | 3.7 | 6.9 | 26.9 | 8.8 | 2.5 | 6.5 | 11.5 | 2.4 | 7.7 | 10.7 | 17.0 | 21.7 | 32.1 |
5 fmol | 3.6 | 2.9 | 3.7 | 4.3 | 3.4 | 4.3 | 5.3 | 5.8 | 6.0 | 2.7 | 6.5 | 6.6 | 2.9 | 6.8 | 6.9 | 8.3 | 8.8 | 8.1 |
500 amol | — | — | — | — | — | — | — | — | — | 9.6 | 22.2 | 8.1 | 9.9 | 18.7 | 7.8 | 4.2 | 29.2 | 4.0 |
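The largest fold change reported in the table above is simply the largest calculated quantity at a molar level divided by the smallest. A minimal sketch, with hypothetical protein quantities and a hypothetical function name:

```python
from collections import defaultdict


def largest_fold_change(quantities):
    """quantities: iterable of (molar_level, calculated_quantity) pairs.

    Returns {molar_level: max quantity / min quantity} across the proteins
    that share the same injected amount.
    """
    by_level = defaultdict(list)
    for level, quantity in quantities:
        by_level[level].append(quantity)
    return {level: max(values) / min(values) for level, values in by_level.items()}


# Hypothetical calculated quantities (fmol) for proteins injected at two levels.
calculated = [
    ("500 fmol", 250.0), ("500 fmol", 400.0), ("500 fmol", 1000.0),
    ("50 fmol", 30.0), ("50 fmol", 60.0),
]
fold_changes = largest_fold_change(calculated)
```

A value near 1 means that proteins of equal amount receive nearly equal estimates, so a smaller fold change after training indicates that the correction is working.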
We further compared APEX, APEX-SI and APEX-SMT using the yeast-training data by calculating, for every individual protein, the absolute relative deviation, |calculated − injected|/injected. The average deviations for the 22 proteins identified by LTQ are 101% (APEX), 144% (APEX-SI) and 45% (APEX-SMT), while those for the 30 proteins identified by Orbitrap are 1134% (APEX), 787% (APEX-SI) and 146% (APEX-SMT). The major deviations on Orbitrap come from the lowest molar level (500 amol). After excluding the 6 proteins at 500 amol, the deviations of the remaining 24 proteins are reduced to 383% (APEX), 276% (APEX-SI) and 59% (APEX-SMT). The average deviations of APEX-SMT are thus reduced by 55% (LTQ) to 85% (Orbitrap Velos) compared to APEX. Similar results were obtained from the rat-training data (Tables S2 and S3†). All these results demonstrate that APEX-SMT clearly improves the quantification accuracy compared to APEX and APEX-SI.
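The deviation measure and the quoted reductions can be checked with a few lines of arithmetic. The function names are hypothetical and the per-protein pairs are toy values; the percentages passed to `reduction_percent` are the averages quoted in the paragraph above.

```python
def mean_relative_deviation(pairs):
    """Average of |calculated - injected| / injected over (calculated, injected) pairs."""
    return sum(abs(c - i) / i for c, i in pairs) / len(pairs)


def reduction_percent(reference, improved):
    """Relative reduction of `improved` with respect to `reference`, in percent."""
    return 100.0 * (reference - improved) / reference


# Toy example: one protein overestimated twofold, one underestimated by half.
toy = mean_relative_deviation([(100.0, 50.0), (25.0, 50.0)])

# Average deviations (%) quoted in the text, APEX versus APEX-SMT.
ltq = reduction_percent(101, 45)                  # 22 proteins on LTQ
orbitrap_all = reduction_percent(1134, 146)       # all 30 proteins on Orbitrap
orbitrap_subset = reduction_percent(383, 59)      # 24 proteins, 500 amol excluded
```

Rounded to one decimal place these give 55.4%, 87.1% and 84.6%. The quoted range of 55% to 85% therefore appears to correspond to the LTQ figures and to the Orbitrap figures after excluding the 500 amol level.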
Our study shows that no bias is observed along the way from training to reporting the quantification result, regardless of which training organism (yeast or rat) is used. We therefore conclude that any high-quality real sample dataset acquired on an LC-MS/MS system can be used for training on that particular system setup, allowing researchers to retroactively make use of the existing data in their own labs for either training or quantification.
Significantly differentially expressed proteins were detected by the Power Law Global Error Model (PLGEM)28 (version 1.24.0) in an R environment (version 2.13.0) at 14 ascending p values (p = 0.001, 0.005, 0.01, 0.02, 0.05, 0.08, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8). For details of the PLGEM settings and performance evaluation, please refer to the ESI (Fig. S2†). The list of significantly differentially expressed proteins contains both UPS1 proteins and yeast proteins. Significant changes of UPS1 proteins are defined as “true positive” changes, while those of yeast proteins are defined as “false positive” changes. Therefore, the false positive rate (FPR) is defined as the proportion of false positive yeast proteins among all identified yeast proteins (1074), while the sensitivity is defined as the proportion of true positive UPS1 proteins among all identified UPS1 proteins (44). FPR versus sensitivity is plotted in Fig. 3B. Note that almost all statistical models, including PLGEM, place high demands on the reproducibility of the AF in order to discriminate the truly significantly changed proteins from the false ones. It is therefore not surprising that the performance of APEX-SI is worse than that of the other two over the whole FPR range. APEX and APEX-SMT perform almost the same in the low FPR range (≤0.4). These results indicate that APEX-SMT is similar to APEX in its ability to detect differentially expressed proteins in the reasonable FPR range, and demonstrate that APEX-SMT can be reliably applied in relative quantification.
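The two rates defined above can be computed directly from a list of proteins called significant at a given p value. This is a sketch with hypothetical protein identifiers and a hypothetical function name; only the denominators (1074 identified yeast proteins and 44 identified UPS1 proteins) are taken from the text.

```python
def fpr_and_sensitivity(significant, ups1_ids, n_yeast=1074, n_ups1=44):
    """Return (false positive rate, sensitivity) for one significance threshold.

    significant : identifiers of proteins called differentially expressed
    ups1_ids    : identifiers of the spiked-in UPS1 proteins (true changes)
    """
    true_positive = sum(1 for protein in significant if protein in ups1_ids)
    false_positive = len(significant) - true_positive
    return false_positive / n_yeast, true_positive / n_ups1


# Hypothetical call list: 22 UPS1 proteins and 537 yeast proteins.
ups1_ids = {f"UPS1_{k}" for k in range(44)}
significant = [f"UPS1_{k}" for k in range(22)] + [f"YEAST_{k}" for k in range(537)]
fpr, sensitivity = fpr_and_sensitivity(significant, ups1_ids)
```

Repeating the calculation at each of the 14 p values gives the points of an FPR versus sensitivity curve.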
The appropriate incorporation of MS level AFs, such as the precursor intensity or area under the curve (AUC), holds the potential to further improve the quantification accuracy. This study could open up further work on combining computational methods that correct the mass spectrometric bias among disparate proteins with novel types of reproducible and accurate AFs. Hopefully, researchers will ultimately obtain an accurate perspective of biological and metabolic processes.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c3an01738a |
This journal is © The Royal Society of Chemistry 2014 |