Yankun Li* and Xiangchao Zeng
College of Environment Science and Engineering, North China Electric Power University, Baoding, Hebei, China 071003. E-mail: 309267061@qq.com
First published on 16th November 2015
SELDI-TOF MS serum peptide profiles of malignant and benign ovarian tumor samples were studied using a pattern recognition technique. A model based on uncorrelated linear discriminant analysis (ULDA) combined with variable selection by variance analysis was constructed to identify ovarian tumor serum samples, and its results were compared with those obtained from principal component analysis (PCA) and partial least squares-discriminant analysis (PLS-DA). In addition, specific peaks (m/z locations) were selected as potential biomarkers. The results indicate that ULDA combined with variable selection, applied to serum SELDI-TOF MS, is a practicable and promising method for distinguishing malignant from benign ovarian tumors and for selecting potential biomarkers.
To support in-depth study of high-throughput profiles and to extract disease-specific information, effective chemometric (chemical informatics) methods have been introduced.1–3 Pattern recognition techniques are concerned with the theory and algorithms for assigning abstract objects, e.g. measurements made on physical objects, to categories. Pattern recognition methods are useful in many areas such as information retrieval, data mining, document image recognition and bioinformatics. Some of them, for example, principal component analysis (PCA), soft independent modeling of class analogy (SIMCA) and partial least squares-discriminant analysis (PLS-DA),4,5 have been used in near-infrared spectroscopy (NIRs), infrared spectroscopy (IRs), and fluorescence spectroscopy (FS).6,7
It is commonly believed that proteins, peptides and metabolites in human blood (or other body fluids) can reflect the state of the human body in an accurate and timely manner. Surface enhanced laser desorption/ionization time of flight mass spectrometry (SELDI-TOF MS)8,9 is one of the powerful tools used in proteomics, and it has been widely applied to analyze body fluids, including serum and saliva. SELDI-TOF MS serum peptide profiles, also known as serum peptide fingerprint spectra, capture the peptide content of blood and can reflect individual differences and the health state of the human body. At present, most studies use statistical tests that compare peak intensities in the fingerprint spectrum to screen for proteins that differ between the normal group and the patient group. Chemometric methods, in contrast, are expected to extract implicit information about the disease from the serum protein profile and to establish a model for cancer diagnosis. In this way, errors caused by the statistical test or the related pre-processing can be avoided, and the difficult problem of differential protein screening can also be addressed.
Uncorrelated linear discriminant analysis (ULDA) is a pattern recognition method whose extracted features are statistically uncorrelated. ULDA has been successfully applied in the data analysis of metabolomics, proteomics and gene expression profiles.10 ULDA is based on linear discriminant analysis (LDA) and is used to find the optimal discriminant subspace and the characteristic variables. In contrast to the discriminant vectors of traditional Fisher discriminant analysis, the discriminant vectors of ULDA are uncorrelated with each other, which reduces redundancy so that more information is retained.
Moreover, better calibration models may be obtained by selecting characteristic variables that carry sample-specific or component-specific information instead of using the full spectrum. Accordingly, several methods have been developed, for example, interval PLS (iPLS),11 stepwise regression analysis (SRA),12 Monte Carlo-uninformative variable elimination (MC-UVE)13,14 and genetic algorithms (GA).15 In this work, variable selection by variance analysis16 was first used to select the characteristic variables from the SELDI-TOF MS serum peptide profiles. Sample-specific or component-specific information was retained, while uninformative variables were discarded. As a result, the calibration model was improved and considerably simplified by using fewer variables.17
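As an illustration of variable selection by variance analysis, the following minimal sketch ranks each m/z variable by a one-way ANOVA F statistic and keeps the highest-ranked ones. The exact procedure of ref. 16 is not described in this text, so the F-ranking rule, the function names `anova_f_scores` and `select_top_variables`, and the parameter `n_keep` are assumptions made for illustration only.

```python
import numpy as np


def anova_f_scores(X, y):
    """One-way ANOVA F statistic for every column (variable) of X.

    X: (n_samples, n_variables) intensity matrix; y: class label per sample.
    Assumption: a plain F statistic is used as the ranking criterion.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, k = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        ss_between += Xc.shape[0] * (mean_c - grand_mean) ** 2
        ss_within += ((Xc - mean_c) ** 2).sum(axis=0)
    # Mean squares; a variable that is constant within every class
    # has zero within-class variance and yields inf or nan here.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_top_variables(X, y, n_keep):
    """Indices (in ascending order) of the n_keep variables with the largest F."""
    f = anova_f_scores(X, y)
    return np.sort(np.argsort(f)[::-1][:n_keep])
```

The reduced matrix would then be `X[:, select_top_variables(X, y, n_keep)]`, with the same column indices applied to the prediction set so that no information from the prediction samples enters the selection.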
Then, the ULDA algorithm was used to classify the SELDI-TOF MS serum peptide profiles of malignant and benign ovarian tumor samples; both the sensitivity and specificity were 100%. At the same time, several peaks (m/z locations) were selected as potential biomarkers using the transformation vector of ULDA. Finally, conventional PLS-DA and PCA methods combined with variance analysis were also used to classify the same data, and their classification results were found to be greatly inferior to those obtained with ULDA.
Consequently, ULDA combined with variance analysis can be applied to identify malignant and benign ovarian tumor SELDI-TOF MS profiles. It offers a new way of exploring the relationship between SELDI-TOF MS profiles and cancer characteristics by selecting potential cancer biomarkers, and thereby of constructing an effective cancer diagnosis model.
A key property of ULDA is that it removes the correlation among features in the transformed space, so that the features in the reduced space are uncorrelated with each other. The uncorrelated discriminant vector (UDV) obtained from ULDA has better classification ability. The specific algorithm has been reported in the literature.24
Let $X = (x_{ij}) \in \mathbb{R}^{n \times p}$ be the data matrix, where $n$ and $p$ are the numbers of samples and variables, respectively. Assume that the samples belong to $k$ classes, where the $i$-th class contains $n_i$ samples and has mean $c_i^T \in \mathbb{R}^{1 \times p}$, and that the mean of the whole data set is $c^T \in \mathbb{R}^{1 \times p}$, where $T$ denotes the transpose of a vector or matrix.
(1) Calculate $H_b$ and $H_t$ according to $H_b = [\sqrt{n_1}(c_1 - c), \ldots, \sqrt{n_k}(c_k - c)]^T \in \mathbb{R}^{k \times p}$ and $H_t = X - e c^T \in \mathbb{R}^{n \times p}$, where $e = (1, \ldots, 1)^T \in \mathbb{R}^{n \times 1}$.
(2) Compute the singular value decomposition of $H_t^T$: $H_t^T = U \Sigma V^T$.
(3) Construct $U_1 = [u_1, \ldots, u_r]$, where $u_i$ ($i = 1, 2, \ldots, r$) is the $i$-th column of $U$ and $r$ is the rank of the total scatter matrix $S_t = H_t^T H_t$, i.e. $r = \mathrm{rank}(S_t)$.
(4) Construct $\Sigma_1 = \mathrm{diag}(\lambda_1, \ldots, \lambda_r)$, where $\lambda_i$ ($i = 1, 2, \ldots, r$) is the $i$-th diagonal element of $\Sigma$.
(5) Let $B = \Sigma_1^{-1} U_1^T H_b^T$ and compute its singular value decomposition, $B = P \tilde{\Sigma} Q^T$.
(6) Calculate $A = U_1 \Sigma_1^{-1} P = [a_1, \ldots, a_q, a_{q+1}, \ldots, a_r]$.
(7) Take the first $q$ columns of $A$ to form the transformation matrix $G = [a_1, \ldots, a_q]$, where $q$ is the rank of the between-class scatter matrix $S_b = H_b^T H_b$, i.e. $q = \mathrm{rank}(S_b)$.
The low-dimensional data matrix is then calculated as $Z = XG$, and new data $X_{new}$ are projected in the same way, $Z_{new} = X_{new} G$.
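The seven steps can be sketched in NumPy as follows. This is a minimal illustration, not the authors' Matlab code: it assumes the standard ULDA construction, in which the rows of Hb are the class means minus the global mean, scaled by the square root of the class size, and B = Σ1⁻¹ U1ᵀ Hbᵀ. The function name `ulda_fit` and the rank tolerance `tol` are hypothetical.

```python
import numpy as np


def ulda_fit(X, y, tol=1e-10):
    """Return the ULDA transformation matrix G (p x q).

    X: (n, p) data matrix with samples in rows; y: class label per sample.
    tol is an assumed relative threshold used to decide numerical rank.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    c = X.mean(axis=0)  # global mean, shape (p,)

    # Step 1: Ht (n x p) and Hb (k x p).
    Ht = X - c
    Hb = np.vstack([
        np.sqrt(np.sum(y == k)) * (X[y == k].mean(axis=0) - c)
        for k in classes
    ])

    # Step 2: thin SVD of Ht^T = U Sigma V^T.
    U, s, _ = np.linalg.svd(Ht.T, full_matrices=False)

    # Steps 3-4: keep the r = rank(St) leading vectors and singular values.
    r = int(np.sum(s > tol * s[0]))
    U1, s1 = U[:, :r], s[:r]

    # Step 5: B = Sigma1^{-1} U1^T Hb^T, followed by its SVD.
    B = (U1.T @ Hb.T) / s1[:, None]
    P, sb, _ = np.linalg.svd(B, full_matrices=False)

    # Step 6: A = U1 Sigma1^{-1} P. The thin SVD returns only the leading
    # min(r, k) columns of P, which is enough because q <= k - 1.
    A = (U1 / s1) @ P

    # Step 7: keep the first q = rank(Sb) columns.
    q = int(np.sum(sb > tol * sb[0]))
    return A[:, :q]
```

With `G = ulda_fit(X_train, y_train)`, the scores are `Z = X_train @ G` and `Z_new = X_new @ G`. For a two-class problem such as malignant versus benign, `G` has a single column, so each sample is reduced to one UDV value.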
Fig. 1 shows the SELDI-TOF MS profile of one ovarian tumor example.
Then, the sample data were divided into two parts. 50 samples of cancer patients and 8 benign samples were arbitrarily chosen for modeling. The remaining 50 samples of cancer patients and 8 benign samples were used for prediction studies.
Matlab 7.0 was used as the calculation software.
Fig. 2 The uncorrelated discriminant vectors of ULDA (○: malignant samples of modeling; *: benign samples of modeling; +: malignant samples of prediction; ●: benign samples of prediction).
It can be clearly observed from Fig. 2 that the uncorrelated discriminant vectors obtained by ULDA can completely distinguish the cancer samples and benign samples. The samples were classified with 100% sensitivity (percentage of cancer samples correctly identified) and 100% specificity (percentage of benign samples correctly identified) for both the training set and the prediction set.
Because UDV is a linear combination of the original variables (spectra) defined by the coefficients of the transformation vector G, the transformation vector G obtained by ULDA can be looked upon as “loadings” of ULDA. The larger the absolute value of the loadings is, the more important the variable (m/z location) to the classification. A plot of G for the 100 variables is given in Fig. 3, and 7 peak locations with the highest absolute value of G were selected, which can be regarded as potential biomarkers for identifying ovarian cancer samples and benign samples. They were marked up in the transformation vector plot of ULDA with m/z values of 45.7, 45.8, 116.2, 143.4, 4073.6, 8085.3 and 8150.9.
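Ranking the m/z locations by the absolute value of their loadings can be sketched as below. The function name `top_loading_peaks` and its arguments are hypothetical; `g` stands for the single column of the transformation vector G and `mz` for the m/z values of the retained variables.

```python
import numpy as np


def top_loading_peaks(g, mz, n_peaks=7):
    """Return (m/z, loading) pairs for the n_peaks largest |loading| values.

    g:  loadings, one per retained variable (a column of G).
    mz: m/z location of each retained variable, in the same order as g.
    Pairs are returned in the original variable order.
    """
    g = np.asarray(g, dtype=float).ravel()
    mz = np.asarray(mz, dtype=float)
    order = np.argsort(-np.abs(g))[:n_peaks]  # indices of the largest |g|
    return [(float(mz[i]), float(g[i])) for i in sorted(order)]
```

Because the magnitude of a loading depends on the scale of its variable, such a ranking is most meaningful when the variables have been brought to a comparable scale beforehand; whether this was done in the study is not stated here.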
Fig. 3 The transformation vector plot for ULDA. Seven peak locations with the highest absolute values of G were selected as potential markers.
To evaluate the 7 potential biomarkers selected above, new training and prediction matrices were constructed from the original ones by retaining the mass spectra only at the 7 m/z locations, and these were then analyzed using ULDA. The classification results for the whole data set are presented in Fig. 4. Although the UDV values of the two classes were not as concentrated as in Fig. 2, most of the UDV values for the cancer samples were higher than −1.40, whereas the UDV values for the benign samples were below −1.40. With a threshold of −1.40, ULDA achieved 97% (97/100) sensitivity and 100% (16/16) specificity. These results suggest that the proposed method is promising for the selection of potential biomarkers, which is very important in proteomics studies. It should be pointed out that the potential biomarkers selected in this study are m/z locations rather than identified proteins; nevertheless, they are highly informative for tumor identification and can contribute to further study of proteomic biomarkers.
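The threshold rule and the two reported rates can be written out as a short sketch. The function names are hypothetical, and the rule "UDV above the threshold means malignant" is taken from the description above.

```python
def classify_by_threshold(udv_scores, threshold=-1.40):
    """Label a sample malignant if its UDV value exceeds the threshold."""
    return ["malignant" if s > threshold else "benign" for s in udv_scores]


def sensitivity_specificity(y_true, y_pred, positive="malignant"):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Raises ZeroDivisionError if one of the two classes is absent from y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)
```

With 97 of the 100 cancer samples above the threshold and all 16 benign samples below it, these definitions give a sensitivity of 97/100 = 0.97 and a specificity of 16/16 = 1.00.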
In conclusion, the proposed strategy based on ULDA combined with variance analysis of serum SELDI-TOF MS is promising for malignant and benign tumor identification.
After comprehensive analysis of the abovementioned results, it can be concluded that the ULDA method was superior to the PCA and PLS-DA methods when applied to classifying malignant and benign tumor serum samples.
In general, PCA and PLS-DA are commonly used methods for feature extraction. However, PCA and PLS-DA are sometimes not very efficient for proteomic data analysis, as discussed in the literature.28 PCA can achieve substantial dimension reduction, but the resulting components are hard to interpret, since even a simple linear combination of the original features is difficult to understand. PLS-DA takes the sample class information into account in the feature selection procedure. Nevertheless, PLS-DA is essentially a feature transformation approach, in which the new variables are combinations of the original variables. Variables with large variance or high covariance can affect the results even though they contain little or no information contributing to the discrimination of samples, which may result in the loss of optimal features in some situations.29,30
The ULDA algorithm requires the features extracted by the column vectors of the transformation matrix to be uncorrelated; therefore, it reduces data redundancy after dimension reduction. The UDVs with the greatest discriminant ability are extracted, and they are uncorrelated with each other. In this way, ULDA maximizes the separation between samples of different classes.31
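The uncorrelatedness can be checked directly from the algorithm steps. The short derivation below assumes the standard ULDA construction, with $H_t^T = U \Sigma V^T$, $U_1$ the first $r$ columns of $U$, $\Sigma_1$ the leading $r \times r$ block of $\Sigma$, and $G = U_1 \Sigma_1^{-1} P_q$, where $P_q$ denotes the first $q$ columns of the orthogonal matrix $P$; the symbols $\Sigma_1$, $P$ and $P_q$ are assumed notation.

```latex
\begin{aligned}
S_t &= H_t^{T} H_t = U \Sigma V^{T} V \Sigma^{T} U^{T} = U_1 \Sigma_1^{2} U_1^{T}, \\
G^{T} S_t G &= P_q^{T} \Sigma_1^{-1} U_1^{T} \left( U_1 \Sigma_1^{2} U_1^{T} \right) U_1 \Sigma_1^{-1} P_q
             = P_q^{T} P_q = I_q .
\end{aligned}
```

Because the off-diagonal entries of $G^T S_t G$ vanish, any two distinct columns of $Z = XG$ have zero sample covariance over the modeling set, which is the sense in which the extracted features are uncorrelated.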
This journal is © The Royal Society of Chemistry 2016