Chencheng
Tang
ab,
Dongfang
Huang
a,
Xudong
Xing
*a and
Hua
Yang
*a
aState Key Laboratory of Natural Medicines, China Pharmaceutical University, Nanjing 211198, China. E-mail: mrxing_xudong@126.com; yanghuacpu@126.com
bSchool of Science, China Pharmaceutical University, Nanjing 211198, China
First published on 8th November 2024
Screening differential metabolites is of great significance in biomarker discovery in metabolomics research. However, it is susceptible to unwanted variations introduced during experiments. Previous normalization methods have improved the accuracy of inter-group classification by eliminating systematic errors. Nonetheless, the classification ability of differential metabolites obtained through these methods still requires further enhancement, and the reproducibility evaluation on importance rankings of differential metabolites is often disregarded. The EigenRF algorithm was developed as an improvement over the previous metabolomics normalization method referred to as EigenMS, which aims to normalize metabolomics data. Furthermore, scoring metrics, including the local consistency (LC) and overall difference (OD) scores, were introduced to evaluate the reproducibility of importance rankings of differential metabolites from a dual perspective. After conducting validation on three publicly accessible datasets, the EigenRF method has demonstrated enhanced classification ability of differential metabolites as well as improved reproducibility. In summary, EigenRF enhances the reliability of differential metabolites in metabolomics research, benefiting the further exploration of molecular mechanisms underlying biological alterations in complex matrices. The EigenRF algorithm was implemented in an R package: https://www.github.com/YangHuaLab/EigenRF.
Due to the inherent unpredictability and irreducibility of random errors, whereas systematic errors exhibit regular patterns and are amenable to reduction or elimination, the process of eliminating experimental errors primarily involves addressing systematic errors induced by batch effects or injection sequences,9,10 also known as “signal drift correction” or “quality control (QC) sample correction” since QC samples can be utilized for signal drift correction. Researchers often refer to integrating this procedure with data standardization as “normalization”, as these two processes are complementary. Certain signal drift correction methods can even substitute the standardization process,3 enabling data to be scaled to a comparable level while eliminating batch effects. The normalization process in metabolomics data processing has evolved from a traditional concept of simple data scaling to a more focused approach to eliminating batch effects and correcting signal drift. Currently, there are four main types of normalization methods applied to metabolomics: (1) QC sample-based methods, (2) internal standard (IS)-based methods, (3) biological sample-based methods, and (4) combinations thereof.11 Therein, biological sample-based methods identify and estimate systematic errors directly from the signal expression matrix of biological samples without the assistance of internal standards or QC samples. Karpievitch et al. formed the EigenMS algorithm by incorporating the singular value algorithm (SVA) and initially applying it to proteomics,12,13 later extending it to metabolomics in 2014.14 Like SVA, EigenMS captures systematic errors through singular value decomposition and subtracts them from the original data to obtain normalized data. Compared to EigenMS, the SVA requires surrogate variables to predict trends in systematic deviations and includes them as covariates in downstream statistical inference. EigenMS exhibited high accuracy and reproducibility.15 However, it assumes a linear relationship between the biological variations of interest and sample groups, disregarding their potential nonlinear nature. Consequently, this limitation may result in some biological variations of interest being delimited into the residual matrix and eliminated as systematic errors.
Hence, we have improved the EigenMS method and proposed it as EigenRF. EigenRF comprises two steps in the normalization process. Initially, EigenMS is utilized to estimate and eliminate systematic errors from the data. Subsequently, a random forest regression model is employed to estimate the nonlinear biological variations of interest, which are reintegrated into the data. Random forest, a decision tree-based algorithm, excels at capturing nonlinearities in data,16,17 complementing the linear model employed in EigenMS. Furthermore, novel metrics were introduced to evaluate the reproducibility of importance rankings of differential metabolites in group discrimination, including two scores: the local consistency (LC) score, which evaluates the consistency of local rankings of differential metabolites, and the overall difference (OD) score, which evaluates the difference in overall rankings of differential metabolites. We applied EigenRF to three datasets, and the results demonstrated improved accuracy and reproducibility, thereby emphasizing its superiority over other methods.
BCPUM: a urine metabolomics study of the bladder cancer population.18 The dataset consists of 982 features and 311 samples, representing high dimensionality with a relatively small sample size. It includes 128 biological samples from the TG, 120 from the CG, and 63 QC samples. The samples were run in 12 batches, each containing approximately 5 QC samples (with a 5-biological-sample interval) and 20 biological samples, with the two groups of biological samples randomly ordered. Features with missing values exceeding 20% in QC samples and exceeding 50% in both groups of biological samples were removed. Afterwards, 661 features remained, and missing values were imputed with the mean values of each group.
GCPPM: a plasma metabolomics study of gastric cancer population.4 The dataset comprises 354 features (including 6 internal standards) and 615 samples, representing low dimensionality with a relatively large sample size. It includes 60, 197, and 240 biological samples from groups A, B, and C, respectively, and 120 QC samples. The samples were run in 7 batches, each containing approximately 17 QC samples (with a 5-biological-sample interval) and 70 biological samples, with the three groups of biological samples randomly ordered. The column and injector were adjusted by injecting adjusted QC samples and blank samples at the beginning of each batch to minimize systematic errors.
ACPPM: a plasma metabolomics study of adenocarcinoma population.19 The dataset consists of 6402 features and 729 samples, representing high dimensionality, a large sample size, and an imbalance in the number of samples. It includes 571 and 73 biological samples from CRC and CE, respectively, and 85 QC samples. The samples were run in 4 batches, each containing approximately 25 QC samples (with an 8-biological-sample interval) and 192 biological samples, with the two groups of biological samples randomly ordered.
The detailed procedures of EigenRF are as follows:
First, we assume that an additive model can represent the log-transformed feature signal values:
yi = Xlβli + WEi + εi, | (1) |
Next, the linear model of the EigenMS algorithm is employed to fit the biological variations of interest and obtain a residual matrix:
R = Y − lm(G), | (2) |
Subsequently, singular value decomposition is performed on the residual matrix R, like that of EigenMS:
R = UDV′, | (3) |
R = BV0 + ε, | (4) |
Further, the estimated systematic errors matrix WE is the estimated value of BV0:
(5) |
WEi = Xni + Wri, | (6) |
A random forest regression model is employed to fit the nonlinear biological variations of interest:
Xni ∼ Φi(G,I−i), | (7) |
Finally, the systematic errors are eliminated from the feature, that is, subtract the systematic errors captured by EigenMS, and add the nonlinear component of the biological variations of interest estimated by the random forest regression model, then the normalized value is obtained:
(8) |
Two scores are proposed here to evaluate the reproducibility of the screened differential metabolites, as shown in Fig. 1B. First, the new dataset is extracted by stratified sampling at a ratio of 3:1 from the normalized dataset. This process is repeated five times with replacement in the same ratio, resulting in five new datasets. For each new dataset, differential metabolites are screened according to the criteria of variable importance in projection (VIP) ≥1 and false discovery rate (FDR)-adjusted p-value <0.05. These metabolites are then ranked primarily in descending order of VIP and in ascending order of the p-value when VIPs are the same. The top 100 metabolites from each of the five new datasets are selected, and the LC score and the OD score are calculated as follows:
(9) |
(10) |
(11) |
(12) |
The LC score measures the consistency of local rankings, and the OD score measures the difference of overall rankings. The higher the LC score and the lower the OD score, the higher the reproducibility is.
EigenRF was further compared with EigenMS and ten other normalization methods (Table S1†) on the low-dimensional and relatively large sample size GCPPM dataset. These ten common normalization methods include four QC sample-based methods, three IS-based methods, two biological sample-based methods, and one combination method. All methods decreased the RSDs of QC samples compared to the original log-transformed data (Fig. S3A†), and except for RLSC and ber, the median RSD of each batch of QC samples was also decreased by the remaining methods (Fig. S3B†). Moreover, except for the RSC method, the normalized data obtained from the other methods exhibited no significant signal drift (Fig. S4A†) or within-batch clustering tendencies (Fig. S4B†). EigenMS and EigenRF demonstrated slightly superior performance to the other methods in the run plots and the batch-based PCA plots. In the group-based PCA plots (Fig. S4C†), most methods could not effectively distinguish the QC, A, B, and C samples, except for EigenMS and EigenRF, which provided clear distinctions among these groups. Similarly, in the group-based OPLS-DA score plots (Fig. S4D†), the performance of the other methods was suboptimal, with only EigenMS and EigenRF being able to fully distinguish among the A, B, and C samples. Based on the ROC curves depicted in Fig. 2D, the mean AUC values obtained from log-transform, RLSC, RSC, SVR, SERRF, SIS, NOMIS, CCMN, ber, WaveICA, ISWSVR, EigenMS, and EigenRF were 0.682, 0.739, 0.683, 0.679, 0.761, 0.700, 0.724, 0.707, 0.806, 0.729, 0.712, 0.975, and 0.993, respectively. EigenRF demonstrated superior classification performance, surpassing EigenMS and significantly outperforming ber, which performed best among the other ten normalization methods. However, the remaining methods demonstrated no significant effect. The reproducibility scores illustrated in Fig. 2E and F indicate that, with the exception of SERRF, NOMIS, and ISWSVR, all other methods improved the LC score over the log-transformed data. On the other hand, all methods except SERRF, CCMN, and ISWSVR showed a decrease in the OD score over the log-transformed data. The EigenRF method achieved the highest LC score of 73 and the lowest OD score of 7.122, outperforming EigenMS with an LC score of 65 and an OD score of 7.296, and also significantly higher than the subsequent LC score of 56 for WaveICA and lower than the subsequent OD score of 7.506 for RSC. In general, EigenRF outperformed other methods in eliminating systematic errors and preserving biological variations of interest, and its reproducibility in differential metabolites screening was also excellent.
All methods exhibited overall lower RSDs of QC samples compared to the original log-transformed data (Fig. 3A). However, RLSC and ber performed poorly in terms of median RSDs across almost all batches of QC samples, while the median RSD of each batch of QC samples for the other methods was considerably lower than that of the log-transformed data (Fig. 3B). Among them, EigenRF exhibited slight improvements in both RSD metrics compared to EigenMS. The signal drift manifested in the run plots was moderately reduced through normalization (Fig. 4A and S5A†), as the feature signals of the QC, CRC, and CE samples became relatively smooth after being normalized. Additionally, there were no apparent within-batch clustering tendencies in QC, CRC, and CE samples after normalization of EigenMS and EigenRF (Fig. 4B and S5B†). SERRF demonstrated slightly lower effectiveness in eliminating batch effects than EigenMS and EigenRF. In contrast, other methods only achieved satisfactory results across certain sample groups. For instance, ber and WaveICA showed distinct batch clustering effects in QC samples, while RLSC, RSC, and SVR exhibited such effects in CRC samples, and RLSC and RSC illustrated clear batch clustering in CE samples. Furthermore, the group-based PCA plots (Fig. 4C and S5C†) showed that the classification trends were only fully apparent after normalization with EigenMS and EigenRF, clearly distinguishing the QC, CRC, and CE samples. Other methods did not separate these three groups of samples. This indicates that the batch effects and signal drift present in the original data introduced deviations in PCA, which could be effectively eliminated through normalization with EigenMS and EigenRF. The OPLS-DA score plots of the CRC and CE samples also reflected changes in classification trends before and after normalization of EigenRF and other methods (Fig. 4D and S5D†), highlighting the advantages of EigenRF. Differentiated metabolites between the CRC and CE samples were examined to investigate the impact of normalization on feature selection. As shown in Fig. 5A, the mean AUC values obtained from different normalization methods differed from each other as well as from that of the log-transformed data, indicating that normalization methods can affect the results of feature selection. Among the methods evaluated, EigenMS and EigenRF significantly improved classification accuracy, achieving mean AUC values of 0.856 and 0.921, respectively. The remaining methods did not show substantial improvement compared to the log-transformed data. In fact, the mean AUC values of several normalization methods were lower than that of the log-transformed data, such as RSC, SVR, and ber. This indicates that EigenRF not only considerably outperformed the other methods but also enhanced performance based on EigenMS. Considering the imbalanced nature of the dataset, the PR curve was utilized as an additional evaluation metric (Fig. S6†). The results indicate that the log-transformed data exhibited poor classification performance for CE samples, with a mean AUC value of 0.183, suggesting that it struggled to identify CE samples accurately. Furthermore, methods such as RLSC, RSC, SVR, ber, and WaveICA showed little to no improvement overall. SERRF showed a moderate improvement, with a mean AUC value of 0.350. In contrast, EigenRF demonstrated significant enhancement for log-transformed data, effectively identifying CE samples with a mean AUC value of 0.812, surpassing the mean AUC value of EigenMS at 0.624, as well as other methods. In the reproducibility score plots from Fig. 5B and C, the LC scores for RLSC, SVR, SERRF, EigenMS, and EigenRF were markedly higher than that of the log-transformed data, with EigenRF achieving the highest score of 39, followed by EigenMS with a score of 35. In comparison, RSC, ber, and WaveICA only slightly increased the LC score above that of the log-transformed data. Regarding OD scores, RSC and ber were only slightly lower than the log-transformed data, while RLSC, SVR, SERRF, and WaveICA showed a moderate decrease. Notably, EigenMS and EigenRF significantly decreased the OD score of the log-transformed data, with scores of 10.336 and 9.503, respectively. As a result, the reproducibility of differential metabolites screening has been substantially improved with EigenRF.
Fig. 3 Cumulative frequency distribution of the RSDs of QC samples (A) and the median RSD of each batch of QC samples (B) in the ACPPM dataset. |
Fig. 4 Run plots of the first feature (A), batch-based PCA plots of the QC, CRC, and CE samples (B), group-based PCA plots (C), and group-based OPLS-DA score plots (D) in the ACPPM dataset. |
In summary, the EigenRF normalization method can effectively eliminate batch effects and signal drift in original data and improve the accuracy and reproducibility of the feature selection within the imbalanced dataset.
EigenRF was extensively tested and evaluated on the high-dimensional and relatively small sample size BCPUM dataset, as well as the low-dimensional and relatively large sample size GCPPM dataset, in comparison with EigenMS and other commonly used normalization methods. The results indicate that EigenRF exhibited significant advantages over EigenMS and other normalization methods in terms of AUC values and reproducibility scores (Fig. 2A–F). This suggests that EigenRF is capable of identifying features that are truly relevant to biological questions from the raw data while ensuring result reproducibility, thereby enhancing the reliability of the findings. It provided substantial evidence for the existence of nonlinear variation and non-negligible impact on biological variations of interest. From a visual perspective, the run plots generated by EigenRF showed the most compact and stable signal values (Fig. S2A and S4A†). Compared to EigenMS, EigenRF exhibited slightly different signal value levels, indicating that EigenRF successfully captured the component of nonlinear biological variations, which was an intuitive manifestation of the enhanced effects. EigenRF outperformed other normalization methods in PCA and OPLS-DA score plots (Fig. S2B–D and S4B–D†), indicating superior utility. Furthermore, the PCA and OPLS-DA score plots of EigenRF were fairly similar to those of EigenMS, which reflects the limitations of such traditional metrics used to evaluate normalization algorithms due to the nature of the linear dimensionality reduction technique. The EigenRF algorithm has been applied to the high-dimensional, large-sample, and imbalanced ACPPM dataset, and it has continued to maintain a leading advantage over other methods in various aspects, including AUC values of ROC and PR curves (Fig. 5A and S6†), as well as reproducibility scores (Fig. 5B and C). This further validates its broad applicability and robust normalization capability when handling complex datasets.
In this study, we observed differences in the performance of the EigenRF method across various datasets. These discrepancies may arise from several factors: (1) Dimensionality and sample size of the dataset: EigenRF demonstrated advantages when handling the low-dimensional and relatively large sample size GCPPM dataset compared to the high-dimensional and relatively small sample size BCPUM dataset. This can be attributed to the fact that the number of features in the BCPUM dataset far exceeds the number of samples, leading the model to be more susceptible to noise in the high-dimensional space, making batch effects more pronounced. Additionally, the insufficient sample size makes it difficult to accurately capture the overall distribution of the data, thereby affecting the normalization effectiveness. In contrast, the relatively large sample size of the GCPPM dataset allows for a better reflection of the true data structure, facilitating easier identification and elimination of batch effects. (2) Imbalance in the dataset: In the ACPPM dataset, the imbalance in sample distribution may have impacted the performance of EigenRF. When dealing with such imbalanced data, EigenRF might struggle due to insufficient data for certain groups, which could hinder the effectiveness of normalization methods across different groups and reduce classification ability. However, since the total sample size of ACPPM is large, its performance across various metrics remains satisfactory. (3) Types and degrees of systematic errors: Different datasets may be influenced by various systematic errors, including sample handling and storage discrepancies, laboratory differences, and instrument variations. The superior performance of EigenRF on certain datasets may be due to its effective identification and correction of these specific systematic errors. (4) Algorithm adaptability: The EigenRF algorithm integrates both linear and nonlinear models, enabling it to adapt to the characteristics of different datasets. In some datasets, the prominence of nonlinear patterns may be more pronounced, thus favoring the advantages offered by EigenRF in those contexts. Overall, the differences in the performance of the EigenRF method across various datasets reflect its adaptability and flexibility in addressing different types and complexities of systematic errors. These findings underscore the importance of selecting appropriate normalization methods in metabolomics research.
Normalization can significantly alter the numerical values of the original data, potentially affecting downstream analysis results. Evaluating the performance of normalization methods using different evaluation metrics aims to assess their capabilities comprehensively. Typically, metrics such as RSD and run plots are used to evaluate the performance of normalization methods.4 Theoretically, the signal values of QC samples should be identical for a given feature. In general, higher RSD values of QC samples indicate more unwanted variations introduced during the experimental process and lower reproducibility of the results.22 To identify the systematic errors caused by batch and injection order, it is more appropriate to calculate the median RSD for each batch and compare the values. In our calculations across three datasets, we found that log transformation is sufficient to decrease the RSD of QC samples. This reflects both the good quality of the original data and the limitation of RSD as an indicator to demonstrate the ability of normalization methods to eliminate batch effects. Run plots provide a more intuitive visualization method to showcase batch effects and signal drift in the data, facilitating comparisons among different normalization methods. As demonstrated by the run plots generated on three datasets, the trend of feature signal values of QC samples normalized by EigenRF concentrates around a horizontal line, indicating its superior stability compared to other methods. Although RSD and run plots are widely used metrics for evaluating normalization method performance, they primarily focus on internal variability and fall short of comprehensively assessing the impact of normalization on result reproducibility. Furthermore, a lower RSD of QC samples does not necessarily indicate that a normalization method is optimal. Effective normalization methods should not only mitigate systematic errors but also preserve biologically relevant variations associated with the biological questions of interest. Some normalization methods may effectively decrease the RSD of QC samples but could excessively smooth the data, leading to the loss of critical biological variations of interest. In such cases, while the RSD of QC samples might be low, the actual analytical outcomes could be inaccurate.
Considering that normalization methods could significantly impact downstream analysis involving classification visualization and feature selection, we evaluated the performance of normalization methods by classification accuracy and reproducibility of feature selection. An excellent normalization method should preserve the biological variations of interest and maintain high reproducibility of feature selection while ensuring high classification accuracy. Classification accuracy is commonly used to evaluate the ability of normalization methods to preserve the biological variations of interest since preserving these variations allows the data to exhibit the inherent classification trends of biological samples. However, there is currently no universally recognized metric to evaluate the reproducibility of the results. To address this issue, we proposed the metrics to evaluate the reproducibility of importance rankings of differential metabolites for group discrimination. Specifically, the LC score was used to evaluate the consistency of local rankings and the OD score was used to evaluate the difference of overall rankings. Including reproducibility evaluation metrics is necessary as low reproducibility can undermine the reliability of research findings, even if accuracy is high. Our findings indicate that the LC and OD scores can provide deeper insights. The LC score reveals the effectiveness of normalization methods in preserving key biological variations of interest by evaluating the consistency of local rankings, while the OD score reflects the normalization method's ability to maintain the overall structure of feature importance data by measuring the difference of overall rankings. EigenRF effectively decreased the RSD of log-transformed data, though it was not the method with the lowest RSD among all normalization methods (Fig. S1A, B, S3A, B, 3A and B†). However, EigenRF demonstrated optimal reproducibility scores (Fig. 2B, C, E, F, 5B and C). Conversely, while significantly reducing RSD, some methods such as RSC, SERRF, WaveICA, and ISWSVR exhibited poor reproducibility scores on certain datasets, indicating a loss of experimental result reliability that could mislead conclusions. This suggests that reproducibility scores can evaluate the performance of normalization methods more comprehensively, especially for complex biomedical data. Therefore, we recommend considering LC and OD scores alongside RSD when assessing normalization methods for a more accurate and multi-perspective performance assessment.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4ay01569j |
This journal is © The Royal Society of Chemistry 2025 |