Louise Malm(a) and Anneli Kruve*(a,b)
(a) Department of Materials and Environmental Chemistry, Stockholm University, 11418 Stockholm, Sweden. E-mail: anneli.kruve@su.se
(b) Department of Environmental Science, Stockholm University, 11418 Stockholm, Sweden
First published on 8th July 2025
Retention time (RT) is essential in evaluating the likelihood of candidate structures in nontarget screening (NTS) with liquid chromatography high resolution mass spectrometry (LC/HRMS). Approaches for estimating the RTs of candidate structures can broadly be divided into projection and prediction methods. The first approach takes advantage of public databases of RTs measured on similar chromatographic systems (CSsource) and projects these to the chromatographic system applied in the NTS (CSNTS) based on a small set of commonly analyzed chemicals. The second approach leverages machine learning (ML) model(s) trained on publicly available retention time data measured on one or more chromatographic systems (CStraining). Nevertheless, the CSsource and CStraining might differ substantially from CSNTS. Therefore, it is of interest to evaluate the generalizability of projection models and prediction models in CSs routinely applied in NTS. Here we take advantage of the recent NORMAN interlaboratory comparison where 41 known calibration chemicals and 45 suspects were analyzed to evaluate both the projection and prediction approaches on 37 CSs. The accuracy of both approaches was directly linked to the similarity of the CSs, and the pH of the mobile phase and the column chemistry were found to be most impactful. Furthermore, for cases where CSsource and CSNTS differ substantially but CStraining and CSNTS are similar, prediction models often performed on par with the projection models. These findings highlight the need to account for the mobile phase and column chemistry both in ML model training and when selecting an RT prediction model.
Regarding empirical analytical information, retention time (RT),5–8 collision cross-section (CCS),9–11 ionization mode, and adduct type12,13 have proven useful for discriminating candidate structures. Nevertheless, we14 and others4,15 have previously observed that RT has high potential in distinguishing between the candidate structures in terms of physical separation during analysis as well as evaluating candidate structures with a prediction model. This arises from the orthogonality of the chromatographic separation and HRMS data and indicates the high importance of incorporating RT in candidate structure prioritization workflows in NTS.
Comparing the RT of the detected LC/HRMS feature and the candidate structure can aid in prioritization by indicating less likely candidate structures, e.g., the candidate structures for which the predicted RT differs from the experimentally observed RT by more than the uncertainty limits. The RT of the candidate structures can be obtained by (1) projection of experimental reference retention times and (2) prediction of retention times based on the structure with machine learning (ML). The projection methods leverage RT databases16,17 of known structures measured on a different chromatographic system (CSsource)5 and are applicable only to chemicals that have already been studied experimentally. Nevertheless, the CSsource and the chromatographic system used in NTS (CSNTS) may differ due to equipment (dead volume, flow rate, and column length) as well as separation mechanisms (column chemistry and mobile phase). A small set of 10 to 50 chemicals measured on both CSs can be used to fit a generalized additive model (GAM) or similar model between RTs from the source and target CS, which can account for some of these differences. Later, the same model is applied to project the RT of the chemicals measured with the CSsource to the CSNTS.
In the initialization and validation of a well-known projection approach, PredRet,5 it was observed that the number of chemicals commonly analyzed by both CSs and the similarity of the retention mechanism (reversed phase vs. HILIC) impact the projection accuracy. Alternatively, retention time indices (RTIs)6,18,19 have been suggested to account for the effect of column length, flow rate, and dead volume on RT. Furthermore, retention time order (RTO) is known to be more stable than RT or RTI if the same physical parameters of the CS are changed. Nevertheless, RTO is likely to change for nominally equal CSs for close eluting/co-eluting chemicals due to the variations in the peak shape or with different batches of LC columns20 as well as changing mobile phase composition when acids or bases are analysed.21 A recent comparison of RTOs of datasets collected into RepoRT16 revealed the largest variations for CSs with different columns and mobile phases. In spite of these considerations, an extensive unbiased overview of the applicability of projection methods across different CSs and laboratories is lacking.
The need for experimentally determined RTs can be overcome by ML models6,8,9,22–32 predicting RT, RTI, or RTO. Despite obvious advantages, the prediction approach is presumed to underperform projection approaches. Such approaches require a relatively large and representative33 dataset for model training and may require a close match of the CS used for collecting training data (CStraining) and the CSNTS or conversion of RT to RTIs with a predetermined calibration mix.6 To overcome this limitation, a complementing prediction approach with projection5 was suggested by Bouwmeester et al.34 and similar approaches have been successfully used thereafter.24,35 These approaches project the predicted RT or RTI to CSNTS based on predicted RTs of experimentally measured standards even if commonly measured chemicals are lacking. Nevertheless, a recent comparison by Kwon et al.36 suggested a large performance variability and generally low performance of the combination of ML and projection methods. Though the exact reason remained unclear, it can be hypothesised that chemical space overlap and CS similarity affect the applicability. Similar to the study by Kwon et al.36 benchmarking different RT, RTI, or RTO prediction methods on public datasets has been widely discussed;28 however, the contribution of variability in chemical space and CSs remains hard to evaluate. In part, this is due to the sparse overlap in the chemical space considered in different studies leading to a set of data where both the chemical space and CSs often vary simultaneously.
Here we leverage the recent NORMAN interlaboratory comparison37 to evaluate the performance of the RTI projection and prediction approaches across different CSs commonly used in NTS for environmental analysis, while keeping the chemical space constant. In particular, we are interested in answering the following questions: (1) how different are the CSs in terms of agreement in RT, RTI, and RTO; (2) how do CSs affect the accuracy of the projection methods and how does this constrain the best practices in NTS; and (3) how accurate are the prediction models in comparison with projection models. To answer these questions, firstly, we used the calibration chemicals to establish a GAM for the projection of RTIs from the CSsource to the CSNTS and evaluated these projections on the suspected chemicals. Secondly, we trained an ML model to predict the retention time from the structure of the suspected chemicals and evaluated its performance based on the root mean square errors (RMSEs). Lastly, we compared the projection and prediction approaches and contextualized the findings based on the similarity of the CSsource and CStraining to the CSNTS.
[Equation (1): image not available]
RTIs enabled accounting for differences in the flow rate and column length and thereby unifying the range of RTI values; however, the RTIs were sensitive to nonlinear gradient programs, that is, plateaus and changing gradient speeds. To account for such variations in the CSs, the RTI projection with a generalized additive model (GAM) was implemented. That is, the gam function from the mgcv package in R was used with the smooth function "s", and the basis dimension k of the smooth term was set to 6 for all projections. The GAMs were fitted on the calibration chemicals between the CSNTS and CSsource, and applied to predict the RTIs in the CSNTS for the suspects based on the "known" RTIs in the CSsource. Only suspects eluting in the calibration range were considered (RTI range 0 to 1000). Notably, the GAM minimizes the residual in the direction of the y-axis (like most models), leading to asymmetric projection accuracies, and therefore all-to-all CS combinations were considered.
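The index calculation and the calibration-range filter can be illustrated with a minimal Python sketch. It assumes that the RTI rescales RT linearly so that the first and last eluting calibrants map to 0 and 1000, which is consistent with the stated RTI range but is an assumption about the exact form of the equation. The function names and retention times are hypothetical.

```python
def retention_time_index(rt, rt_first, rt_last, scale=1000.0):
    """Linearly rescale a retention time to a calibrant-defined index.

    Assumption: the first eluting calibrant maps to 0 and the last to `scale`.
    """
    if rt_last <= rt_first:
        raise ValueError("rt_last must be greater than rt_first")
    return (rt - rt_first) / (rt_last - rt_first) * scale


def in_calibration_range(rti, low=0.0, high=1000.0):
    """True if the index lies inside the range spanned by the calibrants."""
    return low <= rti <= high


# Hypothetical retention times in minutes.
rt_first, rt_last = 2.0, 12.0
suspect_rts = {"suspect_A": 1.5, "suspect_B": 7.0, "suspect_C": 12.5}

rtis = {
    name: retention_time_index(rt, rt_first, rt_last)
    for name, rt in suspect_rts.items()
}
# Keep only suspects eluting between the first and last calibrant.
kept = {name: rti for name, rti in rtis.items() if in_calibration_range(rti)}
print(kept)
```

In this toy example only suspect_B is retained, because the other two suspects elute before the first or after the last calibrant.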
The similarity in the retention time profiles of CSs was evaluated based on the correlation of the RTs with Pearson correlation coefficients and RTOs with Spearman correlation coefficients. Generally, the agreement in RTs for two CSs (Pearson correlation) followed the same order as that based on RTOs (Spearman correlation), with a vast majority of the CS pairs showing R2 > 0.8. This is expected as both the correlation of RT and RTO values have their advantages and disadvantages. RTOs are less affected by segments of isocratic elution, while RTs are less affected by small differences in the retention time for close eluting or overlapping chemicals. Due to the good agreement of the Pearson and Spearman correlation coefficients (ESI 1†), only the RTO agreement (Spearman correlation) is discussed below.
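The difference between the two similarity measures can be sketched in plain Python. The example below is illustrative only: the retention times are hypothetical, and the helper functions are a simple stand-in for whatever statistical software was actually used. The second system compresses the early eluters, as an isocratic segment might, without changing the elution order.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (the RTOs)."""
    return pearson(ranks(x), ranks(y))


# Hypothetical RTs (min) of the same five chemicals on two systems.
rt_a = [1.0, 2.0, 3.0, 4.0, 5.0]
rt_b = [1.0, 1.1, 1.2, 4.0, 9.0]

r_rt = pearson(rt_a, rt_b)    # sensitive to the nonlinear spacing
r_rto = spearman(rt_a, rt_b)  # depends only on the elution order
print(round(r_rt, 3), round(r_rto, 3))
```

Here the elution order is identical, so the Spearman coefficient is 1, while the Pearson coefficient is lower because the RTs are nonlinearly related.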
The Spearman correlation coefficient of RTO values ranged from 0.59 to 0.999 (ESI 1†). Independent of the type of correlation coefficient, the highest similarity was observed for the two pairs of CSs where both CSs were implemented in the same lab, by making use of either a different LC/HRMS instrument (DS_QDF and DS_VQL) or the HRMS mode only (DS_GSB and DS_TDF). The latter is expected to yield different RTs only due to the random variability of the analysis and data processing.
The highest correlation coefficients for CSs implemented in different laboratories (DS_HT and DS_QBD) exceeded a Spearman ρ value of 0.999. These CSs used the same LC column (Agilent ZORBAX Eclipse Plus C18) and mobile phase (water and acetonitrile with 0.1% formic acid). Nevertheless, the CSs differed in LC instrumentation, gradient program, column temperature, injection volume, etc. This indicates that the retention order may be well preserved, given the same column chemistry and mobile phase, even if other parameters vary.
The highest Spearman rank correlation coefficient for CSs using different columns from different vendors (DS_BQW and DS_MT) was only marginally lower, with a Spearman ρ value of 0.998; notably, both CSs made use of a very similar mobile phase containing ammonium formate in both mobile phase components. This indicates a good agreement in RTOs despite the differences between the CSs. However, some pairs of CSs leveraging the same retention mechanism yielded Spearman ρ values below 0.9.
Further analysis with hierarchical clustering revealed several clusters of CSs with similar elution profiles that could be associated with the CS parameters. Firstly, one of the CSs showed substantially different RTOs in comparison with all other CSs. A closer investigation revealed that this was the only CS that combined RP and HILIC columns for separation (DS_DX), and the significant differences in the RTI order clearly indicate the orthogonality of this system. Due to the significant differences in the retention mechanism of RP and HILIC, only RP based CSs will be considered further in evaluating the projection or prediction method as both approaches assume the same retention mechanism.
Among the CSs using RP-LC, a cluster of five CSs, namely DS_JL, DS_QJS, DS_AWW, DS_JBQ, and DS_JSG, stands out. Furthermore, RTOs from DS_QQT showed a high correlation with the RTOs from the above CSs (Fig. 1). All these CSs made use of an ammonium salt as a mobile phase additive either alone or with formic acid. As the majority of the CSs used in this interlaboratory comparison made use of 0.1% formic acid as a water phase additive, the CSs in this cluster use a higher pH of the water phase compared to the majority of the CSs. Comparison of the RTOs from these CSs with RTOs from CSs using formic acid as a mobile phase additive revealed that higher RTOs were exclusively observed for chemicals with basic functional groups (pKa values between 3.6 and 8.1) with ammonium formate as a mobile phase additive. Similarly, lower retention time order values were observed for acidic chemicals, suggesting the importance of the acid–base equilibrium of the analytes in the RTO. In addition to the above-mentioned CSs, three additional systems leveraged ammonium salts as mobile phase additives, namely DS_STF, DS_GJT and DS_QSB. The RTOs from DS_STF and DS_GJT showed lower similarity to all other CSs, possibly due to the use of a different stationary phase. DS_STF used a biphenyl column, while DS_GJT used a C8 column. On the other hand, DS_QSB clustered together with CSs using formic acid as a water phase additive.
Fig. 1 The heatmap of the CSs based on the Spearman correlation coefficient of the RTOs. Red indicates high and blue indicates low Spearman correlation coefficients.
Two of the clusters of CSs (the cluster of DS_JWW, DS_AW, DS_JDF, DS_QDF, DS_VQL, DS_QQG, and DS_QV; and the cluster of DS_BIJ, DS_TL, DS_QSB, DS_HT, DS_QBD, DS_DBS, and DS_TSF) leveraged exclusively 0.1% formic acid as the water phase additive. Several CSs in the first of these clusters used Agilent Eclipse Plus or Poroshell columns, while in the second Phenomenex Kinetex EVO columns were used in several CSs. This is likely to indicate that, in addition to the mobile phase, the stationary phase chemistry has an impact on the agreement of the RTOs even if nominally all CSs considered here applied RP separation.
Importantly, the similarity in RTOs is impacted by the chemicals commonly detected by the two CSs. For example, if one of the LC/HRMS methods has not detected some of the acidic or basic compounds that are likely to change RTOs, such an effect will be left undetected here. Nevertheless, the cluster analysis demonstrates that the RTOs largely depend on the similarity of the CSs and the impact on the performance of projection and prediction methods needs close scrutiny.
In order to evaluate the peak spacing of CSs used here, RTs were normalized relative to the length of the elution program. The peak spacing was analyzed based on the standard deviation and cumulative distribution of the normalized RTs. It was observed that in some CSs the chemicals eluted almost equally spaced across the whole elution program (Fig. 2, green line). The highest standard deviation of the normalized and non-normalized RTs was observed for CS DS_TSJ, which made use of a relatively long elution program and the duration of the gradient segment exceeded 50% of the total elution program (run time).
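The peak spacing analysis described above can be sketched as follows. This is an illustrative example under stated assumptions: RTs are normalized to a 0 to 1 scale by dividing by the total run time, the population standard deviation is used, and the retention times are hypothetical.

```python
from statistics import pstdev


def normalize_rts(rts, run_time):
    """Express each RT as a fraction of the total elution program length."""
    return [rt / run_time for rt in rts]


def fraction_eluted_by(norm_rts, threshold):
    """Empirical cumulative distribution: share of chemicals with a
    normalized RT at or below `threshold`."""
    return sum(1 for value in norm_rts if value <= threshold) / len(norm_rts)


# Two hypothetical 10 min elution programs with four detected chemicals each.
evenly_spaced = normalize_rts([2.0, 4.0, 6.0, 8.0], run_time=10.0)
clustered = normalize_rts([4.0, 4.5, 5.0, 5.5], run_time=10.0)

spread_even = pstdev(evenly_spaced)
spread_clustered = pstdev(clustered)
print(round(spread_even, 3), round(spread_clustered, 3))
print(fraction_eluted_by(evenly_spaced, 0.5), fraction_eluted_by(clustered, 0.5))
```

A larger standard deviation of the normalized RTs indicates that the peaks are spread over a larger part of the elution program, whereas the cumulative fraction shows where in the program the peaks accumulate.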
Simultaneously, with some CSs a narrow range of normalized RTs was observed. For example, for CS DS_QV (Fig. 2, red line) most of the chemicals yielded RTI values between 200 and 600. This CS has long isocratic elution segments at the beginning and end of the elution program while the gradient segment is relatively short in comparison with the length of the elution program (<40%). The absolute peak spacing based on the distribution of the normalized RTs showed that CS DS_QV yielded one of the lowest RT standard deviations. The distribution of the normalized RTs is therefore also affected by the polarity of the chemicals included in the study. For example, it is fair to believe that incorporating more (very) low and high polarity chemicals would benefit this CS due to the isocratic elution segments. Nevertheless, the chemicals used in this study are chosen as representatives of the chemical contaminants in environmental water samples.
Furthermore, for CS DS_TSF (Fig. 2, yellow line) and many others, the majority of the normalized RTs remained below 400, hinting at a very tight spacing of the peaks. Simultaneously, for CS DS_MT the detailed analysis revealed that the narrow range of normalized RTs is also associated with the low number of detected chemicals. Namely, most of the late eluting chemicals incorporated in this study were undetected with the CS DS_MT, which impacted the analysis of the normalized RTs. The same was true for other CSs yielding the lowest maximum normalized RTs. Nevertheless, no direct association between the standard deviation of the normalized RTs and the number of detected chemicals was observed.
As a result, the analysis of peak spacing, retention time distribution and the peak detectability suggests that longer elution programs with proportionally larger segments of linear gradients provide a more uniform peak distribution. Importantly, this coincides with the insights from Anderson et al.,38 who found that longer gradient programs increase the number of chemical features detected with LC/HRMS based NTS. It can be simultaneously hypothesized that such elution programs also improve the identification of the detected chemicals based on the RTs due to the expected narrower relative uncertainty.
The necessity for the GAM for RTI projection results from the fact that RTI calculation, as proposed earlier,6 makes use of the first and last eluting calibration chemicals and is unaffected by the rest of the calibrants and their RT profiles. Thus, the RTI values are a linear projection of the RTs and overlook any nonlinearity arising from the different gradient profiles, including isocratic segments, used by different CSs. For example, the two CSs exemplified in Fig. 3 used BEH C18 columns and methanol as an organic modifier; however, they differed in the flow rate and organic modifier gradient as well as the position and length of the isocratic elution segment. As a result, the retention order of the chemicals and Spearman correlation of the RTIs from the two CSs are high. Nevertheless, the RTI values are nonlinearly associated, leading to an RMSE of 257.2 RTI units without the GAM. Here the GAM enables an effective projection of the RTIs from one system to another, and the RMSE drops to 55.0 RTI units.
We furthermore observed that the projection of the GAM was very sensitive to the overlap in the calibration chemicals detected by the two CSs. Most of all, the projections were unreliable for suspects outside of the calibration range (lower or higher RTs of suspects than those of calibrants), and all such suspects were omitted in the above analysis. As a result, the projection results are unavailable for 9.8% of the detected suspects across all combinations of CSs. In this sense, it is important to ensure that the calibrants elute across the eluent program.
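The effect of a nonlinear projection and of the calibration-range restriction can be illustrated with a small Python sketch. Note that it does not reproduce the mgcv GAM: piecewise-linear interpolation through the calibrant RTI pairs is swapped in as a simple stand-in, and all RTI values are hypothetical. Suspects outside the calibration range receive no projection.

```python
from bisect import bisect_right
from math import sqrt


def fit_projection(source_cal, target_cal):
    """Return a function projecting source RTIs to target RTIs.

    Stand-in for a fitted smooth: linear interpolation between calibrants.
    Assumes at least two calibrants with distinct source RTIs.
    """
    pairs = sorted(zip(source_cal, target_cal))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]

    def project(x):
        if x < xs[0] or x > xs[-1]:
            return None  # outside the calibration range: no projection
        i = min(bisect_right(xs, x), len(xs) - 1)
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

    return project


def rmse(predicted, observed):
    """Root mean square error between two equally long sequences."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


# Hypothetical calibrant RTIs on the source and NTS systems (nonlinear relation).
project = fit_projection([0, 250, 500, 750, 1000], [0, 100, 300, 650, 1000])

suspect_source = [125.0, 625.0]
suspect_observed = [60.0, 470.0]

rmse_unprojected = rmse(suspect_source, suspect_observed)
rmse_projected = rmse([project(x) for x in suspect_source], suspect_observed)
print(round(rmse_unprojected, 1), round(rmse_projected, 1))
print(project(1100.0))  # None: the suspect elutes after the last calibrant
```

In this toy example the interpolation reduces the RMSE substantially relative to using the source RTIs directly, mirroring the qualitative behaviour described for the GAM.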
Lastly, we were also interested in whether the projection accuracy of the calibrants can be used to assess the projection accuracy of the suspects. We therefore compared the RMSE values of the calibrants and suspects for each combination of the CSs. Ideally, the two RMSE values would not differ statistically significantly. Here we observed a weak correlation of the RMSE values of the calibrants and suspects (Pearson's squared correlation coefficient of 0.30). In 63 cases, the RMSE of the suspects was statistically significantly lower than that of the calibrants with a maximum difference of 4.7×. The RMSE was statistically significantly higher for suspects in 535 of the 1332 combinations of CSs and the maximum relative increase in the RMSE was over two orders of magnitude. As a consequence, the RMSE values of the calibrants are insufficient for evaluating the projection accuracy for the suspects for a large fraction of the combinations of CSs. We therefore suggest evaluating the projection accuracy with a separate set of quality control chemicals that are representative of the chemicals expected to be detected in the samples but are not used for fitting the GAM. This will allow for an independent RMSE evaluation, which can be further used as the uncertainty of the projection approach. For example, given the chemical similarity between the quality control chemicals and suspected chemicals, ∼95% of the "true" RTs of the suspected chemicals are expected to be within ±2 RMSE from the RT obtained with the GAM.
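The ±2 RMSE criterion can be applied to candidate prioritization as in the following minimal Python sketch. The function names, candidate names, and RTI values are hypothetical; the RMSE is assumed to come from an independent set of quality control chemicals.

```python
def within_uncertainty(observed_rti, candidate_rti, rmse, k=2.0):
    """True if the candidate RTI lies within ±k·RMSE of the observed RTI."""
    return abs(observed_rti - candidate_rti) <= k * rmse


def filter_candidates(observed_rti, candidates, rmse, k=2.0):
    """Return the names of candidates compatible with the observed RTI."""
    return [
        name
        for name, rti in candidates.items()
        if within_uncertainty(observed_rti, rti, rmse, k)
    ]


# Hypothetical feature and projected RTIs of three candidate structures.
observed = 500.0
qc_rmse = 55.0  # assumed RMSE from independent quality control chemicals
candidates = {"cand_A": 450.0, "cand_B": 620.0, "cand_C": 390.0}

plausible = filter_candidates(observed, candidates, qc_rmse)
print(plausible)
```

With an RMSE of 55 RTI units, the acceptance window is ±110 RTI units, so cand_B (120 units away) is flagged as a less likely candidate while the other two are retained.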
The RMSE of suspects across all CSs was 149.0 RTI units, while the overall MAD was 150.3 RTI units and 95% of the chemicals yielded RTI values below 289.8 units across all CSs (Table S1†). The R2 values showed the same trends as RMSE values; generally CSs with lower RMSE values showed higher R2 values of RTI predicted and RTI measured. The lowest R2 value was 0.12 (DS_GJT) and the highest value was 0.82 (DS_AWW). The performance of the prediction models varied starkly depending on the CS (Fig. 4, ESI 3, ESI/results/NORMAN_RTI_prediction_results_summary.csv†): the RMSE values for the individual CSs ranged from 86.8 to 219.7 RTI units. It must be recognized that the RMSE values of suspects are influenced by (1) the suitability of the training data for the CSs and chemical space in question; (2) the accuracy of the model; and (3) the accuracy of the projection of the predicted RTs to the RTIs of the CS. The RMSE values of the calibrants are primarily impacted by the former two factors and less by the projection accuracy as the experimental RTIs are directly used for fitting the projection model.
We observed that the performance in terms of RMSE was directly correlated with the similarity of the CSNTS (CS used for measuring the suspects) and CStraining (used for collecting the training data). All four CSNTS on which the trained models showed the lowest RMSEs for the suspects used an ammonium formate based mobile phase. Similarly, the CStraining used a mobile phase containing ammonium formate; however, different columns and mobile phase flow rates were used. Furthermore, for these CSs relatively low RMSEs of the calibrants were observed, also indicating a successful projection of the predicted RTs to experimental RTIs. This is in accordance with the RTO correlations observed above, where a good agreement of the RTOs for ammonium salt containing mobile phases was observed independent of the column. As a result, it is expected that predicted RTIs can be successfully used on CSs using similar mobile phases.
Generally, a weak agreement in RMSE values of the calibrants and suspects was observed, where CSs yielding relatively low values for one set also yielded low values for the other and vice versa (Pearson's squared correlation coefficient of 0.47). Nevertheless, in a handful of cases the RMSE values of the calibrants were low, while relatively high RMSE values were observed for suspects. No concrete reasons for the substantially higher RMSEs of suspects could be pinpointed either in terms of chemicals detected or the settings of CSs. Nevertheless, the high RMSEs of the calibrants were usually indicative of high RMSEs of the suspects and could be considered in the interpretation of the results.
Recent findings by Kwon et al.36 demonstrated that RT models directly trained on a small dataset might outperform ML models combined with projection models; however, the reasons remained unclear. Inspired by these findings, we trained separate RTI prediction models for each CS using calibration chemicals only (ESI 4†). The RMSE values ranged from 131.5 to 351.3 RTI units and the performance of the models clearly correlated with the number of training instances, where the correlation coefficient between the RMSE and number of training datapoints was −0.63. The model trained on the calibration data from the same CS never outperformed the ML model trained on a larger dataset followed by projection. On the other hand, the ML models trained on external large datasets outperformed the models trained on the calibrant data from the same CS in 11 cases (F-test, p < 0.05). In the remaining cases, the performance of the two models was statistically insignificantly different. This indicates that given sufficient overlap in the chemical space, ML models trained on larger datasets followed by the projection of the predicted RTIs may be preferred to in-house trained models if only limited training data are available.
All in all, the application of ML predicted retention times in the data from 37 CSs indicates that the reliability of the predictions is closely linked to the similarity in the CSs used for collecting the training data and the CS used in NTS. This highlights the need for using CSs in NTS that are similar to community standards if the applicability of ML models for structural elucidation or other tasks is desired.
The prediction models outperformed the projection models in seven CSs (DS_AWW, DS_DID, DS_JBQ, DS_JSG, DS_QJS, DS_QQT, and DS_YF). In these cases, with the exception of two CSs (DS_DID and DS_YF), the CSNTS mobile phase is similar to the mobile phase of CStraining (ammonium salt based buffer solution); therefore, the ML model for RTI prediction has been trained on data with a chromatographically similar mechanism. These five CSs were also among the six CSs with the lowest RMSE values for the prediction approach.
Furthermore, the comparison of projection and prediction methods is also impacted by the CSsource used in the projection approach. Here, most cases where the prediction method outperformed the projection method are characterized by a mismatch between the mobile phases of the CSsource and CSNTS and therefore high RMSEprojection values of the suspects (between 103.7 and 207.7 RTI units). Nevertheless, for the CSsource and CSNTS using a similar mobile phase, the projection method outperformed the prediction approach even if the CStraining closely matched the CS used in NTS.
This indicates that in the case where the CS used in NTS closely matches the one with experimentally available RTI values, the projection method should be preferred. However, if experimental data with similar CSs are lacking, and the ML model used in the prediction approach is trained on data from a similar CS, the prediction approach could yield similar or higher prediction accuracy than the projection approach from a less similar CS. Moreover, the prediction approach is applicable to all chemicals within the scope of the ML, even the ones that lack previously measured chromatographic data.
Furthermore, two commonly applied approaches for evaluating RT(I)s for candidate structures, the projection and prediction approaches, could be compared. As expected from the RTO correlations, the performance of both approaches was affected by the similarity in the mobile phase composition. Thus, projection methods generally outperformed prediction models; however, prediction models were found to occasionally outperform the projection method if the prediction model was trained on similar CSs. Furthermore, the prediction models trained on larger external data should be preferred over small in-house data in spite of mismatching CSs. As a result, we suggest using projection methods when the CSsource closely matches with the CSNTS and in other cases, prediction models trained on large representative datasets should be preferred due to their accuracy and application scope.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5an00323g
This journal is © The Royal Society of Chemistry 2025 |