Liwen Songᵃ, Jasmine G. Reeseᵇ, Michael A. Plattᵃ, Claire Lewisᵇ, Annabel S. J. Eardley-Bruntᵃ, Bo Sunᵇ, Olaf Ansorgeᵇ and Claire Vallance*ᵃ
ᵃ Department of Chemistry, University of Oxford, Chemistry Research Laboratory, 12 Mansfield Rd, Oxford OX1 3TA, UK. E-mail: claire.vallance@chem.ox.ac.uk
ᵇ Academic Unit of Neuropathology, Nuffield Department of Clinical Neurosciences, University of Oxford, John Radcliffe Hospital, Oxford OX3 9DU, UK
First published on 12th May 2025
The use of rapid mass spectrometry techniques, such as atmospheric-solids-analysis-probe mass spectrometry (ASAP-MS), in the analysis of metabolite patterns in clinical samples holds significant promise for developing new diagnostic tools and enabling rapid disease screening. The rapid measurement times, ease of use, and relatively low cost of ASAP-MS make it an appealing option for use in clinical settings. However, despite the potential of such approaches, a number of important experimental considerations are often overlooked. As well as instrument-specific choices and settings, these include the treatment of background noise and/or contaminant peaks in the mass spectra, and the influence of consumables, different users, and batch effects more generally. The present study assesses the impact of these various factors on measurement accuracy and reproducibility, using human brain and cerebrospinal fluid samples as examples. Based on our results, we make a series of recommendations relating to optimisation of measurement and cleaning protocols, consumable selection, and batch effect detection and correction, in order to optimise the reliability and reproducibility of ASAP-MS measurements in clinical settings.
Despite its promise for clinical use, obtaining high-quality and trustworthy data sets with ASAP-MS remains a challenge, often owing to the lack of a carefully standardised sample handling and measurement procedure. More broadly, in a survey of 1576 scientists,12 over 70% reported difficulties in reproducing others’ experiments, and more than half had faced challenges in repeating their own experiments.‡ Almost 90% of the participating chemists reported having failed to replicate an experiment. While not directly related to ASAP-MS, this survey highlights the critical need for standardisation and careful experimental design. In the context of ASAP-MS, this requires careful investigation and optimisation of a number of key experimental parameters, in order to generate trustworthy and reproducible clinical data sets and thereby enhance the reliability of ASAP-MS for clinical applications.
In a previous study, we addressed standardisation issues in the measurement of human plasma using ASAP-MS.15 The present study builds on this earlier work to further optimise the methodology for generating meaningful and trustworthy clinical data sets, in this case on samples of human brain and cerebrospinal fluid. We investigate the influence on the mass spectra of calibration and post-calibration procedures, including the presence of residual calibration mix within the ion source;16 cooling of the probe tip following recording of background spectra; probe cleaning procedures; contamination from consumables such as lens tissue and sample storage containers;17 and variation between different instrument users. Lack of standardisation in any of these can lead to variations in the data in the form of contaminant peaks, changes in ionisation probability for some or all sample components, or sample degradation. If these variations are interpreted as significant features by classification models, the results can be severely skewed.
As part of the study, we also consider batch effects in some detail. As the name suggests, batch effects arise when samples are measured in different ‘batches’, resulting in systematic differences between subsets of data within a large data set.18–20 Such effects have become increasingly important in the age of big data.21 Some inter-batch differences may be reasonably easy to explain (e.g. batches of data from different labs, recorded using different instruments or experimental protocols, or subjected to different data processing protocols). However, even if all experimental parameters are standardised as comprehensively as possible, some batch effects will usually remain. If these are not considered carefully and corrected for as far as possible, they can mask – or worse, mimic – biological variation, leading to highly misleading results. The conclusions of more than one study have been found to be invalid owing to improper treatment of such effects; for example, Mertens et al.20 highlighted a number of clinical mass spectrometry studies in which poorly designed experiments resulted in batch variation being perfectly confounded with biological variation. Although these experiments yielded high classification accuracies during machine learning training, the affected data sets and conclusions were ultimately considered useless and abandoned. In the present study, alongside standardising the experimental protocol as far as possible, we evaluate methods for batch effect correction, and make a series of recommendations on the optimised use of ASAP-MS for generating clinical data sets.
Samples were prepared for analysis as follows:
1. Brain samples were acclimatised to −20 °C and mounted onto a cryostat block using optimal cutting temperature (OCT) compound, ensuring that cut sections were not contaminated with OCT. Three 10 μm sections were obtained from each sample, and transferred to a polypropylene sample tube (see later for details of the various tubes employed). To create a suitable medium for homogenisation, 100 μL of LC-MS grade water (Fisher) was added to the tube. A bead homogeniser (OMNI International bead ruptor elite) was then employed to thoroughly homogenise the sample (see Table S1 of the ESI† for homogeniser settings).
2. CSF samples were thawed from −80 °C and centrifuged at 12000g for 15 minutes at 4 °C to separate the cellular components and particulate matter from the liquid fraction. The supernatant was then pipetted into a new tube, taking care not to disturb the pellet.
All measurements were made on an Advion Expression version L compact quadrupole mass spectrometer equipped with an ASAP ion source and controlled by Advion Mass Express data acquisition software (version 6.9.38.1). The ion source was run in ‘high temperature, low fragmentation’ positive ion mode – see Table S2 of the ESI† for detailed instrument settings. Prior to measurements, the glass capillaries that form the tip of the ASAP probe were baked in an oven at 250 °C for 30 minutes to remove surface contaminants as far as possible. To make a measurement, the probe was fitted with a clean glass capillary and inserted into the ion source for 30 s to record a background mass spectrum. The probe was then removed from the ion source and allowed to cool before being brought into contact with a small amount of sample and reinserted into the ion source for a measurement time of 25 s. Owing to the high sensitivity of the instrument, the smallest practicable amount of sample should generally be transferred to the probe tip. When the probe is inserted into the ion source, the total ion signal should rise almost instantaneously and then decay rapidly over the next 20–30 s. Too much sample leads to signal saturation, characterised by high signals that do not decay or by anomalous time-dependent behaviour of the total ion signal. Examples can be found in a previous publication,15 in which we characterised this behaviour in detail.
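The saturation criterion above can be turned into a simple automated check. The sketch below is only an illustration under stated assumptions: `looks_saturated` is a hypothetical helper, the 0.9 s scan interval is taken from the data-processing description, and the 20 s decay window and 50% tail threshold are illustrative values, not settings used in this study.

```python
import numpy as np


def looks_saturated(tic, scan_interval_s=0.9, decay_window_s=20.0, max_tail_fraction=0.5):
    """Flag a sample insertion whose total ion current (TIC) fails to decay.

    tic: 1-D TIC trace for one insertion, starting at probe insertion.
    decay_window_s and max_tail_fraction are hypothetical thresholds: a
    well-loaded sample should fall well below its peak within ~20-30 s.
    """
    tic = np.asarray(tic, dtype=float)
    peak_idx = int(np.argmax(tic))
    # Scan index roughly decay_window_s after the TIC maximum
    tail_idx = peak_idx + int(round(decay_window_s / scan_interval_s))
    if tail_idx >= tic.size:
        return True  # trace too short to show any decay: treat as suspicious
    return bool(tic[tail_idx] > max_tail_fraction * tic[peak_idx])
```

A rapidly decaying trace passes, whereas a plateau that stays near its maximum is flagged for re-measurement with less sample.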
Mass spectra were exported for analysis using Advion Data Express data manipulation software (version 6.9.38.1) and processed using custom Python software (Jupyter Notebook, available at https://github.com/Liwensong2019/ASAP-MS). The mass spectrometer completes a scan every 900 ms during the acquisition period, and the raw data files contain all of these scans stored in sequence. Some of these scans correspond to ‘background’ recorded when the probe was not present in the ion source, and some correspond to signal recorded during each of the 30 s intervals when the probe loaded with sample was inserted into the ion source. The Python script identifies these regions based on the rapid rise in the total ion count on each insertion of the probe, and calculates the average and standard deviation of the mass spectra recorded over each 30 s sample measurement. The script also generates a background spectrum in an equivalent way from the 30 s background measurement, and subtracts this from each of the ‘signal’ measurements. The resulting background-corrected spectra for each sample are then normalised to unit area under the spectrum, and averaged to generate a single mass spectrum for each sample.
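A minimal sketch of this processing chain is shown below, assuming the exported scans are already available as a NumPy array of shape (scans, m/z bins). It is not the authors' notebook: the function names, the 10th-percentile baseline, the threefold rise used to detect insertions, and the 33-scan window (30 s at one scan per 0.9 s) are all assumptions chosen for illustration.

```python
import numpy as np

SCANS_PER_WINDOW = 33  # assumption: 30 s measurement / 0.9 s per scan ~ 33 scans


def find_insertions(tic, factor=3.0):
    """Return scan indices where the total ion count (TIC) rises sharply.

    Assumption: the idle baseline is the 10th percentile of the TIC trace, and
    an insertion is the first scan that exceeds factor x baseline.
    """
    tic = np.asarray(tic, dtype=float)
    baseline = np.percentile(tic, 10)
    above = tic > factor * baseline
    # Rising edges: scans above threshold whose predecessor was not
    return np.flatnonzero(above[1:] & ~above[:-1]) + 1


def average_sample_spectrum(scans, bg_start, sample_starts, window=SCANS_PER_WINDOW):
    """Background-subtract, area-normalise and average repeat measurements."""
    scans = np.asarray(scans, dtype=float)
    background = scans[bg_start:bg_start + window].mean(axis=0)
    spectra, spreads = [], []
    for start in sample_starts:
        segment = scans[start:start + window]
        spreads.append(segment.std(axis=0))          # per-bin scan-to-scan spread
        corrected = segment.mean(axis=0) - background
        spectra.append(corrected / corrected.sum())  # unit area under spectrum
    return np.mean(spectra, axis=0), spreads
```

In use, the first detected insertion would be passed as `bg_start` and the remaining insertions as `sample_starts`; negative values left by background subtraction are not clipped here.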
The effects of a number of experimental factors were investigated:
1. Calibrant and background effects: the mass spectrometer is calibrated daily with Advion APCI calibration tuning mix in order to ensure repeatable peak positions and widths in the mass spectra. A considerable amount of tuning mix is injected into the ion source through a capillary tube inlet during calibration, and any residual mix can have a significant effect on both the ‘background’ and ‘sample’ mass spectra recorded subsequently. Fig. S1† shows examples of background spectra recorded in the presence and absence of residual tuning mix. To assess the impact of residual tuning mix on clinical sample data sets, a CSF sample prepared as detailed above was split into two sub-samples, each of which was subjected to 25 repeat ASAP-MS measurements. The measurements on the first sub-sample were performed immediately after calibration, when residual tuning mix was present, and those on the second sub-sample were performed after allowing the instrument to run without sample introduction until a clean and stable background was observed. After analysing the repeatability of the two groups of measurements, we investigated two methods for removing residual tuning mix from the ion source. These involved flowing either (i) air or (ii) a 1:1 mix of LC-MS water and ethanol through the capillary inlet for 1.5 minutes, then running the instrument for 8.5 minutes to clear the ion source and allow the background signal to stabilise before making any measurements.
2. Temperature of ASAP probe tip: after each background measurement, during which the glass capillary that forms the tip of the ASAP probe is exposed to nitrogen gas at 400 °C, a period of cooling is required prior to sample loading and measurement. Cooling curves were measured using a thermal imaging camera (FLIR C3-X Compact Thermal Camera) immediately after removing the probe from the ion source (see Fig. S2 of the ESI† for a schematic of the experimental setup, and Table S3† for camera settings). The highest temperature region of the image was identified by the thermal camera's ‘hot spot’ function, and was used to define the measurement region corresponding to the probe tip. The temperature in the measurement region was recorded every second during imaging of the probe, and the results from ten repeat measurements were used to generate a cooling curve. The room temperature was 23 °C on the day of the measurement.
3. Glass capillary cleaning/reuse: in the interests of cost reduction and sustainability, we established in previous work15 that, with appropriate cleaning, the glass capillary forming the tip of the ASAP probe can be reused for up to five measurements on the same sample without any negative impact on the measured mass spectra. Between measurements, the capillary is cleaned by rinsing with deionised water, followed by gentle wiping and drying with lens tissue (cleaning method 1). In the present work, we investigated whether performance could be improved further by an additional step in which the cleaned capillary is inserted into the ASAP source to expose it to the hot nitrogen gas flow, then removed and allowed to cool for 20 s (cleaning method 2). The comparison was performed by analysing one frozen brain sample 25 times with each cleaning protocol, recording five repeat measurements with each of five different capillaries.
4. Consumables: to investigate the impact of different consumables on the mass spectra, we selected three brands of polypropylene tube and three brands of lens tissue. The tubes were 1.5 mL polypropylene tubes from Eppendorf (Tube Brand 1; PCR clean, manufactured without slip agents, plasticisers, and biocides), 2 mL polypropylene tubes from OMNI (Tube Brand 2; DNase- and RNase-free), and 10 mL optically clear polypropylene tubes from Appleton (Tube Brand 3; DNase-, RNase-, endotoxin-, and pyrogen-free). The lens tissues were MC-5 Lens Tissues from THORLABS (Lens Tissue Brand 1) and lens tissues from KimTec (Lens Tissue Brand 2) and Fisher, UK (Lens Tissue Brand 3). To investigate contaminant peaks arising from the polypropylene tubes, we first swabbed the inner wall of each empty tube with the glass capillary tip of the ASAP probe using moderate force, and recorded mass spectra. To investigate any diffusion of material from the tube walls into the solvent, we added LC-MS water to the tubes and left them overnight before recording mass spectra of the resulting solutions. To investigate contaminant peaks arising from the lens tissues, we mimicked a standard cleaning procedure by gently wiping the glass capillary tip of the ASAP probe with each brand of lens tissue and then recording mass spectra. To assess consistency, the measurements on each consumable were repeated five times.
5. Measurement repeatability between users: in order to investigate variations between different instrument operators in mass spectra recorded using a standardised protocol, we selected four representative users with varying scientific backgrounds and different levels of experience with ASAP-MS measurements. The four users made independent measurements on the same human cerebellum sample on the same day in a random order. We assessed both the repeatability within each individual user's measurements and the reproducibility across all four users.
The significance of any similarities and differences between the measurement groups was evaluated using several statistical methods. Since several of the data sets did not follow strictly normal distributions, single-pair comparisons used the Mann–Whitney U test23 as implemented in the SciPy (version 1.5.4) Python package24 (this package was also used for the Shapiro–Wilk test mentioned above). For comparisons among multiple groups, we used a two-step approach: a one-way analysis of variance (ANOVA) to determine whether there were overall differences among the groups, followed by Tukey's Honestly Significant Difference (HSD) test for pairwise comparisons.25 These tests were implemented using the Python statsmodels package (version 0.13.5).26 See Table S4 of the ESI† for further details. For all tests, significance was reported at three levels: p < 0.05, p < 0.01, and p < 0.001.
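The statistical workflow described above can be sketched as follows. This is an illustrative sketch rather than the authors' code: it swaps in SciPy's `tukey_hsd` (available from SciPy 1.8) for the statsmodels implementation used in the study, so that all tests come from one library, and the helper names are hypothetical.

```python
import numpy as np
from scipy import stats


def significance_label(p):
    """Map a p value onto the three significance levels used in the text."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"


def compare_pair(a, b):
    """Shapiro-Wilk normality check on each group, then a two-sided Mann-Whitney U test."""
    shapiro_p = [stats.shapiro(x).pvalue for x in (a, b)]
    mw = stats.mannwhitneyu(a, b, alternative="two-sided")
    return shapiro_p, mw.pvalue


def compare_groups(*groups):
    """One-way ANOVA for an overall difference, then Tukey's HSD for all pairs."""
    anova_p = stats.f_oneway(*groups).pvalue
    tukey = stats.tukey_hsd(*groups)  # SciPy >= 1.8 (the study used statsmodels)
    return anova_p, tukey.pvalue      # pvalue[i, j]: group i vs group j
```

For the four-user comparison, for example, each user's repeat measurements of a chosen feature would form one group passed to `compare_groups`.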
1. All 12 samples were measured within a single day, generating a data set that serves as a control without batch effects;
2. The 12 samples were measured in two batches on two different days, generating a data set with the potential presence of batch effects. The sample distribution between the two batches was randomised in order to mitigate any bias, with each batch including samples from both brain regions of three different patients.
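Patient-level randomisation of the kind used for Data Set 2 can be expressed in a few lines. In this sketch, `assign_batches` is a hypothetical helper and the patient identifiers are placeholders; assigning whole patients (with both brain regions) to a batch keeps each batch balanced by tissue type, as described above.

```python
import random

REGIONS = ("cerebellum", "anterior cingulate cortex")


def assign_batches(patient_ids, seed=None):
    """Randomly split patients into two batches, keeping both regions of a patient together."""
    rng = random.Random(seed)
    ids = list(patient_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    batches = {1: [], 2: []}
    for i, pid in enumerate(ids):
        batch = 1 if i < half else 2
        # Each patient contributes one sample per brain region to its batch
        batches[batch].extend((pid, region) for region in REGIONS)
    return batches
```

With six patients this yields two batches of six samples, each containing both regions from three patients.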
To examine the batch effects, PCA plots were generated for all the data points in each batch, with confidence circles centred at the centroid of each batch and extending to its furthest data point. The kruskal function from SciPy (version 1.5.4) was used to perform a Kruskal–Wallis (KW) H-test to evaluate the statistical significance of correlations between the principal components (PCs) and the batches.24,27 This dual approach enabled us to investigate rigorously any potential batch-related discrepancies. Having identified batch effects, we explored the efficacy of correcting for them using two batch-effect correction tools available in Python, namely ComBat and Independent Component Analysis (ICA).18,28,29
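The PCA, centroid-circle, and Kruskal–Wallis steps can be sketched as below. As assumptions for illustration, PCA is computed directly from a singular value decomposition of the mean-centred spectra (without feature scaling), and `pca_scores`, `batch_circles`, and `pc_batch_pvalues` are hypothetical helper names; `scipy.stats.kruskal` is the SciPy function named in the text.

```python
import numpy as np
from scipy.stats import kruskal


def pca_scores(X, n_components=3):
    """PCA scores from an SVD of the mean-centred data (rows = spectra)."""
    X = np.asarray(X, dtype=float)
    U, S, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def batch_circles(scores_2d, labels):
    """Centroid and radius (distance to the furthest point) for each batch."""
    labels = np.asarray(labels)
    circles = {}
    for b in np.unique(labels):
        pts = scores_2d[labels == b]
        centre = pts.mean(axis=0)
        radius = np.linalg.norm(pts - centre, axis=1).max()
        circles[b.item()] = (centre, radius)
    return circles


def pc_batch_pvalues(scores, labels):
    """Kruskal-Wallis H-test p value for each PC against the batch labels."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    return [kruskal(*(scores[labels == g, k] for g in groups)).pvalue
            for k in range(scores.shape[1])]
```

A small p value for a given PC indicates that the PC separates the batches, which is the signature of a batch effect.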
Unsurprisingly, ensuring a clean background and removing contaminant m/z peaks (from any source) as far as possible improves measurement repeatability. Before considering active methods for eliminating signals from residual tuning mix prior to measurement, we note that frequent cleaning of the ASAP ion source is an important measure for minimising background interference and enhancing data quality. The required cleaning frequency is sample-dependent, but as an example, when processing large numbers of plasma samples in our own laboratory we clean the source after around 150 measurements.
As explained in Section 2, we investigated two methods for flushing residual tuning mix from the ASAP ion source, using either air or a 50:50 mix of ethanol and LC-MS water. The results are shown in Fig. 1(b) as plots of total ion count (i.e. the integrated signal across the entire mass spectrum) as a function of time. Air (red line) or solvent (blue line) is flushed through the ion source for the first 90 seconds, after which the inlet valve is closed and the signal is recorded for a further 8.5 minutes. Compared with the air flush, flushing with solvent results in much higher signal levels during the flush, but considerably lower and more stable signal levels (see inset to figure) once equilibrium is reached after the flush. As well as reducing the overall ion count more effectively, the solvent flush was also more effective at removing the specific m/z peaks arising from the tuning mix (see Fig. S5 of the ESI†), and it was therefore the flushing method of choice for all subsequent measurements.
The cooling curve shows the expected exponential decay, with rapid cooling to a little over 100 °C occurring during the few seconds taken to remove the probe from the ion source and position it in front of the camera. The tip reaches a temperature of ∼37 °C after around 20 s, after which it is cool enough for loading of biological samples. After 30 s the tip has cooled to 26 °C, just a little over room temperature.
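The exponential cooling can be described by Newton's law of cooling, T(t) = T_room + ΔT0·exp(−t/τ), which gives the waiting time needed to reach a target temperature T* as t* = τ·ln[ΔT0/(T* − T_room)]. The sketch below fits this model with `scipy.optimize.curve_fit`. Fixing the room temperature at 23 °C follows the methods; the single-exponential model, the initial guess, and the helper names are assumptions, and in practice the input would be the mean of the ten repeat curves.

```python
import numpy as np
from scipy.optimize import curve_fit

T_ROOM = 23.0  # room temperature on the day of measurement (deg C)


def newton_cooling(t, delta0, tau):
    """Single-exponential (Newton) cooling towards room temperature."""
    return T_ROOM + delta0 * np.exp(-t / tau)


def fit_cooling(t, temps):
    """Fit the initial excess temperature delta0 and time constant tau."""
    t = np.asarray(t, dtype=float)
    temps = np.asarray(temps, dtype=float)
    p0 = (temps[0] - T_ROOM, 10.0)  # initial guess: tau ~ 10 s (assumption)
    (delta0, tau), _ = curve_fit(newton_cooling, t, temps, p0=p0)
    return delta0, tau


def time_to_reach(target, delta0, tau):
    """Time after removal at which the tip reaches the target temperature."""
    return tau * np.log(delta0 / (target - T_ROOM))
```

For example, with synthetic inputs ΔT0 = 77 °C and τ = 10 s (values chosen only to exercise the code, not fitted to the measured curve), the tip would reach 37 °C after 10·ln(77/14) ≈ 17 s.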
Fig. 3(c) and (d) show PCA plots for the two sets of 25 mass spectra recorded for a frozen brain sample using cleaning methods 1 and 2 (the complete set of mass spectra used in this analysis can be found in Fig. S6 and S7 of the ESI†). Panel (c) shows principal components for the full m/z 10–1000 mass range of the mass spectrometer, while panel (d) shows the results for a truncated mass range of m/z 200–300, chosen to isolate the contributions from mass peaks associated with the lens tissue used in the manual cleaning step. In both cases the spread in the data is reduced considerably when using cleaning method 2. Despite the clear reduction of spread in the PCA plot, a t test shows that this reduction is not statistically significant when the complete mass spectra are considered; however, it is highly significant (p < 0.001) over the m/z range in which the lens tissue peaks appear. These results suggest that while it can be difficult to detect significant influences of contaminants on the complete mass spectra, explaining why such contamination can sometimes be accidentally and subtly introduced into clinical data sets, local influences can be highly significant, and should be minimised wherever possible.
Fig. 3(d) shows box-and-whisker plots of the intensities recorded using both cleaning methods for the three most intense peaks arising from lens tissue residue. Significant reductions are seen in the intensities of the peaks at m/z 257 and 285, while the peak at m/z 275 is sufficiently low in intensity under all conditions that the additional heating step does not lead to a statistically significant reduction. Further analysis of these data can be found in Fig. S8 of the ESI.†
We can conclude that cleaning method 2 is superior to cleaning method 1, and this method was employed in all subsequent measurements. This includes all measurements reported in this manuscript, unless stated otherwise.
Considering first the sample tubes, brand 1 yields consistently low total ion signals and a small number of very low intensity peaks in the mass spectrum, in line with previous work by Canez et al.17 In contrast, brands 2 and 3 yield significantly higher total ion signals and – particularly in the case of brand 3 – a large number of high-intensity peaks in their mass spectra. A second set of measurements, made on LC-MS water left overnight in each brand of tube (see Fig. S9 of the ESI†), yielded very low signal for all three brands of tube. This suggests that all three tubes are generally suitable for use with ASAP-MS analysis as long as direct contact of the ASAP probe with the tube walls is avoided.§
Considering next the three brands of lens tissue, we see large total ion signals and numerous peaks in the mass spectra for all three brands. In all three cases, three of the most intense peaks appear at m/z 257, 275, and 285, with brand 1 yielding another peak of high intensity at m/z 338. While these signals have the potential to contaminate the sample mass spectra, as shown in Section 3.3, residue from lens tissue used to clean the ASAP probe tip can be removed effectively simply by inserting the probe into the hot N2 gas flow inside the mass spectrometer ion source for a short period of time prior to making measurements on samples.
We can conclude that careful and consistent selection of consumables, and clear and consistent handling protocols, are likely to be key in order to avoid the potential for brand-specific contamination of the sample mass spectra. New consumables should be evaluated carefully in order to establish their potential to contaminate mass spectra, either directly or via ion suppression effects. Direct contact of the sample or probe with the consumables should be avoided where possible, and rigorous cleaning techniques should be employed when reusing glass capillaries in the ASAP probe. Caution should also be exercised in situations where m/z peaks arising from consumables overlap with ‘biological’ peaks arising from the sample. Ideally, such peaks should not be used for classification, and in cases where this is not possible, interferences should be carefully considered and quantified.
The problem is illustrated quite clearly even using the small number of measurements employed in the present study. As explained in Section 2, we recorded spectra from 12 brain tissue samples from six patients, with cerebellum and anterior cingulate cortex samples for each patient. The patients were randomised into two different groups, with their samples comprising ‘batch 1’ and ‘batch 2’, respectively. To generate Data Set 1, which we can use as a control, measurements were made on all 12 samples (both batches) on the same day, while for Data Set 2 the two batches were analysed on different days. Fig. 6(a) and (b) show PCA plots (PC1 vs. PC2 and PC2 vs. PC3 in each case) for the two data sets, with the centroid rings superimposed as shaded regions in each case. For Data Set 1 (Fig. 6(a)), we see good overlap between the measurements made on the two batches, implying that they were recorded under consistent measurement conditions. For Data Set 2 (Fig. 6(b)) we see significant separation between the two sets of measurements, particularly in the first principal component, and more overlap between data points for the two different tissue types. To the best of our ability, the measurements were all made under identical conditions, so this is a clear example of a batch effect, in line with those observed by other authors.19 If not corrected or appropriately accounted for, the batch effect could easily confound any effects arising from true biological variability, and lead to the drawing of highly misleading conclusions. Some further analysis of the mass spectral features that contribute most significantly to the batch effects can be found in the ESI.†
In Fig. 6(c) and (d), we show the results of performing a batch correction on Data Set 2, using the ComBat and independent component analysis (ICA) methods, respectively. The ComBat algorithm18,28 assumes that batch effects affect many m/z peaks in similar ways and uses an empirical Bayesian approach to adjust for these effects. In the ICA method,29 the complete data set is factorised into components using one of a number of matrix factorisation methods, and the components that show significant correlation with the individual batches are removed. Inspecting the data in Fig. 6(c) and (d), we see that both methods yield a reduction in the batch-related separation along PC1 and improve the clustering according to sample type (cerebellum vs anterior cingulate cortex). However, the data corrected using the ComBat method still show separation by batch, while the separation is almost completely removed by the ICA method. The latter results in almost complete overlap of the centroid rings, similar to that observed in Data Set 1, which was recorded in a single batch. In the present example, the ICA method appears to perform better than the ComBat method, but this would need to be tested much more extensively with larger data sets before drawing any firm conclusions on this point.
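The component-removal idea behind the ICA method can be sketched as follows. For brevity this sketch assumes a plain SVD (PCA) factorisation in place of ICA, so it illustrates the general "factorise, test, remove, reconstruct" scheme rather than the ICA or ComBat implementations used in the study; `remove_batch_components`, the number of components, and the Kruskal–Wallis criterion at α = 0.05 are illustrative assumptions. Swapping in an ICA decomposition (e.g. scikit-learn's `FastICA`, which provides `inverse_transform`) would follow the same pattern.

```python
import numpy as np
from scipy.stats import kruskal


def remove_batch_components(X, batches, n_components=10, alpha=0.05):
    """Zero out factorised components whose scores differ significantly by batch.

    Assumption: SVD stands in for the ICA factorisation used in the study.
    Returns the corrected data and the indices of the removed components.
    """
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    k = min(n_components, S.size)
    scores = U[:, :k] * S[:k]
    # Variation not captured by the first k components is kept unchanged
    residual = (X - mean) - scores @ Vt[:k]
    removed = []
    for j in range(k):
        col = scores[:, j]
        if col.max() == col.min():
            continue  # Kruskal-Wallis is undefined when all values are equal
        groups = [col[batches == b] for b in np.unique(batches)]
        if kruskal(*groups).pvalue < alpha:
            removed.append(j)
    kept = scores.copy()
    kept[:, removed] = 0.0  # drop batch-correlated components
    return mean + kept @ Vt[:k] + residual, removed
```

Because the residual beyond the first k components is added back unchanged, only the flagged components are removed; however, any biological signal that happens to align with a batch-correlated component is removed along with it, which is one reason batch effects and biology cannot always be fully disentangled.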
Fig. 6(e) and (f) show the results of a more in-depth statistical analysis of the batch effect correction, in the form of a correlation analysis. Panel (e) shows p values for the correlation between batch and the first three principal components of Data Set 2 before and after batch correction using the ComBat and ICA methods. Before batch correction (red diamonds), the data show a statistically significant correlation with batch (in the plot they lie below the red dashed line indicating p < 0.05). Batch correction with the ComBat method (blue circles) reduces the correlation, but p still indicates a significant correlation, while the ICA method reduces the correlation substantially (p ≈ 0.5). Panel (f) explores the correlation between the principal components and sample type for Data Set 1, and for Data Set 2 before and after batch correction. For Data Set 1, recorded in a single batch, PC2 (and perhaps PC3) shows a statistically significant correlation with sample type. For Data Set 2, PC2 shows no significant correlation with sample type before batch correction. The correlation becomes significant following correction with the ComBat algorithm but not, interestingly given the apparently better overall performance of the ICA algorithm, following correction with the ICA algorithm. The correlation between sample type and PC3 becomes statistically significant for Data Set 2 following correction with either approach. These findings suggest that while batch effects can obscure true biological variation, batch effect correction methods can recover these biological differences to varying degrees. However, as noted by other authors,30 it is often not possible to remove batch effects completely.
Based on our small proof-of-concept demonstration, we can conclude cautiously that batch effects are significant in ASAP-MS measurements, but that they can be mitigated by the use of batch effect correction methods. However, further investigation with much larger data sets is needed before drawing any firm or quantitative conclusions on the most effective correction methods or the extent to which biological variations can be recovered.
(i) Ensuring utmost cleanliness of the ion source, which should be free from all contaminants, including residual calibration tuning mix;
(ii) Ensuring sufficient cooling of the ASAP probe tip before loading with biological samples;
(iii) Developing an appropriate cleaning protocol for the probe between measurements. Standard cleaning approaches can be enhanced by an additional step in which the probe tip is exposed to the hot N2 gas flow inside the ASAP ion source for a short period of time;
(iv) Consumables must be carefully evaluated for their potential to introduce contaminant peaks to the mass spectra, and should be chosen to minimise such contamination;
(v) All measurement procedures should be standardised as far as possible and written down clearly and comprehensively so that they can be followed as closely as possible by different users;
(vi) Even with strict adherence to well-designed standard operating procedures, it is not possible to avoid batch effects arising from both known (e.g. different users) and unknown causes, and these should be carefully considered and corrected where possible. In the present study, we have shown using a small set of sample data that both the ComBat and ICA methods for batch correction can mitigate these effects to varying degrees and preserve true biological variations, opening the way to further, more quantitative studies with larger data sets.
As a relatively low-cost, rapid, and straightforward measurement technique, ASAP-MS holds considerable appeal for clinical applications. However, it is still a relatively new technique in the clinical arena, and measurement protocols must be carefully optimised in order to maximise its potential. The work presented in this paper and in our previous study on plasma measurements15 lays down some of the foundations for achieving these goals.
Footnotes
† Electronic supplementary information (ESI) available: A large number of sample mass spectra relating to the discussion in the manuscript, as well as additional statistical analysis of some of the data, a diagram of the experimental set up for the thermal imaging measurements, and tables of parameter settings for the Advion ASAP-MS instrument, the OMNI homogeniser, and the thermal imaging camera. See DOI: https://doi.org/10.1039/d5an00166h
‡ ‘Repeatability’ in a measurement is achieved when the measurement consistently produces the same outcomes under identical circumstances (same operator, apparatus, laboratory, and within a short period of time), while ‘reproducibility’ is defined as close agreement between measurements performed using the same method and identical test material but under different circumstances (e.g. different operators, apparatus, laboratories, or after a significant interval of time).13,14
§ Note that all samples were homogenised in water. Other solvents were not tested, and in principle may yield different results.
This journal is © The Royal Society of Chemistry 2025