Joseph Mason *a, Harry Wilders bc, David J. Fallon b, Ross P. Thomas bc, Jacob T. Bush b, Nicholas C. O. Tomkinson c and Francesco Rianjongdee a
a Medicinal Chemistry, GSK Medicines Research Centre, Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, UK. E-mail: joe.mason.chem@gmail.com
b Chemical Biology, GSK Medicines Research Centre, Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, UK
c Department of Pure and Applied Chemistry, University of Strathclyde, 295 Cathedral Street, Glasgow, G1 1XL, UK
First published on 19th October 2023
High-throughput experimentation for chemistry and chemical biology has emerged as a highly impactful technology, particularly when applied to Direct-to-Biology (D2B). Analysis of the rich datasets which come from this mode of experimentation continues to be the rate-limiting step in reaction optimisation and the submission of compounds for biological assay. We present PyParse, an automated, accurate and accessible program for data extraction from high-throughput chemistry, and provide real-life examples of situations in which PyParse can provide dramatic improvements in the speed and accuracy of analysing plate data. This software package has been made available through a GitHub repository under an open-source Apache 2.0 licence, to facilitate the widespread adoption of high-throughput chemistry and enable the creation of standardised chemistry datasets for reaction prediction.
Commercial software packages are available to automate the analysis of LC-MS data for high-throughput experimentation;25–29 they are typically agnostic to the brand of LC-MS machine used and are capable of processing the raw data from the instrument directly. However, the associated cost of these commercial solutions can be prohibitive. Furthermore, not all tools are specific to the analysis of high-throughput chemistry, and the closed nature of proprietary software development may hinder customisation for each user's needs. Solutions from the academic community include the method published by Steimbach et al. using Visual Basic and Spotfire;30 however, selection of the appropriate LC-MS peaks was performed manually. Osipyan et al. developed and published their Python tool to analyse plate-based mass-spectrometry data, using the observed mass-to-charge ratios to predict the abundance of the desired product.31 Most recently, Haas et al. published MOCCA, an open access Python tool for the analysis of plate-based high performance liquid chromatography (HPLC) data;32 other open-access tools for HPLC data are also available.33 Whilst all of these options provide certain functionality for specific users, we believed that there was a need to develop an open-source solution that would be suitable for the analysis of LC-MS data from both D2B and reaction optimisation experiments. An open-source solution was particularly attractive to us: the ability to adapt as required and implement into alternative workflows was considered a key advantage over commercial solutions.
The result of our endeavour was a Python program capable of reading and analysing (or “parsing”) LC-MS data for high-throughput chemistry experiments, which has been named PyParse. This program, already used across multiple departments at GSK, has been released under the open-source Apache 2.0 licence on GitHub with documentation and sample data.34 This article, intended as a user guide for the bench chemist, provides further detail and real-life examples of using PyParse (Fig. 1).
Fig. 1 Schema for the high-throughput experimentation workflow using liquid chromatography-mass spectrometry (LC-MS).
The output from PyParse comprises a heatmap, annotated visualisations, and verbose descriptions to precisely describe how the analysis was computed. Tabulated results are also generated in the form of .csv files to facilitate the upload of the data to a suitable database, thus enabling data mining and reaction prediction efforts. All visualisations are presented to the user in an HTML web page, chosen for its flexibility in styling and layout, as well as its intuitive user interface.
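The step from a tabulated result to a plate heatmap can be sketched as follows. This is a minimal illustration only: the column names `well` and `purity` are assumptions made for the example and are not taken from the PyParse output schema, and a 384-well plate (rows A to P, columns 1 to 24) is assumed.

```python
import csv
import io
import string

ROWS, COLS = 16, 24  # a 384-well plate: rows A-P, columns 1-24


def well_to_index(well_id):
    """Convert a well ID such as 'C12' to a zero-based (row, column) pair."""
    row = string.ascii_uppercase.index(well_id[0].upper())
    col = int(well_id[1:]) - 1
    if not (0 <= row < ROWS and 0 <= col < COLS):
        raise ValueError(f"{well_id!r} is outside a {ROWS}x{COLS} plate")
    return row, col


def plate_grid(csv_text, value_column="purity"):
    """Build a ROWS x COLS grid of values; wells with no entry stay None."""
    grid = [[None] * COLS for _ in range(ROWS)]
    for record in csv.DictReader(io.StringIO(csv_text)):
        row, col = well_to_index(record["well"])
        grid[row][col] = float(record[value_column])
    return grid


# Hypothetical table; the column names are assumptions, not PyParse's schema.
example = "well,purity\nA1,85.2\nC12,92.4\nP24,10.0\n"
grid = plate_grid(example)
```

A grid in this shape can be passed directly to most plotting libraries to draw a heatmap, or written back out for upload to a database.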
As part of the output, PyParse also generates a summary table (see Fig. S1 of ESI†); here, the retention time is reported for each compound, along with the ID of the well with the highest purity or ratio to internal standard. Two columns labelled “Overlap Detection” and “Potential Conflicts” are also present; they are designed to alert the user if PyParse has detected that the reported purity may not be reliable. “Overlap Detection” finds where a second peak has overlapped with the product peak in the most successful well. The “Potential Conflicts” column alerts the user to cases where the compound peak eluted at the same retention time as that observed for another compound, potentially compromising the reported LC-MS UV area. In each case, these warnings notify the chemist to scrutinise that well manually, though it should be noted that it remains the user's responsibility to verify the quality of the LC-MS data.
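The idea behind the “Potential Conflicts” warning can be illustrated with a short sketch. It is not the PyParse implementation: the function name, the data layout and the 0.02 minute tolerance are all assumptions chosen for the example. The sketch flags any pair of compounds whose assigned retention times fall within the tolerance of one another.

```python
from itertools import combinations


def potential_conflicts(retention_times, tolerance=0.02):
    """Return pairs of compound names whose retention times (in minutes)
    differ by no more than `tolerance`. The tolerance is an assumed value."""
    conflicts = []
    for (name_a, rt_a), (name_b, rt_b) in combinations(
        sorted(retention_times.items()), 2
    ):
        if abs(rt_a - rt_b) <= tolerance:
            conflicts.append((name_a, name_b))
    return conflicts


# Illustrative retention times, not data from the article.
rts = {"product": 0.71, "byproduct": 0.72, "starting material": 0.45}
flagged = potential_conflicts(rts)
```

Here `flagged` contains the single pair `("byproduct", "product")`, which is the kind of case where the reported UV area for either compound deserves a manual check.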
The comprehensive outputs are designed to allow the bench chemist to interpret the data more easily, leading to the submission of compounds to a biological assay, optimised conditions for a chemical transformation, or new insights into the reaction mechanism.
In the original publication by Thomas et al.,13 more than 1000 reactive fragments were synthesised via an amide coupling reaction (Fig. 2A), then screened using a Photoaffinity Bit screening platform. The purity of each fragment was estimated from the LC-MS UV peak percentage area, which was determined manually for each reaction. This in-house dataset, which had already been carefully analysed, represented the ideal opportunity to evaluate the performance and fidelity of PyParse. The original LC-MS data files obtained by Thomas et al., covering the four separate 384-well plates, were re-analysed by PyParse, where only the SMILES (Simplified Molecular Input Line Entry System) of the product in each well was provided in the platemap. We deliberately opted not to provide the observed retention time to PyParse, as we believed this represented a fairer comparison with the manual analysis conducted by Thomas et al. The eight minutes and 36 seconds taken for PyParse to analyse the data and prepare >2000 separate visualisations represented a substantial time saving compared with the original manual analysis, which was estimated to have taken over 34 h (two minutes per well × 1026 wells, Fig. 2B). A meta-analysis was then conducted, whereby the output from PyParse was compared against the analysis by Thomas et al. Results were assigned to one of three categories: “Correct”, where the assignment from PyParse matched the original publication; “Incorrect”, where the assignment by PyParse did not match; and “Ambiguous”, where closer (manual) inspection of the LC-MS data revealed that there were multiple peaks which contained the required m/z (mass:charge ratio) for the desired product, resulting in a differing but inconclusive result (see ESI,† page S14, for further details). This meta-analysis revealed that the PyParse analysis matched the original assignment in over 95% of cases (Fig. 2C). 
We concluded this was a sufficient level of reliability, given that this experiment of >1000 wells contained numerous complex reaction profiles and/or products with poor UV absorption characteristics (see ESI, Fig. S3 and S4†).
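The time comparison quoted above can be checked with a few lines of arithmetic. All inputs are the figures stated in the text (two minutes per well, 1026 wells, eight minutes and 36 seconds); only the variable names are introduced here.

```python
WELLS = 1026
MANUAL_MIN_PER_WELL = 2          # estimate quoted in the text
PYPARSE_SECONDS = 8 * 60 + 36    # eight minutes and 36 seconds

manual_seconds = WELLS * MANUAL_MIN_PER_WELL * 60
manual_hours = manual_seconds / 3600
speedup = manual_seconds / PYPARSE_SECONDS

print(f"manual: {manual_hours:.1f} h, PyParse: {PYPARSE_SECONDS} s, "
      f"ratio: {speedup:.0f}x")
# prints "manual: 34.2 h, PyParse: 516 s, ratio: 239x"
```

On these figures the automated analysis is roughly 240 times faster than the estimated manual effort.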
Fig. 2 Meta-analysis for the in-house parallel synthesis of 1026 reactive fragments, previously published by Thomas et al.13 (A) Generic scheme for the amide coupling reaction, and the stacked heatmaps that were generated by PyParse; (B) comparison of the time taken to analyse the full collection of LC-MS data (note: manual analysis was estimated at two minutes per well × 1026 wells); (C) donut chart comparing the assignment by PyParse with the original manual analysis: “Correct”, where the assignments by PyParse and Thomas et al. were in agreement; “Incorrect”, where PyParse failed to find the peak assigned by Thomas et al.; “Ambiguous”, where closer inspection of the LC-MS data revealed there were multiple peaks which contained the required m/z for the product, resulting in an ambiguous assignment.
Overall, the meta-analysis demonstrates PyParse's high level of accuracy and performance, and shows that the confidence placed in PyParse for in-house D2B analyses is well founded.
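The assignments compared in this meta-analysis rest on finding LC-MS peaks whose mass spectrum contains the m/z expected for the product. The sketch below shows one way such a match could be made. It is an illustration under stated assumptions, not PyParse's method: a singly charged [M+H]+ ion, a 0.5 Da tolerance, a molecular formula supplied as element counts (deriving it from SMILES would need a cheminformatics toolkit, which is not shown), and an invented peak list.

```python
# Monoisotopic masses (Da) of the most abundant isotopes.
MONOISOTOPIC = {"H": 1.00782503, "C": 12.0, "N": 14.0030740, "O": 15.9949146}
PROTON = 1.00727647  # mass of a proton in Da


def protonated_mz(formula):
    """Expected m/z of the singly charged [M+H]+ ion for an element-count dict."""
    neutral = sum(MONOISOTOPIC[element] * count
                  for element, count in formula.items())
    return neutral + PROTON


def peaks_with_mz(peaks, target_mz, tolerance=0.5):
    """Return the peaks whose observed m/z values include the target,
    within an assumed tolerance in Da."""
    return [peak for peak in peaks
            if any(abs(mz - target_mz) <= tolerance for mz in peak["mz"])]


# Illustrative compound and peak list (not data from the article).
benzamide = {"C": 7, "H": 7, "N": 1, "O": 1}
target = protonated_mz(benzamide)
peaks = [
    {"rt": 0.45, "area": 12.0, "mz": [105.0, 150.1]},
    {"rt": 0.71, "area": 60.0, "mz": [122.1, 144.0]},
    {"rt": 0.90, "area": 8.0, "mz": [122.0]},
]
hits = peaks_with_mz(peaks, target)
```

In this invented example two peaks contain the expected m/z, which mirrors the “Ambiguous” category: the mass alone cannot decide between them, so further evidence such as retention time is needed.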
The second example of a PyParse analysis relates to a reaction screening plate that was conducted during in-house efforts to validate a plate design for the C–H activation of oxazoles at the C2 position (Scheme 1A). The plate design (Scheme 1B and Fig. S5 of ESI†) was inspired by numerous publications in the field,37–45 and hinged on the use of a palladium catalyst with a copper co-catalyst to enhance C2 selectivity. In the planning stage, we identified that the regioselectivity of the transformation posed a key risk, as both the C2 and C5 positions are precedented to undergo Pd-catalysed arylation under these conditions.42 As both the identity of the palladium catalyst and the base had been reported to influence the regioselectivity, the plate design focused on these two parameters (Scheme 1B).46,47 After heating for 18 h at 100 °C, LC-MS data were obtained for the plate and analysed using PyParse. The hit validation graph for the desired product 4.1, generated automatically by PyParse (Scheme 1C), proved exceptionally useful in determining the outcome of this optimisation. This graph plots all hits (defined as any LC-MS peak which contains the expected m/z) for a particular compound according to their retention time against the well in which they were found. The markers are sized by the UV percentage peak area of the hit, shaded by whether they were included (black) or excluded (red) from the final output (i.e., the heatmap), and shaped by the cluster to which they were assigned by the PyParse algorithm. The purpose of this graph is to allow the user to visualise the results of this algorithm;34 successful validation is indicated by a horizontal straight line of black markers, consistent with a set of peaks which have the same retention time irrespective of the well. Using the hit validation graph for the product (Scheme 1C), two predominant sets of hits at 0.71 and 0.73 minutes were rapidly identified. 
Each set had a consistent retention time, thus indicating that these were the likely retention times of the two expected regioisomers 4.1 and 4.2. By resubmitting the data to PyParse and specifying a retention time of either 0.71 or 0.73 minutes, heatmaps (see ESI†) and tabular datasets were generated for each regioisomer in turn. We also noted the presence of the bis-arylation product 4.3 in the plate, identified by the mass:charge ratio, which comes as a consequence of over-reaction. The heatmaps (see Fig. S7 and S8 of ESI†), in conjunction with manual analysis of the tabular datasets, enabled us to identify wells C12 and D6 as the best performing conditions for 4.1 and 4.2 respectively (Scheme 1D). The subsequent scale up (see ESI†) and isolation of the two products allowed us to confirm the regiochemistry for each isomer by HMBC and ROESY.
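The reasoning behind the hit validation graph, and the subsequent resubmission at a chosen retention time, can be sketched in a few lines. This is a deliberately simple stand-in and should not be read as the clustering algorithm used by PyParse: hits are sorted by retention time and split wherever the gap to the previous hit exceeds an assumed threshold, and a second helper keeps only hits within an assumed window of a user-specified retention time. The function names, thresholds and hit data are all hypothetical.

```python
def cluster_by_retention_time(hits, gap=0.01):
    """Greedy 1-D clustering: sort hits by retention time (minutes) and
    start a new cluster whenever the gap to the previous hit exceeds `gap`."""
    clusters = []
    for hit in sorted(hits, key=lambda h: h["rt"]):
        if clusters and hit["rt"] - clusters[-1][-1]["rt"] <= gap:
            clusters[-1].append(hit)
        else:
            clusters.append([hit])
    return clusters


def summarise(cluster):
    """Mean retention time and sorted well IDs for one cluster."""
    rts = [h["rt"] for h in cluster]
    return {"mean_rt": round(sum(rts) / len(rts), 3),
            "wells": sorted(h["well"] for h in cluster)}


def select_at(hits, rt, window=0.01):
    """Keep only hits within `window` minutes of a user-specified retention time."""
    return [h for h in hits if abs(h["rt"] - rt) <= window]


# Illustrative hits: peaks containing the expected m/z (not data from the article).
hits = [
    {"well": "A1", "rt": 0.710, "area": 40.0},
    {"well": "A2", "rt": 0.712, "area": 35.0},
    {"well": "A1", "rt": 0.730, "area": 20.0},
    {"well": "B3", "rt": 0.732, "area": 55.0},
    {"well": "B4", "rt": 0.500, "area": 5.0},
]
summaries = [summarise(c) for c in cluster_by_retention_time(hits)]
isomer_a = select_at(hits, 0.71)
isomer_b = select_at(hits, 0.73)
```

With these invented hits, two clusters near 0.71 and 0.73 minutes each span more than one well, while the isolated hit at 0.50 minutes forms its own cluster. Selecting by retention time then separates the two candidate regioisomers, including in well A1 where both are present.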
The visualisations generated by PyParse for this challenging reaction optimisation plate facilitated the rapid identification of the appropriate reaction conditions for each regioisomer. These results led us to conclude that the plate design used was indeed effective for the optimisation of C–H activation reactions for oxazoles; investigations into a wider substrate scope are currently ongoing and will be reported in a future publication.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00167a
This journal is © The Royal Society of Chemistry 2023