Yuewen Zhang‡§ᵃ, Maya A. Wright§ᵃ, Kadi L. Saar§ᵃᵇ, Pavankumar Challa§ᵃ, Alexey S. Morgunov§ᵃᶜ, Quentin A. E. Peterᵃ, Sean Devenishᶜ, Christopher M. Dobson¶ᵃ and Tuomas P. J. Knowles*ᵃᵇ
ᵃ Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK. E-mail: tpjk2@cam.ac.uk
ᵇ Cavendish Laboratory, Department of Physics, University of Cambridge, J J Thomson Ave, Cambridge CB3 0HE, UK
ᶜ Fluidic Analytics Ltd., Cambridge, UK
First published on 22nd March 2021
The ability to determine the identity of specific proteins is a critical challenge in many areas of cellular and molecular biology, and in medical diagnostics. Here, we present a machine learning aided microfluidic protein characterisation strategy that within a few minutes generates a three-dimensional fingerprint of a protein sample indicative of its amino acid composition and size and, thereby, creates a unique signature for the protein. By acquiring such multidimensional fingerprints for a set of ten proteins and using machine learning approaches to classify the fingerprints, we demonstrate that this strategy allows proteins to be classified at a high accuracy, even though classification using a single dimension is not possible. Moreover, we show that the acquired fingerprints correlate with the amino acid content of the samples, which makes it possible to identify proteins directly from their sequence without requiring any prior knowledge about the fingerprints. These findings suggest that such a multidimensional profiling strategy can lead to the development of a novel method for protein identification in a microfluidic format.
The possibility to detect the presence of hitherto unknown targets and perform explorative screening arises when affinity-reagent free protein analysis approaches are used. In this context, various forms of mass spectrometry have been widely used for protein identification for many decades due to their high sensitivity, resolution, accuracy and dynamic range.19,20 In a typical experiment, fragments of proteins are formed and separated through approaches such as liquid chromatography before their injection into a mass spectrometer.21–24 While top-down identification has allowed a number of different protein species to be characterised, its application becomes challenging in the limit of high molecular weight and low solubility species. Due to these limitations, less than 10% of the mammalian proteome can be accessed through these techniques.25 For the analysis of higher molecular weight species, bottom-up sequencing approaches have been developed, which usually involve proteolysis of a complex mixture of proteins followed by a chromatographic separation of the peptides prior to their sequencing through tandem mass spectrometry (MS/MS). Whether the analysis is performed in a top-down or bottom-up manner, mass spectrometry generally requires extensive sample preparation steps, often resulting in significant losses, and long experimental analysis times. Moreover, the presence of less abundant species is usually masked by more abundant ones, which prevents its effective use for detecting targets that are present at low concentrations, as is the case for biomarkers during the onset and early stages of diseases. Last but not least, its operation in the gas phase has made it challenging to extend the analysis to protein complexes that are held together through transient interactions.
Recently, different approaches that would enable overcoming some of the challenges encountered with mass spectrometry have been demonstrated and proposed. For instance, Swaminathan et al.26 have demonstrated the possibility of immobilising peptides onto a glass slide and measuring their fluorescence through total internal reflection microscopy in consecutive cycles of Edman degradation after selectively labelling lysine and cysteine residues. While demonstrating the first steps towards the feasibility of single molecule peptide fluorosequencing,27 the approach involves a number of consecutive Edman steps, setting a limit on the speed at which the analysis can be performed.
To open up the possibility of minute-scale liquid-phase protein identification, here, we devised and demonstrated a microfluidic platform that permits the identification of protein samples on a single device by obtaining their characteristic multidimensional physicochemical signatures. Specifically, by using a multi-wavelength detection system, we obtained readouts describing the tryptophan (Trp), tyrosine (Tyr) and lysine (Lys) content of the protein sample together with an estimate for its hydrodynamic radius (Fig. 1). By obtaining such multidimensional signatures for a total of ten proteins and using machine learning approaches for identifying the origin of a set of validation proteins, we showed that such a strategy can be used for identifying proteins with a high confidence. The characterisation and identification process is performed on unlabelled protein samples and on a minute timescale.
Finally, in order to account for any potential fluctuations in the power output of the LEDs, the intensities of standard calibration solutions (10 μM L-tryptophan and 10 μM 4-methylumbelliferone, both in 400 mM potassium borate buffer at pH 9.7) were recorded in a channel adjacent to the identification device itself. The measured characteristic tryptophan and tyrosine fluorescence values were then normalised by the former of these calibration readings and the lysine value by the latter.
The protein sample was then surrounded by a co-flowing buffer in order to monitor the lateral diffusion of the protein sample into an auxiliary carrier medium in space and in time. Such a strategy has been previously shown to yield the diffusion coefficients of protein samples.30 In particular, the device we used in this study was designed for the camera field of view (800 μm × 1000 μm) to include four distinct sections of this channel (blue highlighted region), so that a single image could be used to extract the diffusion coefficient as described earlier.31 The channels were imaged using the 280 nm excitation LED in combination with the tryptophan filter as the signal from tryptophan residues was stronger than the signal from tyrosine residues. The diffusion profiles on the micrographs were then fitted to simulated basis functions for particles of known radii, and each of the simulated profiles was compared to the measured profiles in order to extract the hydrodynamic radius of each sample.30–33
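The comparison of a measured diffusion profile against simulated basis functions can be sketched as follows. This is a minimal illustration under stated assumptions, not the fitting code used in this work: the function names, the hand-made toy profiles and the sum-of-squares residual are all hypothetical, and the Stokes-Einstein relation is included only to show how a hydrodynamic radius maps onto a diffusion coefficient for a spherical particle in water at room temperature.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def stokes_einstein_d(radius_m, temperature_k=298.15, viscosity_pa_s=8.9e-4):
    """Diffusion coefficient (m^2/s) of a sphere with the given hydrodynamic radius."""
    return K_B * temperature_k / (6.0 * math.pi * viscosity_pa_s * radius_m)


def best_matching_radius(measured, basis):
    """Return the radius whose simulated profile has the smallest squared residual.

    measured: list of intensities across the channel.
    basis: dict mapping a radius (nm) to a simulated profile of the same length.
    """
    def residual(profile):
        return sum((m - s) ** 2 for m, s in zip(measured, profile))

    return min(basis, key=lambda radius: residual(basis[radius]))


# Hypothetical, hand-made profiles purely for illustration:
# larger particles diffuse less, so their profiles stay narrower.
basis = {
    1.0: [0.10, 0.30, 0.20, 0.30, 0.10],
    3.0: [0.05, 0.25, 0.40, 0.25, 0.05],
    6.0: [0.00, 0.15, 0.70, 0.15, 0.00],
}
measured = [0.04, 0.26, 0.41, 0.24, 0.05]
radius_nm = best_matching_radius(measured, basis)
```

In practice the basis set would be generated by simulating the device geometry and flow conditions over a fine grid of radii, rather than listed by hand.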
Finally, downstream of the sizing unit, an on-chip latent labelling strategy was used to conjugate the lysine residues in each protein to o-phthaldialdehyde (OPA) dye molecules32,34 (Materials and methods). The characteristic fluorescence intensity from the OPA labelled lysine residues was measured (pink highlighted region) by switching the UV-LED light source to an LED light source with excitation at 365 nm wavelength (Materials and methods), at which unconjugated OPA molecules have been observed to show only minimal background fluorescence. The dimensions of the labelling channel were chosen such that the OPA dye and the protein sample would be able to mix for over 3 seconds before the measurement was taken, a time scale that we had previously shown allows quantitative insight into the abundance of lysine residues in proteins.32 The devices worked reliably with no major failure modes noted. To limit potential contamination between samples, each new measurement was performed in a separate device.
In summary, this strategy allowed us to obtain a four-dimensional signature for each protein sample using a single microfluidic platform and a dual-wavelength excitation system. One of the four measured parameters was later used for normalising the obtained fluorescent signals. This process ensured that the obtained signatures were independent of the sample concentration.
In order to eliminate concentration dependence, the measured signals in the tryptophan and tyrosine imaging channels were normalised by the signal in the lysine filter. This reduced the data structure to a three-dimensional signature but ensured that the obtained values were independent of the concentration of the protein that was used for analysis. Moreover, the measured intensities were corrected for fluctuations in the LED power by also measuring the fluorescence intensities of calibration solutions in a neighbouring channel, involving L-tryptophan and 4-methylumbelliferone molecules for the 280 nm and the 365 nm LED, respectively (Materials and methods).
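The two normalisation steps described above can be written compactly. The sketch below is illustrative only: the function and argument names are hypothetical, and it assumes that each raw intensity scales linearly with protein concentration, which is what makes the ratios concentration independent.

```python
def concentration_independent_signature(trp, tyr, lys, radius_nm, cal_280, cal_365):
    """Reduce four raw readouts to a three-dimensional signature.

    trp, tyr: raw intensities recorded with 280 nm excitation.
    lys: raw intensity of OPA-labelled lysines recorded with 365 nm excitation.
    cal_280: L-tryptophan calibration intensity (280 nm LED).
    cal_365: 4-methylumbelliferone calibration intensity (365 nm LED).
    """
    if min(lys, cal_280, cal_365) <= 0:
        raise ValueError("lysine and calibration intensities must be positive")
    # Correct each channel for fluctuations in the power of its LED.
    trp_corrected = trp / cal_280
    tyr_corrected = tyr / cal_280
    lys_corrected = lys / cal_365
    # Dividing by the lysine signal cancels the protein concentration.
    return (trp_corrected / lys_corrected,
            tyr_corrected / lys_corrected,
            radius_nm)


# Hypothetical readouts: the second sample is the first at twice the concentration.
signature_low = concentration_independent_signature(2.0, 1.0, 4.0, 3.1, 0.5, 0.8)
signature_high = concentration_independent_signature(4.0, 2.0, 8.0, 3.1, 0.5, 0.8)
```

Both calls return the same signature, which is the property the normalisation is designed to provide.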
The characteristic spaces that each of the ten analysed proteins occupied in a three-dimensional plot are shown in Fig. 3d, with the 1D projections shown in Fig. 3a–c and the underlying data summarised in ESI† Table S1. In particular, the three-dimensional visualised ellipsoids (Fig. 3d) were defined by the centres being the average of the four measurement points and their radii corresponding to the standard deviation of the four measurements. We noted that the ten analysed proteins varied in their physicochemical signatures, with Fig. 3d illustrating that it is likely that across a three-dimensional landscape each of the proteins acquires a different signature.
We note that all the experiments were performed with protein concentrations in the micromolar range (ESI† Table S1). Substantial decreases in these values would be possible. Specifically, we have previously demonstrated the possibility to detect proteins down to concentrations of around 100 nM using UV-fluorescent detection in microfluidic devices.28 This sensitivity limit could be improved when an advanced detection mechanism (e.g. confocal detection) or a higher laser power is used, or if the device is fabricated from a material that shows a lower degree of autofluorescence35 or in a manner where the autofluorescence from the microfluidic device would be suppressed.36
First, using the full data set of 10 classes of proteins with 4 experimental repeats for each class, leave-one-out cross-validation was used to assess the likelihood that a particular sample is classified as the correct protein. In particular, we argued that the errors in the measurements are likely to be Gaussian distributed and set out to use a multivariate Gaussian model for developing the classification algorithm. Specifically, multivariate Gaussian distributions were fitted to each of the ten protein classes with the means computed from the four repeats within each class, or from the three remaining repeats for the class from which the validation sample was removed. The covariance matrices were computed by combining the group variance (using either four or three repeats similarly to the means) with the global variance involving the full dataset of 39 data points excluding the validation sample. A weighting factor of 0.9 was used for the group variance and a weighting factor of 0.1 for the global variance to avoid singular covariance matrices and ensure computational stability while simultaneously taking advantage of the extra information about the system as the variances in the same dimension between the different classes are likely to be similar. Finally, the likelihood of each of the validation samples belonging to each of the protein classes was calculated by estimating the probability density function of the individual multivariate Gaussians at that point.
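The leave-one-out procedure with blended covariance matrices can be sketched as below. This is a minimal illustration rather than the analysis code used here: the function names and the synthetic demonstration data are hypothetical, and the use of full (rather than diagonal) covariance matrices is an assumption, since the text does not specify it. The 0.9 and 0.1 weights follow the description above.

```python
import numpy as np


def gaussian_log_density(x, mean, cov):
    """Log of the multivariate Gaussian density evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2.0 * np.pi) + logdet + mahalanobis_sq)


def loo_gaussian_classify(X, y, w_group=0.9, w_global=0.1):
    """Leave-one-out classification with per-class Gaussians.

    Each class covariance is w_group * (class covariance) +
    w_global * (covariance of all training points), which keeps the
    matrix non-singular even with only three or four repeats per class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    predictions = []
    for i in range(len(X)):
        keep = np.arange(len(X)) != i          # hold out sample i
        X_train, y_train = X[keep], y[keep]
        global_cov = np.cov(X_train, rowvar=False)
        scores = []
        for c in classes:
            X_c = X_train[y_train == c]
            cov = w_group * np.cov(X_c, rowvar=False) + w_global * global_cov
            scores.append(gaussian_log_density(X[i], X_c.mean(axis=0), cov))
        predictions.append(classes[int(np.argmax(scores))])
    return np.array(predictions)


# Synthetic, well-separated demonstration data (not measured values):
# four classes with four repeats each in three dimensions.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0],
                    [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.2, size=(4, 3)) for c in centres])
y_demo = np.repeat(np.arange(4), 4)
pred_demo = loo_gaussian_classify(X_demo, y_demo)
```

Working with log-densities avoids numerical underflow; exponentiating and normalising the scores per sample would give likelihood values of the kind summarised in a heatmap.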
For each protein class, the likelihood was averaged across the four experimental repeats and the resulting values were normalised to one. Fig. 3e shows a heatmap of the calculated likelihoods for assigning proteins into available classes with the actual protein being measured on the vertical axis and the protein it is likely to be identified as on the horizontal axis. We observed that, individually, 33 out of 40 samples were classified correctly. Moreover, it can be seen that on average proteins are likely to be assigned to the correct class with high confidence.
The above estimates were arrived at by assuming that the errors in the measurements in each dimension were normally distributed, so that the protein classes can be approximated by multivariate Gaussian distributions. In order to improve our analysis and devise an analysis strategy that is not making an assumption about the distribution of the errors, we constructed a random forest classifier. As before, leave-one-out cross-validation was used on all 40 samples. In order to reduce variance, each random forest was trained with 1000 decision trees that were built using bootstrapping and with only 2 out of 3 variables selected at random to build each tree. The classification was performed using predictions by these ensemble models and, subsequently, predictions by all individual trees in the ensembles were collected to quantify the confidence of the ensemble model in making the predictions. For each group of four samples corresponding to the same protein class, the average number of trees in the ensemble predicting each target class were taken and normalised to sum to one for each protein.
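A comparable leave-one-out random forest analysis could be set up with scikit-learn as sketched below. This is an illustration under stated assumptions and not the code used in this work: the function name and the synthetic data are hypothetical, and scikit-learn's `max_features=2` restricts the candidate variables at every split rather than once per tree, which differs from the per-tree selection described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def loo_forest_heatmap(X, y, n_trees=1000, seed=0):
    """Leave-one-out random forest; returns (classes, heatmap).

    heatmap[i, j] is the mean fraction of trees assigning samples of true
    class i to class j, so every row sums to one.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    fractions = np.zeros((len(X), len(classes)))
    for i in range(len(X)):
        keep = np.arange(len(X)) != i          # hold out sample i
        forest = RandomForestClassifier(
            n_estimators=n_trees, max_features=2,
            bootstrap=True, random_state=seed)
        forest.fit(X[keep], y[keep])
        # Fully grown trees have pure leaves, so the averaged tree
        # probabilities coincide with the fraction of trees voting per class.
        fractions[i] = forest.predict_proba(X[i:i + 1])[0]
    # Average the vote fractions over the repeats of each true class.
    heatmap = np.vstack([fractions[y == c].mean(axis=0) for c in classes])
    return classes, heatmap


# Synthetic, well-separated demonstration data (not measured values).
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0],
                    [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.2, size=(4, 3)) for c in centres])
y_demo = np.repeat(np.arange(4), 4)
# A small forest keeps the demonstration fast; the text uses 1000 trees.
classes_demo, heatmap_demo = loo_forest_heatmap(X_demo, y_demo, n_trees=50)
```

The diagonal of the returned matrix reports how confidently each class is recovered, in the same layout as a true-versus-predicted heatmap.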
Finally, a heatmap summarising the results was constructed, similarly showing the actual protein being measured on the vertical axis and the protein it is likely to be identified as on the horizontal axis (Fig. 3f). The results illustrate that the model predicts the correct class of proteins with high confidence. Moreover, on the individual level, the random forest model misclassified only 4 out of 40 samples, demonstrating a superior performance to the multivariate Gaussian model. This shows that highly accurate identification of proteins is possible even when no assumptions are made about the underlying distributions of measurement errors or data structure. The four misclassification events corresponded to two ubiquitin, one β-casein and one glucose oxidase sample, likely originating from the close similarity of their multidimensional signatures (Fig. 3d). The remaining samples of these three proteins were identified correctly, which ensured that, on average, all proteins were identified correctly, as seen in Fig. 3f.
Collectively, these results suggest that a model that makes no assumptions about the underlying data structure performs more accurately than the model that assumed that measurement noise was Gaussian distributed. Given the limited amount of training data for each test case (n = 39 samples), one possible strategy to improve the accuracy of the current model further is to train the model multiple times by each time sampling only some of the training points and create an ensemble classifier.37 We employed this strategy and trained 100 classifiers each time randomly sampling 80% of the data. The models were then combined by setting the final prediction to be the mode of these 100 independently trained models. Using this strategy, we observed a small improvement to 37 proteins identified correctly. Even though the improvement in the current data set was small, more generally, such an approach provides a valuable strategy for reducing uncertainty in predictions and their sensitivity to outliers.37
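The subsampling ensemble described above can be illustrated with a short sketch. To keep the example self-contained, a simple nearest-centroid classifier stands in for the actual base model, which is a swapped-in simplification. The function names and synthetic data are hypothetical, and sampling without replacement is an assumption, as the text does not state the sampling scheme.

```python
from collections import Counter

import numpy as np


def nearest_centroid_fit(X, y):
    """Stand-in base model: store one centroid per class."""
    classes = np.unique(y)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def nearest_centroid_predict(model, x):
    classes, centroids = model
    distances = np.linalg.norm(centroids - x, axis=1)
    return classes[int(np.argmin(distances))]


def subsample_ensemble_predict(X, y, x_new, n_models=100, fraction=0.8, seed=0):
    """Train n_models on random subsets and return the modal prediction."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    n_keep = int(round(fraction * len(X)))
    votes = []
    for _ in range(n_models):
        # Draw 80% of the training points without replacement (assumption).
        idx = rng.choice(len(X), size=n_keep, replace=False)
        model = nearest_centroid_fit(X[idx], y[idx])
        votes.append(nearest_centroid_predict(model, x_new))
    # The final prediction is the mode of the individual predictions.
    return Counter(votes).most_common(1)[0][0]


# Synthetic demonstration data (not measured values).
rng_demo = np.random.default_rng(1)
centres = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
X_demo = np.vstack([c + rng_demo.normal(scale=0.2, size=(4, 3)) for c in centres])
y_demo = np.repeat(np.arange(3), 4)
label_demo = subsample_ensemble_predict(X_demo, y_demo, np.array([9.5, 0.3, 0.1]))
```

Because each model sees a different subset, a single outlying training point influences only some of the votes, which is the source of the reduced sensitivity to outliers.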
To this effect, we first derived relationships that could be used to predict the sequence composition of each sample from its fingerprint. Indeed, the measured fluorescent signals were constructed in a manner where they can be expected to predominantly originate from the tyrosine, tryptophan and lysine residues of the proteins (Materials and methods) or, in the case of the hydrodynamic radius, be linked to the molecular weight of the protein. The observed correlations between the measured fluorescent signals and the amino acid content of the proteins are shown in Fig. 4a and b. As before, in order to eliminate concentration dependence, ratios between the measured signals were used. We note that these panels include all 40 samples that were studied, with all the points also used when estimating the best-fit line (dotted line). However, when the actual identification process was performed (next paragraph), each time, the test protein as well as all other proteins of the same type were excluded from the fit to ensure that there was no information leakage. Fig. 4c additionally outlines the relationship between the measured hydrodynamic radius and the molecular weight of the proteins. We modelled all three relationships as linear regressions with zero intercepts and estimated the gradient of each line by ordinary least squares (Fig. 4a–c, dotted lines). It is possible that when more abundant data are available, more nuanced relationships between the measured signals and the sequence-specific quantities can be learned. However, for the current analysis we used the prior of a linear relationship as the use of a model with a larger number of parameters may have resulted in overfitting.
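For a linear model with zero intercept, y = kx, the ordinary least squares gradient has the closed form k = Σxᵢyᵢ / Σxᵢ². The sketch below illustrates this together with the exclusion of one protein class before fitting. The function names and the toy numbers are hypothetical and are not taken from the measured data.

```python
def zero_intercept_slope(x, y):
    """Ordinary least squares gradient of y = k * x (no intercept term)."""
    sum_xx = sum(xi * xi for xi in x)
    if sum_xx == 0:
        raise ValueError("x must contain at least one non-zero value")
    return sum(xi * yi for xi, yi in zip(x, y)) / sum_xx


def slope_excluding_class(x, y, labels, held_out):
    """Refit the gradient after removing every point of the held-out class."""
    kept = [(xi, yi) for xi, yi, label in zip(x, y, labels) if label != held_out]
    if not kept:
        raise ValueError("no points left after excluding the held-out class")
    kept_x, kept_y = zip(*kept)
    return zero_intercept_slope(kept_x, kept_y)


# Toy values for illustration only.
x_demo = [1.0, 2.0, 3.0, 10.0]
y_demo = [2.0, 4.0, 6.0, 50.0]
labels_demo = ["A", "A", "B", "C"]
slope_all = zero_intercept_slope(x_demo, y_demo)
slope_without_c = slope_excluding_class(x_demo, y_demo, labels_demo, "C")
```

In this toy case the outlying class "C" pulls the gradient well away from 2, which shows why a test protein left in the fit would leak information into its own identification.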
Next, the derived relationships (Fig. 4a–c) were used to convert the measured three-dimensional signature of our test samples into their predicted amino acid ratios and molecular weights. To eliminate any information leakage, we re-fitted the linear regression after excluding all the measurements that involved the test sample, reducing the size of the data that was used for fitting down to 36 points. Following this step, the z-score of the measured sample being a particular protein in the database was calculated by using the estimated sequence-specific properties of the protein that the sample was assumed to be as the mean value and the measurement noise as the standard deviation when defining the clusters. The heatmap describing the probability of the test sample being a specific protein is shown in Fig. 4d, with the data arranged such that the correct sample appeared on the diagonal of the matrix. These data show that the samples that corresponded to proteins that followed the linear relationship closely (e.g. sample F, ovalbumin) achieved high identification accuracy, while samples corresponding to proteins that did not follow the relationship so well (e.g. sample C, β-casein) showed substantially poorer identification performance (Fig. 4d). Individually, 21 out of the 40 samples were identified correctly. Moreover, when the probability estimates from the four repeats performed on each protein were combined, 7 out of the 10 proteins were identified correctly (Fig. 5). These results illustrate that not only can the multidimensional signatures be used for classifying proteins into pre-determined clusters (Fig. 3e and f), it is also possible to convert the measured signals into absolute sequence-specific parameters and through this process identify the test samples.
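The z-score based identification step can be sketched as follows. This is an illustration under several assumptions and not the analysis code used here: the function name, the database entries and all numerical values are hypothetical; each property is assumed to be predicted as gradient × measured signal; and the conversion of z-scores into probabilities through independent Gaussian densities normalised over the database is one plausible reading of the procedure.

```python
import math


def identification_probabilities(signature, slopes, database, noise_sd):
    """Probability of a measured signature matching each database protein.

    signature: measured values (two signal ratios and a hydrodynamic radius).
    slopes: zero-intercept gradients converting each signal to a property.
    database: dict mapping a protein name to its sequence-derived properties.
    noise_sd: measurement noise (standard deviation) per property.
    """
    predicted = [s * k for s, k in zip(signature, slopes)]
    chi_sq = {}
    for name, properties in database.items():
        z = [(p - m) / sd for p, m, sd in zip(predicted, properties, noise_sd)]
        chi_sq[name] = sum(zi * zi for zi in z)
    # Subtract the smallest value before exponentiating to avoid underflow.
    best = min(chi_sq.values())
    weights = {name: math.exp(-0.5 * (c - best)) for name, c in chi_sq.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical database and measurement, for illustration only:
# (first amino acid ratio, second amino acid ratio, molecular weight in kDa).
database_demo = {
    "protein A": (0.2, 0.5, 14.0),
    "protein B": (0.8, 1.1, 66.0),
}
probabilities_demo = identification_probabilities(
    signature=(0.75, 1.0, 6.5),
    slopes=(1.0, 1.0, 10.0),
    database=database_demo,
    noise_sd=(0.1, 0.1, 5.0),
)
best_match = max(probabilities_demo, key=probabilities_demo.get)
```

Multiplying the per-repeat probabilities for the same protein before normalising is one way in which the estimates from several repeats could be combined.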
Fig. 5 Comparison of the performance of the protein classification (Fig. 3) and identification (Fig. 4) strategies. When identifying a measured sample directly from its sequence, samples were identified correctly on 53% of the occasions or on 70% of the occasions when the results were averaged across the four repeats performed on each sample. When pre-determined fingerprints were used, proteins were classified correctly on 83% of the occasions or on 100% of the occasions when the results were averaged across the repeats. The red dotted line corresponds to the case where the classification or identification was performed by a process of random guessing.
Analysing the misidentification events in more detail, we noticed that two proteins, β-casein and transferrin, were never identified correctly. This effect likely originated from these proteins being among the most significant outliers from the approximated linear relationships (Fig. 4a–c). While we had chosen to use these linear models for simplicity, it is clear that this strategy cannot capture the full nuance. The effect can be particularly pronounced for the UV-fluorescent signals, where not all tryptophan and tyrosine residues contribute to the emitted fluorescent signal equally; instead, their contribution is defined by their local environment and exposure to the solvent. As such, the relationships between the actual amino acid ratios and measured fluorescent signals are likely more complex functions that also involve the protein fold, and prior insight into such relationships is one of the parameters that would allow us to achieve an enhanced identification accuracy.
Because, in a representative practical example, a random set of proteins does not lie at regularly spaced intervals, this estimate can be viewed as an upper bound for the resolution capacity. We therefore speculate that our demonstrated multidimensional profiling strategy is most useful when only a handful of proteins are present. This is the case, for instance, when a complex mixture has first been purified on a 2D-gel that allows additional information about the protein to be obtained, such as its electrophoretic mobility. We further note that significant improvements on the currently demonstrated resolution capability are possible. This objective could be achieved either by increasing the accuracy of individual measurements or by incorporating additional dimensions, such as the charge of the protein, which could be accessed by an on-chip electrophoresis step.40,41
Footnotes
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d0lc01148g
‡ Current address: Department of Chemistry, Lanzhou University, Lanzhou, Gansu, 730000, P. R. China.
§ These authors contributed equally.
¶ Deceased September 8, 2019.
This journal is © The Royal Society of Chemistry 2021