Ziqi Wang^a and Kenry*^abc
^a Department of Pharmacology and Toxicology, R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ 85721, USA. E-mail: kenry@arizona.edu
^b Clinical and Translational Oncology Program and Skin Cancer Institute, University of Arizona Cancer Center, University of Arizona, Tucson, AZ 85721, USA
^c BIO5 Institute, University of Arizona, Tucson, AZ 85721, USA
First published on 29th April 2025
Interrogation of the secondary structures of proteins is essential for designing and engineering more effective and safer protein-based biomaterials and other classes of theranostic materials. Protein secondary structures are commonly assessed using circular dichroism spectroscopy, followed by relevant downstream analysis using specialized software. As many proteins have complex secondary structures beyond the typical α-helix and β-sheet configurations, and the derived secondary structural contents are significantly influenced by the selection of software, estimations acquired through conventional methods may be less reliable. Herein, we propose the implementation of a machine-learning-based approach to improve the accuracy and reliability of the classification of protein secondary structures. Specifically, we leverage supervised machine learning to analyze the circular dichroism spectra and relevant attributes of 112 proteins to predict their secondary structures. Based on a range of spectral, structural, and molecular features, we systematically evaluate the predictive performance of numerous supervised classifiers and identify optimal combinations of algorithms with descriptors to achieve highly accurate and precise estimations of protein secondary structures. We anticipate that this work will offer a deeper insight into the development of machine-learning-based approaches to streamline the delineation of protein structures for different biological and biomedical applications.
Over the years, many microscopic and spectroscopic techniques have been employed to better understand protein structures. For example, cryo-electron microscopy is actively being utilized to resolve the three-dimensional (3D) structures of proteins at near-atomic resolution.16,17 Fluorescence spectroscopy is commonly used to examine the tertiary structure of proteins,18,19 while their secondary structure is widely characterized using near- and far-ultraviolet (UV) circular dichroism (CD) spectroscopy.20,21 The secondary structure of a protein is the local spatial conformation of its polypeptide backbone and can be used to predict its overall 3D structure. In general, protein secondary structure can be broadly classified into two predominant types, i.e., alpha helix (α-helix) and beta sheet (β-sheet). The α-helix is a right-handed spiral structure within a single polypeptide chain, while the β-sheet comprises at least two adjacent stretches of polypeptide chain in a fully extended conformation. It is, nevertheless, noteworthy that other, less common types of secondary structure exist too. These include the beta turn (β-turn) or beta bend (β-bend) and the omega loop, which are non-regular and non-repeating secondary structural motifs. Also, while a protein may assume a predominant secondary structure, it may contain certain percentages of other configurations as well. For instance, although the secondary structure of human serum albumin is predominantly α-helix, accounting for 67%, this protein also has 10% turns and 23% random coils.22 The presence of these different secondary structural components is essential for regulating the activities and functionalities of proteins, including protein folding and protein–ligand interactions.
One of the most common techniques to elucidate protein secondary structure is CD spectroscopy.23–25 In particular, this optical spectroscopic technique relies on the difference in the absorption of right- and left-circularly polarized light by a protein to infer its secondary structure. Owing to the distinct arrangements of the polypeptide chains in α-helix and β-sheet structures, the CD spectra of these structures are unique. However, as many proteins have complex or less common secondary structures, it is not always possible to identify these structures directly from the obtained CD spectra. To derive the different constituents of the secondary structure of a particular protein, the acquired protein CD spectrum is typically analyzed with specialized software leveraging distinct algorithms, such as K2D2, K2D3, and BeStSel.21,26,27 For instance, K2D2 uses a self-organizing map algorithm, which is a form of neural network, to derive the secondary structure maps of proteins and estimate the α-helix and β-strand contents. K2D3, an improvement over K2D2, capitalizes on a k-nearest neighbors algorithm to predict protein secondary structures. It is, nonetheless, important to note that the estimated secondary structural components are heavily influenced by the choice of software, which may render the eventual estimations less reliable.
In recent years, there have been active explorations into the development and applications of machine learning to streamline the design and engineering of biomaterials.28–35 Specifically, numerous machine-learning-based techniques have been developed to uncover previously unknown biomaterials as well as to predict their physicochemical properties, particularly from their optical spectra.36–39 For example, in a recent study, a one-dimensional (1D) convolutional neural network was implemented to distinguish between different organic compounds based on their near-infrared spectra.40 In a separate work, a supervised random forest classifier was used to identify organic compounds from their visible light spectra.41 With its capability to recognize subtle spectral attributes and estimate the compositions and structures of samples from their optical spectra, machine learning has emerged as a promising tool to examine a wide range of spectral data, including CD spectra.
In this study, we sought to leverage supervised machine learning to analyze the far-UV (190–260 nm) CD spectra of 112 proteins and identify their predominant secondary structures. We systematically assessed the predictive performance of several supervised classifiers using numerous spectral, structural, and molecular features. Ultimately, we demonstrated optimal combinations of supervised learning algorithms and descriptors to realize reliable and computationally cost-effective predictions of protein secondary structures.
Proteins with primarily α-helical structure typically have two negative peaks at about 208 and 222 nm as well as a positive peak at about 190 nm in their far-UV CD spectra (Fig. 1a). In contrast, the CD spectra of proteins with a predominantly β-sheet structure have a negative peak at around 210–220 nm and a positive peak at approximately 195–205 nm. Proteins with other secondary structures, such as those with less common secondary structures or those with a mixture of α-helical and β-sheet structures, display more complex CD spectra with fewer readily identifiable spectral features. We next examined the secondary structural contents of the three classes of proteins (Fig. 1b–f). As anticipated, the α-helical proteins had the highest content of α-helix (Fig. 1b), but the lowest content of β-strand (Fig. 1c), as compared to the other two classes of proteins. In contrast, the α-helix and β-strand contents of the β-sheet proteins were the lowest and highest, respectively, among the three protein classes. A deeper analysis of other secondary structural components of the proteins revealed that the α-helical proteins had the lowest loop and bend contents (Fig. 1d and e), while those of proteins with β-sheet and other secondary structures were not statistically significantly different. Next, we sought to assess the molecular properties of the three classes of proteins, particularly their molecular weight, number of residues, and mean residue weight. Intriguingly, we noted that these properties were comparable across the three protein classes (Fig. 1g–i), suggesting that there may not be a direct correlation between these molecular properties and the secondary structure of proteins.
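As a rough illustration of these hallmark spectral signatures, the minimal sketch below assigns a tentative class to a single CD spectrum from the sign of its ellipticity at the characteristic wavelengths. The function name, thresholds, and wavelength choices are our own illustrative assumptions, not the classification procedure used in this study.

```python
import numpy as np

def guess_dominant_structure(wavelengths, ellipticity):
    """Heuristic sketch: infer the predominant secondary structure of one
    protein from the signs of its far-UV CD bands (illustrative only)."""
    def value_at(nm):
        # Ellipticity at the sampled wavelength closest to `nm`
        return ellipticity[np.argmin(np.abs(wavelengths - nm))]

    # alpha-helix: negative bands near 208 and 222 nm, positive band near 190 nm
    if value_at(208) < 0 and value_at(222) < 0 and value_at(192) > 0:
        return "alpha-helix"
    # beta-sheet: a negative band around 210-220 nm, positive band near 195-205 nm
    if value_at(215) < 0 and value_at(200) > 0:
        return "beta-sheet"
    # anything else: mixed or less common secondary structures
    return "other"
```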
After statistically assessing the structural and molecular features of all proteins, we sought to evaluate if supervised learning algorithms could be employed to predict the secondary structures of these proteins. Seven distinct algorithms were selected, namely logistic regression, random forest, gradient boosting, extreme gradient boosting, k-nearest neighbors, support vector machine, and neural network. For all supervised learning analyses, the datasets were randomly split into 75% training and 25% testing sets, which were used for classifier training and testing, respectively.
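A minimal sketch of this setup is shown below, assuming scikit-learn and the xgboost package (the study does not name its software stack) and a synthetic stand-in for the 112-protein dataset; the 75/25 split mirrors the one described above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier  # extreme gradient boosting

# Synthetic stand-in for the real dataset: 112 proteins, 8 features, 3 classes
X, y = make_classification(n_samples=112, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# 75% training / 25% testing split, stratified by protein class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "extreme gradient boosting": XGBClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
    "support vector machine": SVC(probability=True, random_state=0),
    "neural network": MLPClassifier(max_iter=2000, random_state=0),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.3f}")
```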
To begin with, we were motivated to examine the predictive capacity of the algorithms based on various protein structural and molecular features (Fig. 2). Here, all datasets comprised eight features, i.e., molecular weight, number of residues, mean residue weight, α-helix, β-strand, loop, bend, and bonded turn contents, which collectively served as the inputs to the classifiers, while the predicted protein classes were the target outputs. To ascertain if particular features had a higher correlation with the target protein classes, we quantitatively scored and ranked the eight structural and molecular features (Fig. 2a). Based on information gain ratio, which is one of the common metrics used to analyze variable importance, β-strand and α-helix contents emerged as the two highest ranked attributes. In fact, with gain ratios of 0.516 and 0.492, respectively, which were substantially higher than those of the other features, β-strand and α-helix contents were highly discriminatory and could be leveraged to delineate the protein classes effectively. Analysis of the distribution of all proteins in a two-dimensional (2D) space using t-distributed stochastic neighbor embedding (t-SNE) revealed that the three protein classes were well separated with negligible overlap (Fig. 2b). Leveraging the training dataset, we proceeded to tune the different hyperparameters of all supervised learning algorithms to produce the highest classification metrics (ESI Fig. S1†). We observed that, except for k-nearest neighbors, the other six algorithms had outstanding predictive capacity, where all values of area under the curve (AUC) were above 0.96. Specifically, gradient boosting, extreme gradient boosting, and random forest had AUCs of more than 0.98 and classification accuracy, F1 values, precision, and recall of at least 0.94. A closer examination of the classification performance of the best performing algorithm, i.e., gradient boosting, unveiled that all 24 α-helical proteins, 28 β-sheet proteins (out of a possible 29), and 27 proteins with other secondary structures (out of a possible 31) were correctly classified. Based on the optimized algorithm hyperparameters and testing dataset, we then assessed the algorithm classification metrics, and noted that gradient boosting, random forest, and extreme gradient boosting were the best performing classifiers (Fig. 2c). This corroborated our observation during classifier training. The three algorithms showed perfect AUCs of 1.000. In fact, as the best performing classifier, gradient boosting predicted all protein classes correctly, which was reflected in the values of its accuracy, precision, recall, and F1 score, as well as its confusion matrix (Fig. 2d). It is important to highlight that, as compared to their predictive performance against the training dataset, most of the algorithms showed markedly improved classification capacity against the testing dataset.
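The gain ratio is not exposed directly by scikit-learn; the sketch below shows one way it could be computed for a continuous feature via equal-frequency binning. The binning choice is our own assumption, as the study's exact discretization is not stated.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, labels, n_bins=4):
    """Information gain ratio of one continuous feature with respect to the
    class labels, after equal-frequency binning (binning choice is ours)."""
    edges = np.unique(np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1]))
    binned = np.digitize(feature, edges)

    # Information gain: H(class) - H(class | binned feature)
    conditional = sum((binned == b).mean() * entropy(labels[binned == b])
                      for b in np.unique(binned))
    info_gain = entropy(labels) - conditional

    # Normalize by the split information to penalize many-valued splits
    split_info = entropy(binned)
    return info_gain / split_info if split_info > 0 else 0.0

# Example ranking over a feature matrix X (n_proteins x n_features) and labels y:
# ranked = sorted(range(X.shape[1]), key=lambda j: gain_ratio(X[:, j], y), reverse=True)
```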
Next, we aimed to assess if similar excellent predictive performance demonstrated by most of the supervised learning algorithms could be maintained if the structural and molecular features were replaced with spectral features (Fig. 3). Here, the far-UV CD spectra of all proteins from 190 nm to 260 nm were employed. With a step size of 1 nm, both the training and testing datasets comprised 71 features, where each feature corresponded to a specific wavelength. Quantitative scoring of all spectral features based on information gain ratio did not reveal features with especially high discriminatory power (Fig. 3a). However, we still noted that spectral attributes at wavelengths between 225 and 237 nm, 206 and 209 nm, as well as 191 and 193 nm ranked highly as variables with higher correlation with the target protein classes, while those at wavelengths from 250 to 260 nm appeared to have no correlation. Through t-SNE, we observed the distinct 2D distribution of the three protein classes (Fig. 3b). Based on the tuned algorithm hyperparameters using the training dataset, gradient boosting emerged as the best performing classifier with an AUC of 0.853, an accuracy of 0.690, and a precision of 0.703 (ESI Fig. S2†). This classifier correctly predicted 19 α-helical proteins (out of a possible 24), 19 β-sheet proteins (out of a possible 29), and 20 proteins with other secondary structures (out of a possible 31). Support vector machine and neural network were the next best performing classifiers with AUCs of at least 0.83 and accuracy and precision values above 0.660. Compared with those obtained using the structural and molecular features, the quantitative metrics of most classifiers decreased substantially when the spectral features were used. This trend was also reflected in the classification performance of all algorithms against the testing dataset (Fig. 3c). Assessment of the confusion matrix of gradient boosting, which was the best performing classifier, unveiled that five α-helical proteins (out of a possible seven), five β-sheet proteins (out of a possible nine), and eight proteins with other secondary structures (out of a possible 12) were correctly classified (Fig. 3d).
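A sketch of this spectral-feature pipeline is given below, again assuming scikit-learn and with random numbers standing in for the real 190–260 nm spectra; with real data, X_cd would hold the measured ellipticities at each of the 71 wavelengths.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
wavelengths = np.arange(190, 261)                # 71 features, one per nm
X_cd = rng.normal(size=(112, wavelengths.size))  # stand-in for measured CD spectra
y = rng.integers(0, 3, size=112)                 # 0 = alpha, 1 = beta, 2 = other

X_tr, X_te, y_tr, y_te = train_test_split(
    X_cd, y, test_size=0.25, stratify=y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Macro-averaged one-vs-rest AUC for the three-class problem
auc = roc_auc_score(y_te, clf.predict_proba(X_te), multi_class="ovr", average="macro")
print(f"AUC = {auc:.3f}")
print(confusion_matrix(y_te, clf.predict(X_te)))
```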
Although the secondary structures of proteins are typically inferred through their CD spectra, it is noteworthy that not all spectral information is useful for estimating the secondary structural contents. For instance, the α-helical structure of proteins can be identified from their CD spectra based on only the two negative peaks at approximately 208 and 222 nm and a positive peak at around 190 nm. This suggests that the spectral components at other wavelengths contribute less to the delineation of this class of proteins. Additionally, depending on the experimental parameters and conditions (e.g., spectral acquisition step size, quality of instruments, and so on), the acquired CD spectra may contain interfering signals. All these factors may collectively explain the relatively lower values of the quantitative metrics of all supervised classifiers when the full spectral descriptors were employed in place of the structural and molecular descriptors.
Next, as an indirect comparison, we were motivated to assess the performance of some of the commonly used platforms in predicting protein secondary structures. To this end, we sought to infer the secondary structures of several proteins by estimating their structural contents using K2D2 and K2D3 (ESI Excel File 2†). Since both platforms can only estimate the percentages of α-helix and β-sheet from CD spectra, we specifically selected proteins with predominantly those two secondary structures from the testing dataset for evaluation. Intriguingly, we noted that the estimated secondary structural contents were far from satisfactory. For instance, for all seven α-helical proteins, the calculated percentages of α-helix were substantially lower than those of β-sheet, suggesting that the platforms classified these proteins as those with predominantly β-sheet configuration. For the nine β-sheet proteins, despite the considerably higher calculated percentages of β-sheet than those of α-helix, we observed a low fidelity between the predicted and input CD spectra. It is also worth highlighting that, since these platforms can only analyze one CD spectrum at a time, the process is time-consuming, especially when dealing with a large number of spectra.
To further understand the classifier predictive ability, we next sought to assess how the classification metrics of all supervised learning algorithms would be influenced by adding structural and molecular descriptors to the full spectral features (Fig. 4). Here, both the training and testing datasets had 79 features. As anticipated, β-strand and α-helix contents, followed by the spectral features at 225–237 nm, 206–209 nm, and 191–193 nm, were the most highly ranked variables (Fig. 4a). Consistent with our previous observations, the three classes of proteins were distinctly separated with minimal overlap in a 2D space (Fig. 4b). Leveraging the optimized algorithm hyperparameters and the training dataset, we noted that extreme gradient boosting emerged as the best performing algorithm with an AUC of 0.980 and classification accuracy, F1 score, precision, and recall of around 0.952 (ESI Fig. S3†). Gradient boosting and random forest were the next best performing algorithms with AUCs of 0.975 and 0.945 ± 0.005, respectively. In contrast, the k-nearest neighbors algorithm was the worst performing algorithm with an AUC of 0.589. Analysis of the confusion matrix of extreme gradient boosting showed that this algorithm correctly classified all α-helical proteins, 28 β-sheet proteins (out of a possible 29), and 28 proteins with other secondary structures (out of a possible 31). Based on the tuned hyperparameters, we then characterized the algorithm predictive capacity using the testing dataset. Similarly, extreme gradient boosting and gradient boosting showed the highest classification metrics with AUCs of at least 0.998 and accuracy, F1 values, precision, and recall of at least 0.964 (Fig. 4c). Meanwhile, k-nearest neighbors was again the worst performing algorithm. Analysis of the confusion matrix of extreme gradient boosting revealed that this algorithm correctly classified all α-helical and β-sheet proteins and missed only one protein with other secondary structures (Fig. 4d).
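Combining the two descriptor sets amounts to concatenating the feature matrices column-wise; a brief sketch, with random stand-ins for the real arrays and the same library assumptions as above, is shown below.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_cd = rng.normal(size=(112, 71))      # 71 spectral features (190-260 nm)
X_struct = rng.normal(size=(112, 8))   # 8 structural/molecular descriptors
y = rng.integers(0, 3, size=112)       # stand-in protein class labels

X_all = np.hstack([X_cd, X_struct])    # 79 features per protein
assert X_all.shape == (112, 79)

# Cross-validated accuracy of extreme gradient boosting on the combined features
print(cross_val_score(XGBClassifier(random_state=0), X_all, y, cv=5).mean())
```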
It is intriguing to note that the introduction of the eight structural and molecular descriptors significantly improved the quantitative metrics of most supervised classifiers. For instance, against the testing datasets, the AUCs of extreme gradient boosting, gradient boosting, and logistic regression increased from 0.724, 0.837, and 0.790, respectively, to higher than 0.980 (Fig. 3c and 4c). In fact, all quantitative metrics of the seven classifiers were enhanced considerably, except for the precision of k-nearest neighbors. Against the training datasets, however, all the quantitative metrics of k-nearest neighbors decreased substantially, while the reverse trend was noted for the other six classifiers (ESI Fig. S2 and S3†).
As highlighted previously, the secondary structures of proteins are commonly estimated from their characteristic CD spectra. While the full CD spectra with many spectral attributes contain a huge amount of information, some of this information is redundant and not all of it is essential for delineating protein secondary structures. Moreover, some of the full CD spectra may contain noise, which may be introduced during the spectral acquisition process. This may then complicate the characterization of protein structures. As such, it is important to extract only the most discriminatory features from the full CD spectra to improve the quality and reliability of protein secondary structure identification.
To this end, in this part of our study, we sought to transform the high-dimensional full spectral datasets into their low-dimensional counterparts and assess the classification performance of all classifiers against these newly generated datasets (Fig. 5). To start with, through principal component analysis (PCA), we reduced the dimensionality of all CD spectra from 71 wavelength components to six principal components. The number of principal components was specifically selected to account for more than 90% of spectral data variance (ESI Fig. S4†). With gain ratios of 0.135, 0.102, and 0.099, the first, fifth, and second principal components, i.e., PC1, PC5, and PC2, respectively, were the most important features (Fig. 5a). Although PC2 ranked lower than PC5 in terms of gain ratio, it is noteworthy that PC2 captured about 26% of spectral data variance while PC5 accounted for only about 5%. As such, capitalizing on only PC1 and PC2 to create a score plot, we visualized the 2D distribution of the three protein classes, where a high degree of separation with some overlap was noted (Fig. 5b). Further interrogation of protein distributions using PC1 and PC2 showed the highly distinct separation of α-helical and β-sheet proteins (Fig. 5c). Nonetheless, the distributions of proteins with other secondary structures overlapped to a certain degree with those of the other two protein classes. Assessment of the loading spectra of both PC1 and PC2 revealed unique trends in the contributions of each spectral attribute to the two principal components (Fig. 5d).
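A minimal sketch of this PCA step, assuming scikit-learn and a random stand-in for the spectral matrix, is shown below; on the real spectra, the six retained components would capture more than 90% of the variance (ESI Fig. S4†).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_cd = rng.normal(size=(112, 71))   # stand-in for the 190-260 nm CD spectra

pca = PCA(n_components=6)
scores = pca.fit_transform(X_cd)               # (112, 6) PC scores per protein
print(pca.explained_variance_ratio_.cumsum())  # cumulative variance captured
loadings = pca.components_                     # (6, 71) loading spectra, one row per PC
pc1_pc2 = scores[:, :2]                        # coordinates for the 2D score plot
```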
Adopting a similar approach, we then employed the training dataset to optimize various algorithm hyperparameters and evaluated the generated classification parameters (ESI Fig. S5†). All trained classifiers showed good predictive capacity. Of all algorithms, support vector machine demonstrated the highest quantitative metrics with an AUC of 0.841 and accuracy and precision values close to 0.7. This classifier managed to correctly classify 18 α-helical proteins (out of a possible 24), 19 β-sheet proteins (out of a possible 29), and 19 proteins with other secondary structures (out of a possible 31). As the next best performing classifiers, extreme gradient boosting, random forest, and gradient boosting had AUCs of above 0.83, although random forest had much higher mean values of accuracy, F1 score, precision, and recall. Against the testing dataset, neural network showed the highest AUC of 0.839 (Fig. 5e). With this classifier, six α-helical proteins (out of a possible seven), five β-sheet proteins (out of a possible nine), and seven proteins with other secondary structures (out of a possible 12) could be correctly predicted (Fig. 5f). Gradient boosting and support vector machine had the next highest AUCs of 0.827 and 0.799 ± 0.002, respectively.
Interestingly, examining the classifier performance on the training datasets as the full spectral descriptors were switched to their dimensionally reduced counterparts, we noted that both neural network and logistic regression showed substantially improved accuracy, F1 values, precision, and recall (ESI Fig. S2 and S5†). This trend became even more apparent when analyzing the classifier performance against the testing datasets (Fig. 3 and 5). In fact, three of the evaluated classifiers, i.e., neural network, logistic regression, and k-nearest neighbors, had enhanced AUC, accuracy, precision, recall, and F1 values. Support vector machine, while having the same accuracy and recall values, also had improved AUC and precision with the use of dimensionally reduced spectral data.
It is crucial to highlight that the analysis of full spectral data using machine learning may require significant computational resources and processing time, especially if the datasets comprise a large number of entries and spectral features. The high dimensionality of these datasets may render classifier training and generalization challenging. Furthermore, for certain algorithms with numerous tunable hyperparameters, such as neural networks, computational effort and processing time increase substantially along with an increase in certain hyperparameter values (e.g., number of neurons in hidden layers). The use of dimensionality reduction methods, such as PCA, can address some of the highlighted issues to a certain extent. As illustrated in this study, most of the important spectral information (more than 90% of data variance) could still be captured with the use of only six principal components, as opposed to the full 71 spectral features. The more than 10-fold reduction in the dataset dimensionality significantly reduced the computational effort and analysis time of all classifiers. In addition, the predictive capacity of some classifiers could be augmented considerably with the use of dimensionally reduced datasets. All these collectively demonstrate the advantages of coupling dimensionality reduction techniques with supervised machine learning to enhance the prediction of protein secondary structures.
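One practical way to couple the two steps is a single pipeline in which PCA is refit on each training fold, so the dimensionality reduction never sees held-out data. A sketch under the same library assumptions and random stand-in data as above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_cd = rng.normal(size=(112, 71))   # stand-in for the full CD spectra
y = rng.integers(0, 3, size=112)    # stand-in protein class labels

# PCA (71 -> 6 features) followed by an SVM classifier, fit and evaluated as one unit
model = make_pipeline(PCA(n_components=6), SVC(probability=True, random_state=0))
print(cross_val_score(model, X_cd, y, cv=5, scoring="roc_auc_ovr").mean())
```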
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5bm00153f