Song Yin†a, Xuenan Mi†b and Diwakar Shukla*abc
aDepartment of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Urbana 61801, Illinois, USA. E-mail: diwakar@illinois.edu
bCenter for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
cDepartment of Bioengineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
First published on 13th March 2024
Peptides play a pivotal role in a wide range of biological activities by participating in up to 40% of protein–protein interactions in cellular processes. They also demonstrate remarkable specificity and efficacy, making them promising candidates for drug development. However, predicting peptide–protein complexes by traditional computational approaches, such as docking and molecular dynamics simulations, remains a challenge due to the high computational cost, the flexible nature of peptides, and the limited structural information on peptide–protein complexes. In recent years, the surge of available biological data has given rise to an increasing number of machine learning models for predicting peptide–protein interactions. These models offer efficient solutions to the challenges associated with traditional computational approaches. Furthermore, they offer enhanced accuracy, robustness, and interpretability in their predictive outcomes. This review presents a comprehensive overview of machine learning and deep learning models that have emerged in recent years for the prediction of peptide–protein interactions.
The rational design of peptide drugs is challenging and costly, owing to the limited stability of peptides and the large pool of potential target candidates. Therefore, computational methodologies that have proven effective in small molecule drug design have been adapted for modelling peptide–protein interactions (PepPIs). These computational techniques include docking, molecular dynamics (MD) simulations, and machine learning (ML) and deep learning (DL) models. Docking approaches enable exploration of peptide binding positions and poses in atomistic detail, facilitating the prediction of binding affinities.5–9 However, peptides are inherently flexible and can interact with proteins in various conformations, which often change during the binding process.10 MD simulation is another approach to modelling peptide–protein interactions. The peptide–protein binding and unbinding process can be studied thermodynamically and kinetically through MD simulations.10–18 However, sampling the complex energy landscapes associated with peptide–protein interactions typically requires intensive computational resources and time. The accuracy of both docking and MD simulations relies on knowledge of protein structures, and the limited availability of peptide–protein complex structures has restricted the utility of these two approaches.
In recent years, ML and DL models have been widely used in the field of computer-aided drug design. These models offer an alternative way to address the inherent challenges associated with docking and MD simulations in modeling PepPIs. Due to the large amount of available biological data, many ML/DL models are routinely employed to obtain sequence–function relationship, achieving comparable predictive performance to structure-based models. This is because sequence data contain evolutionary, structural and functional information across protein space. Furthermore, compared with docking and MD simulations, ML/DL models exhibit greater efficiency and generalizability. Trained ML/DL models are capable of predicting PepPIs in a single pass, but it is hard to do large-scale docking and MD simulations due to their resource-intensive and time-consuming nature. Moreover, with the development of interpretable models, DL models are no longer regarded as black boxes; they can provide valuable insights into residue-level contributions to peptide–protein binding predictions.
Previous reviews have mainly summarized ML/DL models for predicting protein–protein interactions (PPIs).19–24 They have traditionally categorized computational methods for predicting PPIs into two main classes: sequence-based and structure-based approaches. Sequence-based methods extract information only from sequence data, whereas structure-based methods rely on information derived from peptide–protein complex structures. Recently, ML/DL models have increasingly integrated both sequence and structure information to enhance their predictive performance. In this review, we systematically summarize the progress made in predicting PepPIs. From the ML perspective, we include support vector machine (SVM) and random forest (RF) models, which typically require manual feature extraction from sequence and structure datasets. In contrast, DL models, including the convolutional neural network (CNN), graph convolutional network (GCN) and transformer, automatically extract multi-layer feature representations from data. To the best of our knowledge, this is the first review to summarize ML/DL work specifically for predicting PepPIs. Fig. 1 shows a timeline illustrating the evolution of ML/DL methods in the context of PepPI predictions. Table 1 summarizes the details of the ML/DL models discussed in this review.
Abbreviations: ACC: accuracy; AUC: area under the ROC curve; AUPR: area under the precision–recall curve; MCC: Matthews correlation coefficient; SEN: sensitivity; SP: specificity; PRE: precision; NPV: negative predictive value; SR: success rate.

Model name | Baseline model | Data type and datasets | Key ideas | Model performance |
---|---|---|---|---|
SPRINT-Seq25 | SVM | Protein sequences from BioLip26 | First ML model to predict PepPIs from sequence features alone | ACC: 0.66, AUC: 0.71, MCC: 0.33, SEN: 0.64, SP: 0.68 |
PepBind27 | SVM | Protein sequences from BioLip26 | Intrinsic disorder-based features were first introduced | AUC: 0.76, MCC: 0.33, SEN: 0.32, PRE: 0.45 |
SPRINT-Str28 | RF | Protein–peptide complex sequences and structures from BioLip26 | Used structural information and employed the RF classifier | ACC: 0.94, AUC: 0.78, MCC: 0.29, SEN: 0.24, SP: 0.98 |
InterPep29 | RF | Protein–peptide complex structures from RCSB PDB30 | Predicted what region of the protein structure the peptide is most likely to bind | ACC: 0.81, SEN: 0.51 |
SPPPred31 | Ensemble: SVM, RF, KNN | Protein sequences from the BioLip database26 | Ensemble learning model was applied for effectively handling imbalanced dataset | ACC: 0.95, AUC: 0.71, MCC: 0.23, F1: 0.31, SEN: 0.32, SP: 0.96 |
Hierarchical statistical mechanical (HSM)32 | HSM | Peptide binding domain (PBD)-peptide structures from UniProt33 | Introduced bespoke HSM model to predict the affinities of peptide binding domain (PBD)–peptide interactions | AUC: 0.97 (PBD: PDZ) |
Visual34 | CNN | Protein sequences from BioLip26 | Protein sequence features were transformed into images and CNN was first applied to predict PepPIs | AUC: 0.73, MCC: 0.17, SEN: 0.67, SP: 0.69 |
BiteNetPp35 | CNN | Protein–peptide complex structures from BioLip26 | Utilized 3D CNN and protein structures directly to predict protein–peptide binding sites | AUC: 0.91, MCC: 0.49, PRE: 0.53 |
InterPepRank36 | GCN | Protein–peptide complex structures from RCSB PDB30 | Achieves high accuracy in predicting both binding sites and conformations for disordered peptides | AUC: 0.86 |
ScanNet37 | Geometric DL Architecture | Protein–peptide complex structures from Dockground38 | An end-to-end, interpretable geometric DL model that learns features directly from 3D structures | ACC: 0.88, AUC: 0.69, SEN: 0.50, PRE: 0.74 |
Struct2Graph39 | GCN and attention | Protein–peptide complex structures from IntAct,40 STRING,41 and UniProt33 | A GCN-based mutual attention classifier accurately predicting interactions between query proteins exclusively from 3D structural data | ACC: 0.99, AUC: 0.99, MCC: 0.98, F1: 0.99, SEN: 0.98, SP: 0.99, PRE: 0.99, NPV: 0.98 |
CAMP42 | CNN and self-attention | Protein–peptide complex sequences from RCSB PDB30 and DrugBank43 | Took account of sequence information of both proteins and peptides, and identified binding residues of peptides | AUC: 0.87, AUPR: 0.64 |
PepNN44 | Transformer | Protein–peptide complex sequences and structures from RCSB PDB30 | Utilized a multi-head reciprocal attention layer to update the embeddings of both peptides and proteins; transfer learning was applied to solve the limited protein–peptide complex structure issue | AUC: 0.86, MCC: 0.41 |
PepBCL45 | BERT-based contrastive learning framework | Protein sequences from the BioLip database26 | An end-to-end predictive model; contrastive learning module was used to tackle the imbalanced data issue | AUC: 0.82, MCC: 0.39, SEN: 0.32, SP: 0.98, PRE: 0.54 |
AlphaFold monomer46–48 | MSA based transformer | Protein sequences and structures from Uniclust3049 and RCSB PDB30 | Adding the peptide sequence via a poly-glycine linker to the C-terminus of the receptor monomer sequence could mimic peptide docking as monomer folding | SR: 0.75 (within 1.5 Å RMSD) in Tsaban et al.47 and SR: 0.33 (fraction of native contacts = 0.8 as cutoff) in Shanker and Sanner48 |
OmegaFold48,50 | Protein language model | Protein sequences and structures from Uniref50,51 RCSB PDB,30 CASP,52 and CAMEO53 | | SR: 0.20 (fraction of native contacts = 0.8 as cutoff) in Shanker and Sanner48 |
AlphaFold multimer48,54 | MSA based transformer | Protein complex sequences and structures from RCSB PDB30 and Benchmark 255 | Improved the accuracy of predicted multimeric interfaces between two or more proteins | SR: 0.53 (fraction of native contacts = 0.8 as cutoff) in Shanker and Sanner48 |
Fine-tuned AlphaFold56 | MSA based transformer | Peptide–MHC complex structures from RCSB PDB30 | Leveraging and fine-tuning AF2 with existing peptide–protein binding data could improve its PepPI predictions | AUC: 0.97 (class 1) and AUC: 0.93 (class 2) |
SPRINT-Seq (Sequence-based prediction of Protein–peptide Residue-level INTeraction sites) is the first ML-based predictor of peptide–protein binding sites that uses only sequence features.25 Various types of information were extracted from protein sequences to create a feature dataset, including one-hot encoded protein sequences, evolutionary information,57 predicted accessible surface area,58 secondary structures,58 and physicochemical properties.59 These features were fed into an SVM classifier to predict the label (binding or non-binding) of each residue (Fig. 2). SPRINT-Seq yielded a Matthews correlation coefficient (MCC) of 0.326, a sensitivity of 0.64 and a specificity of 0.68 on an independent test set. The importance of each feature was also evaluated; the most crucial feature for distinguishing binding from non-binding residues is the sequence evolution profile. The performance of this sequence-based technique is comparable to or better than that of structure-based models (Peptimap,60 Pepsite,61 PinUp,62 VisGrid63) for peptide-binding site prediction.
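The residue-level metrics quoted here and in Table 1 all derive from a binary confusion matrix. The following minimal sketch shows how sensitivity, specificity and MCC are computed; the function names and the toy labels are illustrative assumptions and are not taken from any of the reviewed tools.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels (1 = binding, 0 = non-binding)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def sensitivity(tp, fn):
    """Fraction of true binding residues that were recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    """Fraction of non-binding residues correctly rejected."""
    return tn / (tn + fp) if (tn + fp) else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical per-residue labels for a six-residue protein.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn), specificity(tn, fp), mcc(tp, tn, fp, fn))
```

Because binding residues are a small minority, accuracy and specificity can be high while sensitivity and MCC stay low, which is the pattern seen for several models in Table 1.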
To improve the accuracy of sequence-based prediction, Zhao et al. introduced intrinsic disorder as a feature within the sequence representation.27 Peptides that participate in peptide–protein interactions exhibit the characteristic attributes of short linear motifs, which are primarily found in intrinsically disordered regions (IDRs). These attributes include short length, flexible structure and weak binding affinity.64 In addition to the novel sequence representation, they designed a consensus-based method called PepBind.27 This method combines the SVM classification model with the template-based methods S-SITE and TM-SITE.65 The consensus of these three predictors outperformed each of them individually, as well as the first sequence-based method, SPRINT-Seq.
An RF model, SPRINT-Str28 (Structure-based Prediction of Residue-level INTeraction), was developed to predict putative peptide–protein binding residues and binding sites by combining sequence-based and structure-based information. The sequence information in the input includes the position specific scoring matrix (PSSM) for all amino acids in the protein and the entropy calculated from the PSSM. Structural information includes accessible surface area (ASA) calculated by DSSP (define secondary structure of proteins),66 secondary structure (SS) calculated by DSSP,66 half-sphere exposure (HSE), which represents solvent exposure using residue contact numbers in the upward and downward hemispheres along the pseudo Cβ–Cα bond,67 and flexibility calculated by iModeS68 to describe the functional motions of proteins.69 An RF classifier was then trained and tested to predict the binding residues. The density-based spatial clustering of applications with noise (DBSCAN) algorithm70 was then applied to cluster spatially neighboring binding residues. The largest cluster was selected as the predicted binding site, with a corresponding reliability score. SPRINT-Str achieved robust performance in predicting binding residues, with an MCC of 0.293 and an area under the receiver operating characteristic curve (ROC AUC) of 0.782. For instance, when the model was tested on peptide binding to the PDZ domain of the human tyrosine phosphatase PTPN4 (PDBID: 3NFK),71 15 out of 17 binding residues were correctly predicted, and the predicted binding sites were similar to the actual binding sites. SPRINT-Str is a representative ML model that takes structural features as input and has achieved remarkable success in predicting PepPIs.
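The clustering step described above can be sketched with a plain-Python implementation of the standard DBSCAN algorithm applied to the 3D coordinates of predicted binding residues. This is only an illustration: the `eps` and `min_pts` values, the helper names, and the toy coordinates are assumptions and are not the settings used by SPRINT-Str.

```python
import math
from collections import Counter

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_neighbors)
    return labels

def largest_cluster(labels):
    """Indices of the points in the most populated non-noise cluster."""
    counts = Counter(label for label in labels if label != -1)
    if not counts:
        return []
    best = counts.most_common(1)[0][0]
    return [i for i, label in enumerate(labels) if label == best]

# Hypothetical coordinates (in angstroms) of residues predicted as binding.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0),
          (10, 10, 10), (11, 10, 10), (50, 50, 50)]
labels = dbscan(coords, eps=1.5, min_pts=2)
print(largest_cluster(labels))
```

Selecting the largest cluster discards isolated false positives, which is why a spatial clustering step can sharpen a per-residue classifier's output into a single candidate binding site.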
The structures of proteins or peptide–protein complexes can also be used directly as input to ML models. The underlying premise of this approach is that, if a PepPI resembles a well-characterized interaction surface, that surface can serve as a template for modeling the PepPI. The InterPep model29 implements this idea in four steps: mass structural alignment (MSA), feature extraction, RF classification, and clustering. A template modeling (TM) score larger than 0.5 was used to select candidate templates. Overall, InterPep accurately predicted 255 out of 502 (50.7%) binding sites for the top 1 prediction and correctly identified 348 out of 502 (69.3%) binding sites within the top 5 predictions, which demonstrates that it is a useful tool for the identification of peptide-binding sites.
In the SPPPred algorithm, the ensemble learning technique of bagging73 was employed to predict peptide binding residues. The initial step in bagging involves generating various subsets of data through random sampling with replacement, a process known as bootstrapping. For each bootstrap dataset, distinct classification models are trained, including support vector machine (SVM), K-nearest neighbors (KNN), and random forest (RF). Subsequently, for each residue, the class with the majority of votes across these models is determined as the final predicted label. This ensemble method consistently demonstrates strong and comparable performance on independent test sets, with an F1 score of 0.31, an accuracy of 0.95, and an MCC of 0.23.
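A minimal sketch of bootstrapping followed by majority voting is given below. To stay self-contained it uses a toy one-dimensional nearest-centroid learner as a stand-in for the SVM, KNN and RF classifiers; the learner, the data and all function names are assumptions made for illustration and do not reproduce SPPPred.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def train_centroid_classifier(sample):
    """Toy base learner (stand-in for SVM/KNN/RF): one centroid per class."""
    sums, counts = {}, Counter()
    for x, label in sample:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    centroids = {label: sums[label] / counts[label] for label in sums}

    def predict(x):
        # Assign the class whose centroid is closest to x.
        return min(centroids, key=lambda label: abs(x - centroids[label]))

    return predict

def majority_vote(votes):
    """Return the most frequent label among the base-learner votes."""
    return Counter(votes).most_common(1)[0][0]

def bagging_predict(data, x, n_models=25, seed=0):
    """Train n_models learners on bootstrap samples and vote on x."""
    rng = random.Random(seed)
    models = [train_centroid_classifier(bootstrap_sample(data, rng))
              for _ in range(n_models)]
    return majority_vote([model(x) for model in models])

# Hypothetical (feature, label) pairs: 1 = binding residue, 0 = non-binding.
data = [(0.10, 0), (0.20, 0), (0.15, 0), (0.30, 0),
        (0.80, 1), (0.90, 1), (0.85, 1), (0.70, 1)]
print(bagging_predict(data, 0.95), bagging_predict(data, 0.05))
```

An odd number of base learners avoids ties in the vote for a two-class problem.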
Wardah et al. applied CNNs for identifying peptide-binding sites by introducing a CNN-based method named Visual.34 In the Visual algorithm, features were extracted from protein sequences, like HSE,67 secondary structures,82 ASA,82 local backbone angles,82 PSSM57 and physicochemical properties.83 These features were stacked horizontally resulting in a feature vector with a length of 38. Visual employs a sliding window approach to capture the local context of each residue. For a given residue, the feature vectors of the three upstream and three downstream residues were combined into a matrix, resulting in a 2-dimensional array with a size of 7 × 38. An illustrative example of the input data in an image-like format is depicted in Fig. 3, showcasing the center residue serine (S) within a window size of 7. A 7 × 38 image is generated as the input of the CNN classifier. The Visual model comprises two sets of convolutional layers, followed by a pooling layer and a fully connected layer (Fig. 3). Visual was applied to identify the peptide binding sites of proteins and achieved a sensitivity of 0.67 and a ROC AUC of 0.73.
Fig. 3 The workflow of the Visual model. (a) Transforming the protein sequence into a 7 × 38 input image (per residue). In the order from left to right of the image: 3 pixels represent half sphere exposure (HSE),67 3 pixels represent the predicted probabilities of different secondary structures, 1 pixel represents the accessible surface area (ASA) value, 4 pixels represent the local backbone angles, 20 pixels represent the position specific scoring matrix (PSSM), and 7 pixels represent the physicochemical properties of the amino acids. (b) Training and optimizing hyperparameters of the CNN. (c) Testing the optimized CNN on unseen test data to predict the label of each residue (binding/non-binding). Adapted with permission from W. Wardah, A. Dehzangi, G. Taherzadeh, M. A. Rashid, M. Khan, T. Tsunoda and A. Sharma, J. Theor. Biol., 2020, 496, 110278. Copyright 2024 Elsevier.
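The sliding-window construction used by Visual can be sketched as follows: each residue is represented by the 38-dimensional feature vectors of itself and its three neighbors on either side, giving a 7 × 38 array. How Visual treats residues near the termini is not stated here, so the zero-padding in this sketch is an assumption, as are the function and variable names.

```python
WINDOW = 7
N_FEATURES = 38  # 3 HSE + 3 SS + 1 ASA + 4 backbone angles + 20 PSSM + 7 physicochemical

def residue_windows(features, window=WINDOW):
    """Return one window-by-feature matrix per residue.

    features: list of per-residue feature vectors (all of equal length).
    Positions beyond the chain termini are filled with zeros (an assumption).
    """
    half = window // 2
    pad = [0.0] * len(features[0])
    padded = [pad] * half + list(features) + [pad] * half
    return [padded[i:i + window] for i in range(len(features))]

# Hypothetical 10-residue protein with placeholder feature values.
features = [[float(i)] * N_FEATURES for i in range(10)]
images = residue_windows(features)
print(len(images), len(images[0]), len(images[0][0]))
```

Each 7 × 38 matrix is then treated as a single-channel image whose center row belongs to the residue being classified.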
BiteNetPp35 is another CNN-based model that converts 3D protein structures into 4D tensor-based representations and feeds them into a 3D CNN to learn the probability of PepPIs and predict the peptide binding sites/domains. The first three dimensions of the 4D tensor correspond to the x, y, and z dimensions, and the fourth dimension corresponds to 11 channels holding the atomic densities of 11 different atom types, such as aromatic carbon, sulfur, amide nitrogen, and carbonyl oxygen. These four-dimensional tensor-based representations are then fed into 10 three-dimensional convolutional layers to obtain the probability score of “hot spots”, which are defined as the geometric centers of each segmented peptide–protein interface. This model outperforms state-of-the-art (SOTA) methods with a ROC AUC of 0.91 and an MCC of 0.49. The model showed promising power for the prediction of peptide–protein binding sites, but its performance depends on the orientation of the input protein and is sensitive to the protein conformation. Therefore, BiteNetPp could be improved by using representations that are invariant to protein rotation.
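To make the tensor representation and the orientation issue concrete, the sketch below bins atoms into a small voxel grid with 11 atom-type channels and shows that rotating the structure changes the resulting tensor. It uses simple occupancy counts rather than atomic densities, and the grid size, voxel size and function names are assumptions made for illustration, not BiteNetPp's actual settings.

```python
N_CHANNELS = 11  # one channel per atom type, as described for BiteNetPp

def voxelize(atoms, origin, n_bins, voxel_size):
    """Build grid[x][y][z][channel] of occupancy counts.

    atoms: iterable of (x, y, z, channel_index); atoms outside the grid are skipped.
    """
    grid = [[[[0.0] * N_CHANNELS for _ in range(n_bins)]
             for _ in range(n_bins)] for _ in range(n_bins)]
    for x, y, z, channel in atoms:
        idx = [int((c - o) // voxel_size) for c, o in zip((x, y, z), origin)]
        if all(0 <= i < n_bins for i in idx):
            grid[idx[0]][idx[1]][idx[2]][channel] += 1.0
    return grid

def rotate_z_90(atoms):
    """Rotate all atoms by 90 degrees about the z axis."""
    return [(-y, x, z, channel) for x, y, z, channel in atoms]

# Hypothetical single atom of type 0.
atoms = [(1.5, 0.5, 0.5, 0)]
grid = voxelize(atoms, origin=(-2.0, -2.0, -2.0), n_bins=4, voxel_size=1.0)
rotated = voxelize(rotate_z_90(atoms), origin=(-2.0, -2.0, -2.0), n_bins=4, voxel_size=1.0)
print(grid == rotated)  # the same structure yields a different input tensor
```

Because a standard 3D CNN is not rotation-equivariant, the two tensors can lead to different predictions, which is the limitation noted above.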
InterPepRank36 is a representative GCN that has been developed to predict PepPIs. In this model, billions of decoys (computationally generated candidate complex structures) were produced by the PIPER93 docking tool to build the training and testing sets. The peptide–protein complexes were then represented as graphs whose nodes describe individual residues through one-hot encoding, PSSM,94 and self-entropy,94 and whose one-hot encoded edges denote the residue interactions. Both node and edge features were then passed through edge convolution layers, with the output from each layer concatenated and fed into a global pooling layer and two dense layers to predict the LRMSD (ligand root-mean-square deviation) of decoys. InterPepRank achieved a median ROC AUC of 0.86, outperforming other benchmarking methods such as PIPER,93 pyDock3,95 and Zrank.96 For example, in the case of a fragment from the center of troponin I (peptide) binding to the C-terminal domain of Akazara scallop troponin C (receptor),97 the peptide was shown to be disordered when unbound and to become an ordered α-helical structure upon binding,98 following the induced-fit binding mechanism. Predicting the peptide binding conformation and binding sites for systems with induced-fit mechanisms is extremely challenging. The top 100 decoys predicted by both InterPepRank and Zrank showed that both methods can find the true binding site of the peptide. However, InterPepRank achieved an accuracy of 96% in predicting the peptide as an α-helical structure, while Zrank achieved an accuracy of less than 50%, with half of the peptide decoys’ secondary structures predicted as either random coils or β-sheets. Therefore, InterPepRank is a powerful tool for predicting both binding sites and conformations, even in cases where the peptide is disordered when unbound. 
This is a significant advantage over other benchmarked energy-based docking methods, which may struggle with disordered structures that are more energetically favorable in unbound states or easier to fit into false positive binding sites.
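The quantity InterPepRank is trained to predict, the LRMSD of a decoy, can be sketched as the root-mean-square deviation between decoy and native peptide coordinates. The sketch assumes the receptor has already been superimposed and that atoms are listed in matching order; the function names and the example scores are hypothetical.

```python
import math

def lrmsd(native_peptide, decoy_peptide):
    """RMSD between matching peptide atoms, assuming receptors are already aligned."""
    if len(native_peptide) != len(decoy_peptide):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(native_peptide, decoy_peptide))
    return math.sqrt(squared / len(native_peptide))

def rank_decoys(predicted_lrmsd):
    """Return decoy names sorted from lowest to highest predicted LRMSD."""
    return sorted(predicted_lrmsd, key=predicted_lrmsd.get)

# Hypothetical two-atom peptide and a decoy displaced by (0, 3, 4).
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
decoy = [(0.0, 3.0, 4.0), (1.0, 3.0, 4.0)]
print(lrmsd(native, decoy))
print(rank_decoys({"decoy_a": 7.2, "decoy_b": 1.9, "decoy_c": 4.4}))
```

A rescoring model of this kind is evaluated by how well its predicted ordering places near-native decoys at the top of the list.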
Struct2Graph39 is a novel multi-layer mutual graph attention convolutional network for structure-based predictions of PPIs (Fig. 4). Coarse-grained graph embeddings were generated by two GCNs with weight sharing for both components of the protein complexes. These embeddings were then passed through a mutual attention network to extract the relevant features for both proteins and concatenated into a single embedding vector. Attention weights and context vectors were calculated from the GCN-transformed hidden embeddings. Residues with large learned attention weights are more important and more likely to contribute to the interaction. The context vectors were concatenated and passed into a feed-forward network (FFN) and a final Softmax layer to obtain the PPI probability. Struct2Graph outperformed the feature-based ML models and other SOTA sequence-based DL models, achieving an accuracy of 98.89% on a dataset with balanced positive and negative samples and an accuracy of 99.42% on an unbalanced dataset (positive:negative = 1:10). Residue-level interpretation was conducted to identify the residues’ contributions to PepPIs. For example, the Staphylococcus aureus phenol soluble modulin (PSM) peptide PSMα199 competes with the high mobility group box-1 protein (HMGB1) to bind to toll-like receptor-4 (TLR4),100 thus inhibiting HMGB1-mediated phosphorylation of NF-κB.101 For the PSMα1–TLR4 complex, Struct2Graph demonstrated an impressive accuracy of 92%, and the predicted binding residues aligned with the previously identified TLR4 active binding sites. Notably, peptide residues 2Gly and 10Val were accurately predicted as peptide binding residues. Furthermore, Struct2Graph's predictions corroborated the previously studied competitive binding mechanism, indicating that both the PSMα1 peptide and HMGB1 bind to the same area of TLR4.
Interpretable DL graph models have also been employed for the PepPI predictions. Recently, an end-to-end geometric DL architecture known as ScanNet (Spatio-chemical arrangement of neighbors neural NETwork)37 was developed that integrated multi-scale spatio-chemical arrangement information of atoms and amino acids, along with multiple sequence alignment (MSA) for detecting protein–protein binding sites (PPBS). The model took the protein sequence, tertiary structure, and optionally position-weight matrix from MSA of evolutionarily related proteins as input. It first extracted all the atomic neighborhood embeddings, which were then passed through several filters to learn the atomic scale representations. To further reduce the dimensions, atom-wise representations were pooled at the amino acid scale, mixed with extracted amino acid information, and fed into trainable filters to yield amino acid scale representations (Fig. 5(a)). With these representations containing multi-scale spatio-chemical information, ScanNet was trained for the prediction of PPBS on 20k proteins with annotated binding sites. When compared with the traditional ML method XGBoost with handcrafted features, and designed pipeline based on structural homology, ScanNet achieved the highest accuracy of 87.7%. While the structural homology baseline performed almost the same as ScanNet, the accuracy dropped quickly when meeting with the unseen fold during the test because of its strong dependence on the homology that was previously developed. Therefore, it is crucial to understand what ScanNet has actually learned. Specifically, does the network only memorize the training data, or does it really understand the underlying protein–protein binding principles? Detailed visualization and interpretation were explored to illustrate the learned atom-wise representations and amino acid-wise representations. The network has learned different atomic patterns, such as the N–H–O hydrogen bond (Fig. 
5(b)), the SH or NH2 side-chain hydrogen donor surrounded by oxygen atoms (Fig. 5(c)), a carbon in the vicinity of a methyl group and an aromatic ring (Fig. 5(d)), and so on. The detected pattern with solvent-exposed residues frequently appearing in the protein–protein interface (Fig. 5(e)), such as arginine (R), was positively correlated with the output probability of PPBS. However, that with the buried hydrophobic amino acids (Fig. 5(f)), such as phenylalanine (F), was negatively correlated with the output probability of PPBS. Interestingly, the pattern with the exposed hydrophobic amino acid surrounded by charged amino acids, which is the hotspot O-ring102 architecture in protein interfaces, was positively correlated with the output probability (Fig. 5(g)). 2D t-distributed stochastic neighbor embedding (t-SNE) projections further verified that the model has already learned various amino acid-level structural features. 2D t-SNE projections on secondary structures (Fig. 5(h)) clearly illustrated that the model has learned the secondary structural information of the training complexes. With the multi-level knowledge of protein structures, ScanNet captures the underlying chemical principles of protein–protein binding. This SOTA interpretable DL model aids in a deeper understanding of PepPIs and PPIs.
Existing ML and DL models for predicting peptide–protein binding sites mainly focus on identifying binding residues on the protein surface. Sequence-based methods typically take protein sequences as inputs, assuming that a protein maintains fixed binding residues across different peptide binders. However, this assumption doesn’t hold true for most cellular processes, as various peptides may interact with distinct protein residues to carry out diverse functions. Structure-based methods would require a target protein structure and a peptide sequence, thus limiting their applicability to proteins with available structural data. A novel DL framework for peptide–protein binding prediction was proposed, called CAMP,42 to address the above limitations. CAMP takes account of information from the sequence of both peptides and target proteins, and also detects crucial binding residues of peptides for peptide drug discovery.
CAMP extracted data from different sources, including RCSB PDB30,106 and the known peptide drug–target pairs from DrugBank.43,107–110 For each PDB complex, the protein–ligand interaction predictor (PLIP) is employed to identify non-covalent interactions between the peptide and the protein, and these interactions are taken as positive samples for training. Additionally, PepBDB111 aids in determining the binding residues of peptides involved in the specific protein–peptide complexes. Various features are extracted from the primary sequences to construct comprehensive sequence profiles for peptides and proteins. These features include secondary structures, physicochemical properties, intrinsic disorder tendencies, and evolutionary information.27,112–115 CAMP utilizes two multi-channel feature extractors to process peptide and protein features separately (Fig. 6). Each extractor contains a numerical channel for numerical features (PSSM and the intrinsic disorder tendency of each residue), along with multiple categorical channels for diverse categorical features (raw amino acid, secondary structure, polarity and hydropathy properties). Two CNN modules extract hidden contextual features from peptides and proteins. Self-attention layers are also employed to capture long-range dependencies between residues and assess the contribution of each residue to the final interaction. CAMP applies fully connected layers to all integrated features to predict the interaction between proteins and peptides. In addition to binary interaction prediction, CAMP can identify which residues of a peptide interact with the target protein by adding a sigmoid activation function to the output of the peptide CNN module. Compared with three baseline models (DeepDTA,116 PIPR,117 NRLMF118), CAMP demonstrates consistently better performance, with increases of up to 10% and 15% in the area under the curve (AUC) and the area under the precision–recall curve (AUPR), respectively. 
To evaluate its ability to identify binding residues of peptides, the predicted label of each residue of the peptide is compared with the real label for four existing peptide binders. The results show that CAMP correctly predicts binding residues and thus provides reliable evidence for peptide drug design.
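The two ranking metrics used in this comparison can be sketched in a few lines. ROC AUC is computed here as the fraction of positive/negative pairs that are ordered correctly, and AUPR is approximated by average precision; both are standard definitions rather than CAMP's own evaluation code, and the example labels and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive (AUPR estimate)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hypothetical interaction labels and predicted scores for four peptide-protein pairs.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(y_true, scores), average_precision(y_true, scores))
```

AUPR is the more informative of the two when positives are rare, because it is not inflated by the large number of easily rejected negatives.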
Instead of only applying the self-attention layer, Abdin et al. developed a transformer-based architecture known as PepNN, enabling both sequence-based (PepNN-Seq) and structure-based (PepNN-Struct) predictions of peptide binding sites.44 PepNN takes representations of a protein and a peptide sequence as inputs and generates a confidence score for each residue, indicating the likelihood of it being part of a binding site. PepNN-Struct learns a contextual representation of a protein structure through the use of graph attention layers (Fig. 7(a)). In contrast, PepNN-Seq only takes the protein and peptide sequences as inputs (Fig. 7(b)). In the PepNN algorithm, the encoding of the peptide sequence is independent of the protein encoding module, under the assumption that the peptide sequence carries all the necessary information regarding peptide–protein binding. However, in many scenarios, the peptide sequence is not sufficient to determine the bound conformation, as the same peptide can adopt different conformations when bound to different proteins.119 Motivated by this, PepNN incorporates a multi-head reciprocal attention layer that simultaneously updates the embeddings of both the peptide and the protein (Fig. 7(a)). This module attempts to learn the interactions between protein and peptide residues involved in binding.
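The idea of updating both embeddings from one shared set of protein–peptide scores can be sketched as below. This is a single-head illustration without learned projection matrices, and it assumes that reciprocity means normalizing the same score matrix along each axis in turn; it is not PepNN's actual layer, and all names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reciprocal_attention(protein, peptide):
    """Update protein and peptide embeddings from one shared score matrix."""
    dim = len(protein[0])
    # scores[i][j]: scaled similarity of protein residue i and peptide residue j.
    scores = [[dot(p, q) / math.sqrt(dim) for q in peptide] for p in protein]
    # Protein residues attend over peptide residues (row-wise softmax).
    prot_weights = [softmax(row) for row in scores]
    new_protein = [[sum(w * q[k] for w, q in zip(ws, peptide)) for k in range(dim)]
                   for ws in prot_weights]
    # Peptide residues attend over protein residues (column-wise softmax).
    pep_weights = [softmax(list(col)) for col in zip(*scores)]
    new_peptide = [[sum(w * p[k] for w, p in zip(ws, protein)) for k in range(dim)]
                   for ws in pep_weights]
    return new_protein, new_peptide, prot_weights, pep_weights

# Hypothetical 2-dimensional embeddings: three protein residues, two peptide residues.
protein = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
peptide = [[1.0, 0.0], [0.0, 1.0]]
new_protein, new_peptide, prot_w, pep_w = reciprocal_attention(protein, peptide)
print(len(new_protein), len(new_peptide))
```

Sharing the score matrix ties the two updates together, so a protein residue that attends strongly to a peptide residue is in turn a strong candidate for that peptide residue's attention.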
Another challenge in predicting the protein–peptide binding sites is the limited availability of protein–peptide complex training data. Protein–protein complex information was added to the training set to overcome the limited data issue. Notably, not the entire protein–protein complex data were included, because the interactions between two proteins can be mediated by a linear segment in one protein that contributes to the majority of the interface energy. Pre-training of the model was conducted using a substantial dataset of large protein fragment–protein complexes (717932).120 Fine-tuning of the model then took place with a smaller set of peptide–protein complexes (2828), resulting in a considerable enhancement in predictive performance, particularly for the PepNN-Struct model (Fig. 7(c)). PepNN reliably predicts peptide binding sites on an independent test set and three benchmark datasets from the other studies.27–29 PepNN-Struct surpassed most peptide binding site prediction approaches, achieving a higher AUC score. While PepNN generally exhibits lower MCC than the SOTA method AlphaFold-Multimer in most cases, its independence from multiple sequence alignments may render PepNN more suitable for modeling synthetic PepPIs.
While numerous computational methods have been developed for predicting peptide–protein binding sites, many of them need complex data preprocessing to extract features, often resulting in reduced computational efficiency and predictive performance. Wang et al. developed an end-to-end predictive model that is independent of feature engineering named PepBCL.45 This innovative approach leverages pre-trained protein language models to distill knowledge from protein sequences that are relevant to protein structures and functions. Another challenge encountered in identifying protein–peptide binding sites is the issue of imbalanced data. Current work typically constructs a balanced dataset by using under-sampling techniques. However, these techniques remove samples from the majority class to match the size of the minority class. In the PepBCL algorithm, a contrastive learning-based module is introduced to tackle this problem. Unlike conventional under-sampling methods, the contrastive learning module adaptively learns more discriminative representations of the peptide binding residues.
The PepBCL architecture is composed of four essential modules: a sequence embedding module, a BERT-based encoder module,105 an output module, and a contrastive learning module.121,122 In the sequence embedding module, each amino acid of the query sequence is encoded into a pre-trained embedding vector, so that the protein sequence as a whole is encoded as an embedding matrix. In the BERT-based encoder module, the output from the sequence embedding module undergoes further encoding through BERT to generate a high dimensional representation vector.123 The representation vector is then passed through a fully connected layer. In the contrastive learning module, the contrastive loss between any two training samples is optimized to generate more discriminative representations of the binding residues. In the output module, the probability of each residue being in a binding site is calculated (Fig. 8(a)). When compared with the existing sequence-based methods (SPRINT-Seq,25 PepBind,27 Visual,34 and PepNN-Seq44), PepBCL achieves a significant improvement in precision by 7.1%, AUC by 2.2%, and MCC by 1.3% over the best sequence-based predictor, PepBind.27 Furthermore, PepBCL also outperforms all structure-based methods (i.e. Pepsite,61 Peptimap,60 SPRINT-Str,28 and PepNN-Struct44) in terms of MCC. The superior performance of PepBCL indicates that DL approaches can automatically learn features from protein sequences to distinguish peptide binding residues from non-binding residues, eliminating the reliance on additional computational tools for feature extraction. Across the methods assessed, recall and MCC tend to be notably low due to the extreme class imbalance in the dataset, which suggests that many true protein–peptide binding residues may be overlooked. However, PepBCL demonstrates improved recall and MCC values, highlighting the effectiveness of the contrastive module in identifying more true peptide binding residues. This enhancement can be attributed to the ability of contrastive learning to extract more discriminative representations, particularly in imbalanced datasets. Fig. 8(b) visually demonstrates the learned feature space with and without the contrastive learning module, showing a clearer separation of binding and non-binding residues when the module is used.
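The idea of optimizing a loss over pairs of training samples can be sketched as follows. This example uses the classic margin-based pairwise contrastive loss as a stand-in; the exact loss used in PepBCL is not reproduced here, and the function names, the margin value, and the two-dimensional toy embeddings are all assumptions made for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pair_contrastive_loss(u, v, same_class, margin=1.0):
    """Pull same-class pairs together; push different-class pairs
    apart until they are at least `margin` away from each other."""
    d = euclidean(u, v)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

def batch_contrastive_loss(embeddings, labels, margin=1.0):
    """Average the pairwise loss over all unordered pairs in a batch."""
    total, n_pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            same = labels[i] == labels[j]
            total += pair_contrastive_loss(embeddings[i], embeddings[j], same, margin)
            n_pairs += 1
    return total / n_pairs

# Hypothetical residue embeddings; label 1 = binding, 0 = non-binding.
labels = [1, 1, 0, 0]
separated = [(0.0, 0.0), (0.1, 0.0), (2.0, 0.0), (2.1, 0.0)]
mixed = [(0.0, 0.0), (2.0, 0.0), (0.1, 0.0), (2.1, 0.0)]

loss_separated = batch_contrastive_loss(separated, labels)
loss_mixed = batch_contrastive_loss(mixed, labels)
print(loss_separated < loss_mixed)
```

A feature space in which binding and non-binding residues form distinct clusters yields a lower loss than one in which they are interleaved, which is the behaviour the contrastive module is meant to encourage.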
While the AF2 monomer was originally designed for predicting monomeric protein/peptide structures, Tsaban et al. recently showed that it can also successfully predict PepPIs.47 A PepPI can be represented as the folding of a monomeric protein by connecting the peptide to the C-terminus of the receptor with a poly-glycine linker (Fig. 9(a)), which provides a general strategy for performing peptide–protein docking with the AF2 monomer model. This method can not only identify the peptide binding regions but also accommodate binding-induced conformational changes of the receptor. AF2 surpassed RoseTTAFold since the latter tended to fold the polyglycine linker into a globular structure or various interactive loops. For a small dataset of 26 PepPI complexes, AF2 achieved a relatively high accuracy (75%) for complexes whose binding motifs have been experimentally characterized. AF2 also outperformed another peptide docking method, PIPER-FlexPepDock (PFPD),127 in terms of both accuracy and speed. Furthermore, accurate predictions were achieved with AF2 pLDDT values above 0.7, further supporting the view that the AF2 monomer can reliably predict PepPIs. However, the prediction accuracy dropped to 37% when tested on a larger dataset (96 complexes), indicating that further improvements are needed for more accurate PepPI predictions by the AF2 monomer.
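The linker construction described above amounts to simple sequence concatenation, sketched below. The helper name `link_receptor_peptide`, the default linker length of 30 glycines, and the toy sequences are assumptions of this sketch; the resulting single chain would then be passed to a structure predictor, which is not shown.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def link_receptor_peptide(receptor_seq, peptide_seq, linker_length=30):
    """Join receptor and peptide into one chain: receptor + poly-Gly + peptide.

    Returns the linked sequence and the (start, end) slice of the peptide
    within it, so the peptide can be located in the predicted monomer.
    """
    for name, seq in (("receptor", receptor_seq), ("peptide", peptide_seq)):
        unknown = set(seq) - VALID_RESIDUES
        if unknown:
            raise ValueError(f"{name} contains non-standard residues: {sorted(unknown)}")
    linked = receptor_seq + "G" * linker_length + peptide_seq
    start = len(receptor_seq) + linker_length
    return linked, (start, start + len(peptide_seq))

# Hypothetical toy sequences, far shorter than a real receptor.
linked, (start, end) = link_receptor_peptide("MKTAYIAK", "PPLPPR", linker_length=5)
print(linked)
print(linked[start:end])
```

Returning the peptide slice alongside the sequence makes it straightforward to discard the linker residues and evaluate only the peptide pose after prediction.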
The recent release of the AF2 multimer has resulted in a major improvement in PepPI prediction. Using a set of 99 protein–peptide complexes, Shanker and Sanner48 compared the performance of the AF2 monomer, AF2 multimer, and OmegaFold on PepPI prediction with their peptide docking software AutoDock CrankPep (ADCP).91 The new AF2 multimer model, which was trained to predict the interfaces of multimeric protein complexes, reached 53% accuracy and outperformed both OmegaFold (20% accuracy) and ADCP (23% accuracy) (Fig. 9(b)). However, the AF2 multimer model is limited to linear peptides, which restricts its applicability to cyclized peptides or peptides with non-standard amino acids. Effective selection from the top-ranked poses yielded by both the AF2 multimer and the ADCP docking tool was found to further enhance the accuracy to 60%. Therefore, DL protein structure prediction models, especially the AF2 multimer, have achieved high accuracy in PepPI predictions, though limitations exist. Combining these SOTA DL models with traditional peptide docking tools could be a future direction for further improving the accuracy of PepPI predictions.
Leveraging the highly accurate predictions of protein structures by AF2, Motmaen et al.56 developed a more generalized model for the prediction of PepPIs. The model was built by placing a classifier on top of the AF2 network and fine-tuning the combined network (Fig. 9(c)). AF2 was able to achieve optimal performance and generate the most accurate predicted complex structures for a large dataset of peptide–major histocompatibility complex (MHC) complexes. This was accomplished by aligning the peptide sequence with the peptide–protein crystal structures as templates. However, a few misclassifications by AF2 underscored the importance of accurately distinguishing binder and non-binder peptides. To address this issue, a logistic regression layer that normalizes the AF2 predicted aligned error (PAE) score into a binder/non-binder score was placed on top of AF2. Three types of losses were then combined and applied to fine-tune the combined model: structure loss on both peptide and protein for binding peptide–protein complexes, structure loss on protein only for non-binding peptide–protein complexes, and classification loss on the binding/non-binding score. The evaluation of the combined model showed a ROC AUC of 0.97 for class I and 0.93 for class II peptide–MHC interactions. Surprisingly, the fine-tuned model outperformed the previously mentioned HSM model and could also be generalized to PDZ domains (C-terminal peptide recognition domains) and SH3 domains (proline-rich peptide binding domains), despite being trained and fine-tuned only on the peptide–MHC dataset. Therefore, taking advantage of the accurate predictions of protein structures through AF2, fine-tuning the model with existing peptide–protein binding data offers a significant boost to PepPI predictions.
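The role of the logistic regression layer can be illustrated in isolation. The sketch below fits a standalone one-feature logistic regression that maps a mean PAE value to a binder probability; in the published approach the layer sits on top of AF2 and is fine-tuned jointly with the network, which is not reproduced here. The function names, the hyperparameters, and the PAE values are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_pae_classifier(pae, labels, lr=0.5, epochs=500):
    """Fit p(binder) = sigmoid(w * z + b) by gradient descent on the
    log loss, where z is the standardized PAE. Returns a predictor."""
    n = len(pae)
    mean = sum(pae) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in pae) / n)
    z = [(x - mean) / std for x in pae]
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for zi, yi in zip(z, labels):
            err = sigmoid(w * zi + b) - yi  # derivative of log loss w.r.t. logit
            grad_w += err * zi
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n

    def predict(x):
        return sigmoid(w * (x - mean) / std + b)

    return predict

# Made-up mean PAE values (in angstroms): low for binders, high for non-binders.
pae = [3.0, 4.0, 5.0, 6.0, 12.0, 14.0, 15.0, 18.0]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

predict = fit_pae_classifier(pae, labels)
for x in (4.0, 10.0, 16.0):
    print(f"mean PAE {x:5.1f} -> binder probability {predict(x):.2f}")
```

Because a low PAE indicates a confident interface, the fitted weight is negative and the binder probability decreases monotonically as the PAE increases.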
Apart from their well-documented efficiency and accuracy, ML/DL methods offer several other advantages in the prediction of PepPIs. Compared to docking or MD simulation methods, ML or DL methods offer diverse options for model inputs. DL methods, such as transformers and language models, have been shown to achieve great success in predicting PepPIs solely based on sequence information. Beyond raw sequence or structure information, ML methods can also incorporate multi-level information such as evolutionary information, secondary structures, solvent accessible surface area, and so forth, which could significantly enhance the accuracy of the prediction. Furthermore, ML/DL methods can provide more interpretability. The attention mechanism helps reveal the internal dependencies between residues and the contribution of each residue to PepPIs. Graph models capturing multi-scale structure information of peptides and proteins are able to provide insights into the underlying chemical principles of peptide–protein binding and binding patterns. Moreover, ML/DL techniques exhibit a degree of generalizability. Some advanced techniques like transfer learning or one-shot learning, which have been applied in protein engineering and protein–ligand interaction prediction,128–131 could help models trained on certain peptide–protein binding datasets generalize to other peptide–protein complexes.
Despite their numerous advantages, ML and DL methods also have certain limitations in the prediction of PepPIs, which highlight potential areas for future research. One significant challenge is the issue of imbalanced datasets in the training and testing of PepPI prediction models. Given that peptide binding is typically a rare occurrence, the imbalance between positive and negative samples often limits the performance of ML/DL models, which learn the minority binding class poorly. Consequently, ML/DL methods for PepPI predictions have normally been trained on datasets with a positive-to-negative ratio of 1:1. Both oversampling methods, which duplicate or create new samples in the minority class, and undersampling methods, which delete or merge samples in the majority class, can enhance the model performance on imbalanced classification. In addition, challenges arise when dealing with peptides deeply embedded in an enzyme's active site, especially when cofactors are involved. Accurate predictions for such interactions require high-quality structural training data reflecting correct folding for both peptide and enzyme, along with precise knowledge of buried peptide binding positions and poses. Furthermore, accurate geometric and electronic descriptions of cofactors would be necessary to predict the interactions of peptide and protein residues with the cofactors. The scarcity of structural training data for such instances results in relatively worse model performance on these PepPIs. Recent efforts, such as RoseTTAFold All-Atom132 (RFAA), aim to address this challenge. RFAA can model full biological assemblies, including metal cofactors, by training on a comprehensive dataset comprising sequence information, residue pairwise distances from homologous templates, and coordinates of protein–small molecule, protein–metal, and covalently modified protein complexes. As a result, RFAA demonstrates reasonable prediction performance and stands out as the first model capable of predicting arbitrary higher-order biomolecular complexes, encompassing multiple proteins, small molecules, metal ions, and nucleic acids. However, this is a recent development, and RFAA has not yet been applied to PepPI prediction. As advancements in structural biology and computational methods continue, it is foreseeable that more sophisticated models will emerge, further enhancing the capability to accurately predict PepPIs, even those involving buried peptides and cofactors. Additionally, ML/DL methods often fail in the prediction of PepPIs between intrinsically disordered peptides (IDPs) and proteins. IDPs are abundant in nature and have flexible, disordered structures, but adopt stable and well-defined structures upon binding. In these cases, ML/DL methods, particularly structure-based models, tend to fail in predicting binding sites and peptide binding conformations, offering little insight into the binding mechanism. With the enhancement of computing power, high-throughput MD simulations can achieve more accurate predictions of binding sites and peptide/protein conformations as well as a deeper understanding of the mechanism of folding and binding, whether induced fit (binding then folding) or conformational selection (folding then binding). The integration of MD or quantum chemical insights with ML/DL methods could constitute a promising future research direction for PepPI predictions.
Another future direction is to develop ML/DL models to predict cyclic peptide–protein interactions. Cyclic peptides have emerged as a promising therapeutic modality because of their distinct pharmacological characteristics in comparison to small molecules and biologics.3,133,134 For example, cyclic peptides are more resistant to digestive enzymes like peptidases and exoproteases due to their stable cyclic structures. Cyclic peptides have a broader interaction surface than small-molecule drugs and thus may function as inhibitors with high affinity and selectivity for modulating protein–protein interactions. Furthermore, cyclic peptides exhibit better permeability across cell membranes and are less expensive to synthesize compared to antibodies. However, the development of deep learning models for designing cyclic peptides has faced challenges, mostly due to the small number of available structures. Recently, Rettie et al. introduced the AfCycDesign approach, a novel modification of the AlphaFold network for accurate structure prediction and design of cyclic peptides.135 Standard positional encoding in AlphaFold is based on the position of each amino acid in the linear peptide, with the termini being the maximum distance from each other. AfCycDesign modifies the positional encoding with a cyclic offset such that the termini are connected to each other. This approach can accurately predict the structures of cyclic peptides from a single sequence, with 36 out of 49 cases predicted with high confidence (pLDDT > 0.85) matching the native structures with root mean squared deviation (RMSD) < 1.5 Å. Kosugi et al. employed the relative positional encoding with cyclic offset to predict protein–cyclic peptide complexes.136 The cyclic offset was applied only in the cyclic peptide region, while the positional encoding of the protein region remained the default one. The predictions outperformed state-of-the-art local docking tools for cyclic peptide complexes.
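The difference between the linear and cyclic relative offsets can be sketched as follows. This is a simplified illustration of the idea rather than the AfCycDesign implementation: the function names are hypothetical, and the handling of the tie that occurs for even ring sizes (kept positive here) is an assumption.

```python
def linear_offset(n):
    """Relative sequence offset j - i for a linear chain of n residues."""
    return [[j - i for j in range(n)] for i in range(n)]

def cyclic_offset(n):
    """Signed shortest offset from residue i to residue j around a ring
    of n residues, so the two termini become sequence neighbours."""
    offsets = []
    for i in range(n):
        row = []
        for j in range(n):
            d = (j - i) % n      # forward steps from i to j around the ring
            if d > n // 2:       # going backward around the ring is shorter
                d -= n
            row.append(d)
        offsets.append(row)
    return offsets

n = 7  # hypothetical head-to-tail cyclic heptapeptide
lin = linear_offset(n)
cyc = cyclic_offset(n)
print("linear offset between termini:", lin[0][n - 1])
print("cyclic offset between termini:", cyc[0][n - 1])
for row in cyc:
    print(row)
```

In the linear matrix the first and last residues are the farthest apart, whereas in the cyclic matrix they are one step apart and no pair is separated by more than half the ring. For a protein–cyclic peptide complex, a matrix of this kind would replace only the peptide block of the offsets, leaving the protein block linear.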
Future research should also prioritize enhancing the ability of models to generate novel peptide sequences for specific target proteins of interest, thereby contributing to de novo peptide drug design. One promising route is to fine-tune pre-trained protein language models (pLMs). Introducing noise and perturbations within the peptide latent space of a pLM, or masking peptide sequences so that the model learns the probability distribution of peptide binders, could be explored to generate entirely new peptide sequences. Additionally, diffusion models offer another avenue for generative tasks. These models can capture the intricate molecular interactions at the atomic level, thus enabling the generation of new peptide sequences based on peptide–protein complex structures. The resulting novel peptide sequences can subsequently be validated through MD simulations and in vitro and in vivo experimental tests. Therefore, developing new generative models or leveraging pre-trained ML/DL models to facilitate peptide generation represents a noteworthy and promising future direction for advancing peptide drug design.
In conclusion, ML/DL-guided methods have shown significant potential for the accurate predictions of peptide–protein complex structures and binding sites. These SOTA models will undoubtedly further accelerate the process of peptide drug discovery and design.
Footnote
† These authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2024