Jose Isagani B. Janairo*
Department of Biology, De La Salle University, 2401 Taft Avenue, Manila, 0922, Philippines. E-mail: jose.isagani.janairo@dlsu.edu.ph
First published on 12th July 2023
Metal-binding peptides play a central role in bionanotechnology, wherein they are responsible for directing growth and influencing the resulting properties of inorganic nanomaterials. One of the key advantages of using peptides to create nanomaterials is their versatility: subtle changes in the sequence can have a dramatic effect on the structure and properties of the nanomaterial. However, knowing precisely which position and which amino acid should be modified within a given sequence to enhance a specific property can be a daunting challenge owing to combinatorial complexity. In this study, classification based on association rules was performed using 860 gold-binding peptides. Using a minimum support threshold of 0.035 and confidence of 0.9, 30 rules with confidence and lift values greater than 0.9 and 1, respectively, were extracted that can differentiate high-binding from low-binding peptides. The test performance of these rules for categorizing the peptides was found to be satisfactory, as characterized by accuracy = 0.942, F1 = 0.941, MCC = 0.884. What stands out from the extracted rules is the importance of tryptophan and arginine residues in differentiating peptides with high binding affinity from those with low affinity. In addition, positions 2 and 4 within a decapeptide are frequently involved in the association rules, suggesting their importance in influencing peptide binding affinity to AuNPs. Collectively, this study identified sequence rules that may be used to design peptides with high binding affinity.
Among these, the amino acid composition of the peptide was one of the earliest factors to be studied.10,11 These studies revealed that subtle changes in the primary structure can have drastic effects on the size, shape, and properties of the material. Building on these results, it was later determined that the position at which residue substitutions are made also has a large impact on the nanomaterials. For instance, substituting residues 6 and 11 of the Pd4 palladium-binding peptide substantially altered the catalytic activity of the resulting Pd nanoparticles.12 For the AG3 gold-binding peptide, changing the amino acid at the fourth13 or ninth14 position can have a large impact on the structure and arrangement of the resulting AuNPs. These studies underscore the importance of both the amino acid modification and the position at which it is made in tuning the properties of the peptide and the resulting materials. However, determining these factors can be extremely challenging because of the numerous possible combinations that need to be explored. Despite this daunting challenge, understanding how specific sequence variations affect the nanomaterial synthesis process may lead to the discovery of sequence patterns that can be used for the rational design of metal-binding peptides possessing desired properties, such as strong binding affinity. This will also facilitate a deeper and more precise understanding, at the level of individual amino acids, of the interactions between biomolecules and inorganics. Machine learning (ML) is a promising approach to address the complexity of the problem because it can uncover these sequence patterns from a given dataset. In particular, data mining can be used to establish associations between the primary structure and the properties of the peptide.
In this study, classification based on association rules (CBA) was implemented on a dataset composed of decapeptides and their experimentally determined binding affinities for gold nanoparticles (AuNPs). The analysis presents a series of sequence rules that can be used to categorize peptides based on whether they strongly or poorly bind to AuNPs. Based on these rules, specific amino acids and positions within the decapeptide were identified that exhibited strong associations with binding affinity.
The rules were pruned using the M2 method and then used for classification: when a previously unseen case is categorized, the first rule that matches the case determines its class. If no rule applies to the case, it is assigned to the default class, which is Class B. Rule priority is determined primarily by confidence, followed by support. The performance of the extracted rules in classifying the decapeptides in the test set was assessed using accuracy, F1, and the Matthews correlation coefficient (MCC), each calculated from the numbers of true positives, true negatives, false positives, and false negatives.
The goal of CBA is to identify rules, in the form of associations, that are valuable because they provide insights about the dataset that were previously unknown and likely difficult to express explicitly. In the present research, this translates into searching the dataset of decapeptides for patterns in amino acid composition and position that are associated with strong binding affinity. Association rule mining is a data mining technique within machine learning that extracts rules from a given set of data, which can then be used for classification.16 Association rules have previously been used to study protein composition, such as in the identification of hydrophobic sequence motifs associated with a particular secondary structure.21 In another example, association rule mining was applied to protein sequences to uncover global associations in the amino acid composition of known proteins.22 This study is the first instance in which CBA is used to study metal-binding peptides, and it aims to determine sequence motifs associated with strong binding affinity for AuNPs. This is an important step towards rational peptide design, which remains a challenge for bionanotechnology.23
Support, the frequency of occurrence of a rule, is an important factor to consider when association rules are extracted from a given set of transactions.24 Support quantifies how frequently an itemset appears in the dataset; in this case, how often a specific amino acid appears at a particular position. Confidence, on the other hand, quantifies the reliability of assigning a decapeptide to a class when a particular amino acid is observed at a given position in the input data. A higher confidence value indicates a stronger association between an amino acid at a given position and the peptide category (Class A or B); that is, the presence of that amino acid at that position increases the likelihood of the corresponding category. The minimum support was first tuned to identify the threshold that best balances classification performance and the number of rules generated.
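As a rough illustration of these two quantities, the following Python sketch computes support and confidence for a rule of the form {position = residue, ...} → {Class}. The four sequences and their labels are invented for the example and are not from the dataset; the sketch also assumes the common convention that rule support is the fraction of peptides containing both the left-hand side and the class.

```python
def matches(peptide, lhs):
    """True if the peptide carries every (position, residue) item in lhs.

    Positions are 1-based, as in the rule notation P1..P10.
    """
    return all(peptide[pos - 1] == aa for pos, aa in lhs)

def support_and_confidence(dataset, lhs, rhs_class):
    """Return (support, confidence) of the rule lhs -> rhs_class."""
    labels_of_matches = [label for seq, label in dataset if matches(seq, lhs)]
    both = sum(1 for label in labels_of_matches if label == rhs_class)
    support = both / len(dataset)  # LHS and class occur together
    confidence = both / len(labels_of_matches) if labels_of_matches else 0.0
    return support, confidence

# Hypothetical toy data: (decapeptide, class) pairs, illustration only.
toy = [
    ("WAGAKRLVLW", "A"),
    ("WHKQSPTEGW", "A"),
    ("WAGHKRLVLR", "B"),
    ("SAGHKRLVLR", "B"),
]

s1, c1 = support_and_confidence(toy, [(1, "W")], "A")
s2, c2 = support_and_confidence(toy, [(1, "W"), (10, "W")], "A")
print(s1, c1)  # one-item rule {P1 = W} -> A
print(s2, c2)  # two-item rule {P1 = W, P10 = W} -> A
```

In this toy set, adding the second item raises the confidence because the one Class B peptide that starts with W no longer matches.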
As shown in Table 1, a low support cutoff generates many rules, which can lead to overfitting, whereas a relatively high support cutoff limits the number of rules generated, leading to poor classification performance. A minimum support threshold of 0.035 was selected because it appears to be the best compromise between rule complexity, as measured by the number of extracted rules, and classification performance. Nineteen rules were associated with high-binding affinity peptides and 11 with low-binding affinity peptides (Table 2). Apart from the default rule (rule 30), all the identified rules are considered important because their lift values are greater than 1 and their confidence is greater than 0.9. Lift is a measure of rule strength because it indicates how much more often the right-hand side (RHS) of the rule occurs when the left-hand side (LHS) is present than would be expected if the two were independent. Thus, a lift greater than 1 indicates a positive association between the LHS and the RHS beyond what would be expected by chance. Conversely, a lift of less than 1 suggests a negative association, whereas lift = 1 indicates no association, that is, independence between the LHS and the RHS.
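The relationship between confidence and lift can be checked with a short Python sketch. It assumes the usual definition, lift = confidence / P(RHS), and reads the class prior of 0.5 from the default rule in Table 2, whose confidence is 0.5; the function name is illustrative.

```python
def lift(confidence, rhs_prior):
    """Lift of a rule: its confidence divided by the baseline frequency
    of the right-hand side (here, the class) in the training data."""
    return confidence / rhs_prior

CLASS_PRIOR = 0.5  # taken from the default rule {} -> {Class = B} in Table 2

# Confidence values as printed in Table 2 (already rounded there).
for name, conf in [("{P1 = W, P10 = W} -> A", 1.0),
                   ("{P8 = W} -> A", 0.948),
                   ("{} -> B (default)", 0.5)]:
    print(f"{name}: lift = {lift(conf, CLASS_PRIOR):.3f}")
```

Because the confidences in Table 2 are rounded, lifts recomputed this way can differ from the tabulated ones in the last digit (for example, 1.896 here versus 1.897 in the table).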
Minimum support | 0.01 | 0.02 | 0.03 | 0.035 | 0.04 | 0.05 |
---|---|---|---|---|---|---|
Training set (n = 602) | | | | | | |
Accuracy | 1 | 1 | 0.982 | 0.968 | 0.89 | 0.836 |
F1 | 1 | 1 | 0.982 | 0.968 | 0.88 | 0.818 |
MCC | 1 | 1 | 0.963 | 0.937 | 0.794 | 0.684 |
Test set (n = 258) | | | | | | |
Accuracy | 0.981 | 0.981 | 0.954 | 0.942 | 0.814 | 0.783 |
F1 | 0.981 | 0.981 | 0.954 | 0.941 | 0.7838 | 0.748 |
MCC | 0.962 | 0.962 | 0.907 | 0.884 | 0.654 | 0.589 |
Number of association rules | 59 | 45 | 35 | 30 | 22 | 10 |
No. | Association rule | Support | Confidence | Lift |
---|---|---|---|---|
1 | {P1 = W, P10 = W} → {Class = A} | 0.0764 | 1 | 2 |
2 | {P2 = W, P3 = K} → {Class = A} | 0.0598 | 1 | 2 |
3 | {P4 = W, P8 = G} → {Class = A} | 0.0465 | 1 | 2 |
4 | {P2 = W, P7 = H} → {Class = A} | 0.0449 | 1 | 2 |
5 | {P1 = H, P2 = W} → {Class = A} | 0.0415 | 1 | 2 |
6 | {P2 = W, P4 = W} → {Class = A} | 0.0415 | 1 | 2 |
7 | {P2 = W, P6 = M} → {Class = A} | 0.0382 | 1 | 2 |
8 | {P2 = T, P10 = W} → {Class = A} | 0.0365 | 1 | 2 |
9 | {P4 = Q, P9 = W} → {Class = A} | 0.0365 | 1 | 2 |
10 | {P5 = W, P10 = W} → {Class = A} | 0.0465 | 0.966 | 1.931 |
11 | {P5 = E, P7 = H} → {Class = A} | 0.0449 | 0.964 | 1.929 |
12 | {P2 = W, P9 = A} → {Class = A} | 0.0432 | 0.963 | 1.926 |
13 | {P5 = W, P6 = E} → {Class = A} | 0.0365 | 0.957 | 1.913 |
14 | {P4 = W, P7 = Q} → {Class = A} | 0.0349 | 0.955 | 1.909 |
15 | {P8 = W} → {Class = A} | 0.0914 | 0.948 | 1.897 |
16 | {P1 = W} → {Class = A} | 0.148 | 0.947 | 1.894 |
17 | {P5 = K, P6 = W} → {Class = A} | 0.0415 | 0.926 | 1.852 |
18 | {P6 = M, P8 = G} → {Class = A} | 0.0398 | 0.923 | 1.846 |
19 | {P3 = Q, P10 = Q} → {Class = A} | 0.0332 | 0.909 | 1.818 |
20 | {P5 = H} → {Class = B} | 0.0631 | 1 | 2 |
21 | {P10 = R} → {Class = B} | 0.0565 | 1 | 2 |
22 | {P9 = K} → {Class = B} | 0.0498 | 1 | 2 |
23 | {P5 = T} → {Class = B} | 0.0432 | 1 | 2 |
24 | {P8 = R} → {Class = B} | 0.0431 | 1 | 2 |
25 | {P1 = S} → {Class = B} | 0.0432 | 1 | 2 |
26 | {P7 = R} → {Class = B} | 0.0432 | 1 | 2 |
27 | {P10 = L} → {Class = B} | 0.0399 | 1 | 2 |
28 | {P6 = R} → {Class = B} | 0.0382 | 1 | 2 |
29 | {P3 = R} → {Class = B} | 0.0365 | 1 | 2 |
30 | {} → {Class = B} | 0.5 | 0.5 | 1 |
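The way such a rule list is applied to a new sequence can be sketched in Python as follows. The rules are transcribed from Table 2; the ordering by confidence and then support, the first-match decision, and the Class B default follow the description of the classifier, while the tie-breaking by list order and the three example sequences are assumptions made for illustration.

```python
# (LHS as {position: residue}, class, support, confidence), from Table 2.
RULES = [
    ({1: "W", 10: "W"}, "A", 0.0764, 1.0), ({2: "W", 3: "K"}, "A", 0.0598, 1.0),
    ({4: "W", 8: "G"}, "A", 0.0465, 1.0), ({2: "W", 7: "H"}, "A", 0.0449, 1.0),
    ({1: "H", 2: "W"}, "A", 0.0415, 1.0), ({2: "W", 4: "W"}, "A", 0.0415, 1.0),
    ({2: "W", 6: "M"}, "A", 0.0382, 1.0), ({2: "T", 10: "W"}, "A", 0.0365, 1.0),
    ({4: "Q", 9: "W"}, "A", 0.0365, 1.0), ({5: "W", 10: "W"}, "A", 0.0465, 0.966),
    ({5: "E", 7: "H"}, "A", 0.0449, 0.964), ({2: "W", 9: "A"}, "A", 0.0432, 0.963),
    ({5: "W", 6: "E"}, "A", 0.0365, 0.957), ({4: "W", 7: "Q"}, "A", 0.0349, 0.955),
    ({8: "W"}, "A", 0.0914, 0.948), ({1: "W"}, "A", 0.148, 0.947),
    ({5: "K", 6: "W"}, "A", 0.0415, 0.926), ({6: "M", 8: "G"}, "A", 0.0398, 0.923),
    ({3: "Q", 10: "Q"}, "A", 0.0332, 0.909),
    ({5: "H"}, "B", 0.0631, 1.0), ({10: "R"}, "B", 0.0565, 1.0),
    ({9: "K"}, "B", 0.0498, 1.0), ({5: "T"}, "B", 0.0432, 1.0),
    ({8: "R"}, "B", 0.0431, 1.0), ({1: "S"}, "B", 0.0432, 1.0),
    ({7: "R"}, "B", 0.0432, 1.0), ({10: "L"}, "B", 0.0399, 1.0),
    ({6: "R"}, "B", 0.0382, 1.0), ({3: "R"}, "B", 0.0365, 1.0),
]
DEFAULT_CLASS = "B"  # rule 30, {} -> {Class = B}

# Higher confidence first, then higher support; Python's sort is stable,
# so remaining ties keep the table order (an assumption of this sketch).
ORDERED_RULES = sorted(RULES, key=lambda r: (-r[3], -r[2]))

def classify(peptide):
    """Return the class of the first matching rule, else the default class."""
    if len(peptide) != 10:
        raise ValueError("the rules are defined for decapeptides only")
    for lhs, label, _support, _confidence in ORDERED_RULES:
        if all(peptide[pos - 1] == aa for pos, aa in lhs.items()):
            return label
    return DEFAULT_CLASS

# Hypothetical sequences, chosen only to exercise different rules.
for seq in ("WAGAKELVLW", "WAGAKELVLR", "AAGAAELVLA"):
    print(seq, classify(seq))
```

The second sequence shows why rule order matters: it matches {P1 = W} → A (confidence 0.947) but is assigned to Class B because {P10 = R} → B has higher confidence and is checked first.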
The extracted rules highlight the importance of tryptophan for high binding affinity. This is evidenced by the prevalence of tryptophan in the extracted rules, at multiple positions and in combination with other amino acids. Eighty-four percent of the rules for categorizing peptides into Class A involve tryptophan (Fig. 1). This residue is known to be important in AuNP synthesis because of its ability to reduce Au3+ ions,25,26 and increasing the number of tryptophan residues in a gold biomineralization peptide increases its reducing efficiency.27 However, the binding affinity data for the decapeptides used in this study were obtained using already formed AuNPs. This suggests that the importance of tryptophan in gold-binding peptides is due to its contribution to binding to the AuNP surface rather than to its ability to reduce Au ions and influence nucleation. This is consistent with the knowledge that tryptophan can bind to AuNPs through its indole, carboxyl, and amino groups.28,29 Moreover, when the tryptophan residue of the gold-binding peptide AuBP1 (WAGAKRLVLRRE) was changed to alanine, the binding affinity of the peptide for gold, as determined through QCM measurements, decreased substantially.30 These studies support the view that tryptophan is directly involved in anchoring the biomolecule onto the metal surface.
Fig. 1 Frequency of amino acids in the association rules that classify peptides into either Class A or B.
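The share of rules that involve a given residue can be recomputed directly from the left-hand sides in Table 2, as in the Python sketch below. Counting each rule once, regardless of how many times the residue appears in it, and including the default rule among the Class B rules are assumptions of this sketch, not a description of how Fig. 1 was produced.

```python
# Left-hand sides transcribed from Table 2; "" is the default rule {}.
CLASS_A_RULES = [
    "P1=W,P10=W", "P2=W,P3=K", "P4=W,P8=G", "P2=W,P7=H", "P1=H,P2=W",
    "P2=W,P4=W", "P2=W,P6=M", "P2=T,P10=W", "P4=Q,P9=W", "P5=W,P10=W",
    "P5=E,P7=H", "P2=W,P9=A", "P5=W,P6=E", "P4=W,P7=Q", "P8=W", "P1=W",
    "P5=K,P6=W", "P6=M,P8=G", "P3=Q,P10=Q",
]
CLASS_B_RULES = [
    "P5=H", "P10=R", "P9=K", "P5=T", "P8=R", "P1=S", "P7=R", "P10=L",
    "P6=R", "P3=R", "",
]

def rules_with_residue(rules, residue):
    """Count the rules whose left-hand side contains the given residue."""
    count = 0
    for rule in rules:
        items = rule.split(",") if rule else []
        if any(item.split("=")[1] == residue for item in items):
            count += 1
    return count

w_in_a = rules_with_residue(CLASS_A_RULES, "W")
r_in_a = rules_with_residue(CLASS_A_RULES, "R")
r_in_b = rules_with_residue(CLASS_B_RULES, "R")
print(f"W in Class A rules: {w_in_a}/{len(CLASS_A_RULES)}")
print(f"R in Class A rules: {r_in_a}/{len(CLASS_A_RULES)}")
print(f"R in Class B rules: {r_in_b}/{len(CLASS_B_RULES)}")
```

Under these assumptions, tryptophan appears in 16 of the 19 Class A rules (about 84%), and arginine appears in none of the Class A rules and in 5 of the 11 Class B rules (about 45%).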
Among the rules that classify peptides into Class B, 46% involve arginine. Notably, the rules that classify peptides into Class A did not involve arginine (Fig. 1). The strong association of arginine with low-binding affinity peptides was unexpected, considering that this amino acid is known to bind strongly to gold surfaces.31,32 However, the exact location of arginine within the decapeptide may have a stronger contribution, leading to an association with weaker binding affinity for AuNPs. While amino acid composition is an important factor in determining the ability of peptides to bind to inorganic surfaces, the orientation of the peptide during binding is equally important. Peptides have very different structures in the unbound state compared to the bound state,33 which demonstrates the importance of peptide structure and orientation in relation to binding. In addition, the manner in which the peptide is bound to the surface of the nanoparticle also affects the resulting properties of the nanomaterial.3 Therefore, the rules for low-binding peptides may suggest that the placement of arginine at these specific locations can influence peptide structure, leading to decreased affinity for gold. This is plausible considering that the precise placement of arginine within a given sequence has previously been observed to have long-range effects on the peptide structure, especially on helical motifs.34 Related to this observation, this study identified positions within the decapeptide sequence that appeared to be critical in influencing the binding behavior of the peptide. Positions 2 and 4 were frequently present in the mined association rules for differentiating Class A peptides from Class B peptides (Fig. 2). The exact reason why these positions influence the classification of the peptides based on binding affinity requires further examination and should be explored in future studies.
However, it can be postulated that these positions are important for anchoring the peptide onto the gold surface. The structure and orientation of bound metal-binding peptides are dictated by the anchoring points, wherein the contributing residues are always in direct contact with the surface.35 As has been shown in previous studies, specific positions within a sequence are important for the binding affinity of the biomolecule. For the AYSSGAPPMPPF gold-binding peptide, for instance, it was discovered that the second, ninth, and twelfth residues are anchoring points onto the gold surface,36 and changing the amino acid at any of these locations can be used to modulate the binding affinity. For the AuBP1 peptide, the C- and N-termini appear to play an integral role in the peptide-binding process because point mutations at these locations severely impaired the ability of the peptide to bind to the gold surface.30 Apart from influencing the anchoring behavior of peptides, specific mutations within the sequence can also alter gold-binding behavior by varying the interaction of the peptide with the solvent.37 Considering the importance of the location and correct placement of amino acids within a sequence, the present study identified precise locations within a decapeptide that can be further explored to evaluate how they can be used to optimize peptide properties.
Fig. 2 Frequency of the positions within the decapeptide in the association rules that classify peptides into either Class A or B.
Identifying “hotspots” in a peptide sequence can be immensely difficult owing to combinatorial complexity. This study presents an alternative method, based on machine learning in the form of association rule mining, for discovering the positions and amino acids that are associated with binding affinity. Although ML models have been created in the past to study metal-binding peptides,18–20,38,39 this is the first attempt to use CBA to study their sequence patterns. Previous ML applications have focused on identifying the physicochemical properties of peptides associated with high binding affinity. While the results of these analyses have shed light on the general features of metal-binding peptides, the sequence rules presented in this study offer more specific and precise insights regarding amino acid composition and location. The results of this study offer a starting point for the design and optimization of gold-binding peptides.
As with any model, the sequence rules discovered through data mining have inherent limitations. The first limitation is that the rules apply only to decapeptides, because they are based on amino acid composition and position within a ten-residue sequence. This restriction reflects the scarcity of available data in this domain, which is also why external validation was not conducted. Best effort was exerted to find reported decapeptides with high and low binding affinities, but the search did not yield suitable data for external validation. Be that as it may, the findings presented still offer novel insights into how the peptide sequence can influence binding behavior to gold surfaces.
These results provide the groundwork for further exploration to fully elucidate why these positions are frequently involved in the rules, and to pinpoint the role of tryptophan and arginine residues in AuNP synthesis. For example, the peptide with the highest binding affinity in the dataset could be used as a model to analyze the effects of systematically changing the amino acid at positions 2 and 4. In addition, molecular dynamics simulations can be carried out to derive insights on the impact of tryptophan and arginine residues within the decapeptide on its conformation. Similarly, DFT calculations can also be employed to study the rules from a reactivity perspective. These are exciting points of inquiry which can shed light on these research questions and may lead us closer to the rational design of metal-binding peptides.
This journal is © The Royal Society of Chemistry 2023