Douglas H. Chang,‡a Joshua D. Richardson,‡a Myung-Ryul Lee,a David M. Lynn,*ab Sean P. Palecek*a and Reid C. Van Lehn*ab
a Department of Chemical and Biological Engineering, University of Wisconsin–Madison, Madison, WI, USA. E-mail: david.lynn@wisc.edu; sppalecek@wisc.edu; vanlehn@wisc.edu
b Department of Chemistry, University of Wisconsin–Madison, Madison, WI, USA
First published on 20th February 2025
Antimicrobial peptides (AMPs) are promising compounds for the treatment and prevention of multidrug-resistant infections because of their ability to directly disrupt microbial membranes, a mechanism that is less likely to lead to resistance compared to antibiotics. Unfortunately, natural AMPs are prone to proteolytic cleavage in vivo and have relatively low selectivity for microbial versus human cells, motivating the development of synthetic peptidomimetics of AMPs with improved peptide stability, activity, and selectivity. However, a lack of understanding of structure–activity relationships for peptidomimetics constrains development to rational design or experimental predictors, both of which are cost and time prohibitive, especially when the design space of possible sequences scales exponentially with the number of amino acids. To address these challenges, we developed an iterative Gaussian process regression (GPR) approach to explore a large design space of 336000 synthetic α/β-peptide analogues of a natural AMP, aurein 1.2, based on an initial training set of 147 sequences and their biological activities against microbial pathogens and selectivity for microbes vs. mammalian cells. We show that the quantification of prediction uncertainty provided by GPR can guide the exploration of this design space via iterative experimental measurements to efficiently discover novel sequences with up to a 52-fold increase in antifungal selectivity compared to aurein 1.2. The highest selectivity peptide discovered using this approach features an unconventional substitution of cationic amino acids in the hydrophobic face and would be unlikely to be explored by conventional rational design. 
Overall, this work demonstrates a generalizable approach that integrates computation and experiment to accurately predict the selectivity of AMPs containing synthetic amino acids, which we employed to discover new α/β-peptides that hold promise as selective antifungal agents to combat the antimicrobial resistance crisis.
One strategy to address these issues is to substitute traditional α-amino acids with noncanonical β-amino acids, which have an additional carbon in the backbone relative to α-amino acids, in specific patterns and positions of the original sequence, as shown in Fig. 1.9–13 This class of synthetic peptides, known as α/β-peptides, can exhibit enhanced proteolytic resistance while maintaining side chain presentations similar to those of α-peptides, enabling the templating of sequences on naturally occurring AMPs.14–18 In addition, using β-amino acids enables the exploration of a much larger array of amino-acid combinations compared to naturally occurring peptides (such as repeating ααβ or ααβαααβ backbone patterning or the presence of the helix-stabilizing component ‘ACPC’ shown in Fig. 1), which could lead to enhanced stability and selectivity profiles. For example, our previous studies introduced α/β-peptides developed as analogues of aurein 1.2, an AMP found in Australian bell frogs, to enhance their antifungal selectivity against Candida albicans (Fig. 1).9,19 C. albicans is the most prevalent hospital-acquired fungal pathogen and is known to cause bloodstream infections,20 leading to a mortality rate of up to 70% in the case of sepsis.21 Despite an alarming increase in antifungal resistance, only a handful of antifungal drug classes are approved for use in the clinic, motivating a drive for the development of new treatments, including AMP-based therapeutics.22–24 Recent rational design efforts by our group have resulted in α/β-peptides that exhibit significantly enhanced antifungal selectivity, with up to a 22-fold improvement over aurein 1.2.19
The antifungal selectivities of α/β-peptides in our prior work were quantified using a selectivity index (SI). This parameter is the ratio of mammalian cell toxicity (often defined as ‘HC10’, the peptide concentration required to cause 10% hemolysis, i.e., lysis of human red blood cells) to antifungal activity (defined as the minimum inhibitory concentration, or ‘MIC’, needed to inhibit at least 90% of fungal cell growth).9 The SI is used as a metric to guide the design of highly selective (high-SI) AMPs and mimetics.9,25,26 However, obtaining AMPs and peptidomimetics with high antifungal SI remains challenging compared to the design of highly selective antibacterial AMPs because antifungal toxicity correlates more strongly with hemolysis than does antibacterial toxicity.19,27 This challenge is largely attributed to the membrane compositions of eukaryotic fungal cells, which resemble those of mammalian cells more closely than do those of prokaryotic bacterial cells and which affect the membrane-active mechanism of most AMPs and mimetics.28–30 While many peptides can disrupt membranes, the most valuable therapeutic options would be those that are most selective.
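The ratio described above can be sketched in a few lines. This is a minimal illustration, assuming HC10 and MIC are reported in the same concentration units; the function names and the numerical values are hypothetical and are not taken from the study.

```python
import math

def selectivity_index(hc10: float, mic: float) -> float:
    """SI = HC10 / MIC, with both concentrations in the same units."""
    if hc10 <= 0 or mic <= 0:
        raise ValueError("concentrations must be positive")
    return hc10 / mic

def log2_selectivity(log2_hc10: float, log2_mic: float) -> float:
    """On log2-scaled data the ratio becomes a difference."""
    return log2_hc10 - log2_mic

# Hypothetical peptide: HC10 = 512 and MIC = 16 (same units).
si = selectivity_index(512.0, 16.0)
log2_si = log2_selectivity(math.log2(512.0), math.log2(16.0))
print(si, 2.0 ** log2_si)
```

Because the ratio turns into a difference on a log2 scale, a line of constant SI is a straight diagonal in a plot of log2(HC10) against log2(MIC).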
Our past studies have shown that experimentally determined peptide physicochemical properties such as helical rigidity and hydrophobicity can be used to accurately predict fungal activity and hemolysis metrics, in addition to providing specific ranges over which high selectivity could be expected.9,19 However, experimentally determined physicochemical properties require low-throughput biophysical characterization. In addition, most physicochemical properties of short peptides are highly correlated, rendering rational design a difficult strategy because altering a single amino acid can result in changes in multiple physicochemical properties.31 Consequently, rational design approaches require educated guesses about the results of a single amino acid substitution followed by α/β-peptide synthesis, an overall approach that is often prohibitive at large scale in terms of time and financial cost. Therefore, a predictive method to navigate the sequence space and discover novel α/β-peptide sequences with improved antifungal selectivity would be useful and could significantly expedite development. Unfortunately, the prediction of novel α/β-peptide sequences is hindered by the relative lack of experimental data available in databases compared to natural α-peptide AMPs (such as APD3,32 CAMPR3,33 and DRAMP34). This data scarcity limits the application of large database-driven machine learning (ML) methods, such as deep learning,35–37 natural language models,38 and variational autoencoding,39 which have proven useful in past studies for the prediction of novel canonical AMP sequences. Additionally, models trained on quickly generated sequence-based data representations for α-peptides involving only the 20 canonical amino acids (e.g., one-hot encoding40,41) are not directly transferable to AMPs containing synthetic amino acids due to these data scarcity limits. Consequently, predicting new sequences of synthetic peptidomimetics remains challenging.
In this study, we demonstrate that highly selective α/β-peptidomimetics can be discovered by iteratively predicting their biological activities in silico using an ML model with computationally derived descriptors and then measuring selectivity using in vitro experimental measurements to test model predictions and update model parameters. Using an initial dataset of 147 α/β-peptide sequences as a starting point,9,19 we developed an iterative Gaussian process regression (GPR) workflow to predict HC10 and MIC values of novel α/β-peptide sequences templated on the most selective sequences in this initial training set. As input to the GPR model, we used 2D molecular descriptors that can be quickly computed from peptide chemical structures without relying on large databases or computationally expensive predictive techniques. Peptides were selected for synthesis based upon GPR predictions, with experimentally measured HC10 and MIC values then used to update GPR model parameters to guide the next prediction round. Because GPR estimates the uncertainty of regression predictions,42–46 this approach permits evaluation of promising new sequences with low uncertainty and high predicted SI while also testing sequences with high uncertainty to expand the design space (e.g., sequences with new amino acids) for future prediction rounds. After six rounds, we identified new selective α/β-peptides with up to a 52-fold improvement in SI compared to aurein 1.2. Although these new peptides possess desirable physicochemical properties typically indicative of high antifungal selectivity, their sequences, which largely contain unconventional amino acid substitutions, would be less likely to be identified through rational design. Our results demonstrate the effectiveness of an iterative GPR-guided strategy to screen through a large sequence space and identify multiple new α/β-peptides with significantly enhanced antifungal selectivity. 
These peptides could, with further development, prove valuable as therapeutic agents to combat the rising incidence of drug-resistant fungal infections.
$$\mathrm{SI} = \frac{\mathrm{HC}_{10}}{\mathrm{MIC}} \qquad (1)$$
Fig. 2 Schematic demonstrating the integrated computational and experimental iterative approach used in this study. (a) Training set of ‘n’ sequences with corresponding HC10 and MIC labels for Gaussian process regression (GPR). (b) All sequences were converted to SMILES string representations (Section S1†), in which selected amino acids are highlighted in red for visualization. (c) After calculating 200 descriptors for each sequence and removing all constant descriptors, LASSO cross validation was used to reduce the set of descriptors. The reduced set of descriptors was then used to train separate GPR models for HC10 and MIC using data in the training set (right plot, see Table S4† for model parameters). (d) The trained GPR model was applied to predict HC10 and MIC for all peptides in the design space generated from the template sequence. New sequences were selected for experimental synthesis based on a balance between sequences with low uncertainty predictions for model validation and high uncertainty predictions for further model training. (e) HC10 and MIC values for new sequences were quantified experimentally, compared to model predictions, and added to the training set (a) for the next round. This figure includes graphics imported from https://www.biorender.com. |
From two-fold serial dilutions, HC10 was interpolated continuously, whereas MIC was discretely determined to stay consistent with conventional methodologies.9,19 All HC10 and MIC data were log2 scaled to accommodate the serially diluted nature of the experimental measurements. Of the 147 sequences, fifteen are newly reported in this study (Appendix I) and were evaluated with the same methodology as that of prior studies to ensure reliable SI quantification. As depicted in Fig. 2a, the training set consisted of peptides with β-amino acids prefixed with a “β” in the amino-acid sequence, in addition to the β-amino acid ‘ACPC’ (trans-2-aminocyclopentane-carboxylic acid), simplified as ‘X’ (Fig. 1 and 2a) which is a cyclic 5-membered ring that improves helical folding and stability.14,49–51 For our approach, each sequence in the training set (Fig. 2a; i.e., the sequences with corresponding experimentally measured HC10 and MIC) was first transformed into its simplified molecular-input line-entry system (SMILES) representation (Fig. 2b), which uniquely represents the bonding, branching, ring groups, and aromaticity of each peptide52 through the definition of both backbone and sidechain fragments (see Section S1 and Fig. S1† for more details). Unlike popular sequence encodings that can only categorically distinguish between amino acids in peptide sequences (e.g., binary vectors in one-hot encoding40,41), the SMILES data representation captures important molecular properties of these amino acids in the context of their local bonding environment. We utilized these SMILES strings as input to the RDKit Cheminformatics toolkit to calculate 200 molecular descriptors for each sequence.53 Descriptors are numerical values associated with each sequence that quantify properties such as atomic van der Waals surface areas (VSA), partial charges, bonding and branching complexity, and chemical group counts (Section S2†).
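The difference between a continuously interpolated HC10 and a discretely determined MIC can be illustrated with a short sketch. It assumes linear interpolation of percent hemolysis against log2(concentration) between the two dilutions that bracket 10%; that interpolation rule, the function names, and all readings are assumptions made for illustration and are not the authors' protocol.

```python
import math

def interpolate_hc10(concs, hemolysis, threshold=10.0):
    """Concentration at which hemolysis crosses `threshold` percent.

    Interpolates linearly in log2(concentration) between the two
    bracketing dilutions; returns None if no pair brackets the threshold.
    """
    pairs = sorted(zip(concs, hemolysis))
    for (c_lo, h_lo), (c_hi, h_hi) in zip(pairs, pairs[1:]):
        if h_lo < threshold <= h_hi:
            frac = (threshold - h_lo) / (h_hi - h_lo)
            log2_c = math.log2(c_lo) + frac * (math.log2(c_hi) - math.log2(c_lo))
            return 2.0 ** log2_c
    return None

def discrete_mic(concs, inhibition, cutoff=90.0):
    """Lowest tested concentration with at least `cutoff` percent inhibition."""
    hits = [c for c, inh in zip(concs, inhibition) if inh >= cutoff]
    return min(hits) if hits else None

# Hypothetical two-fold dilution series and readings (percent).
concs = [8.0, 16.0, 32.0, 64.0, 128.0]
hc10 = interpolate_hc10(concs, [1.0, 4.0, 8.0, 16.0, 40.0])
mic = discrete_mic(concs, [20.0, 95.0, 99.0, 100.0, 100.0])
print(math.log2(hc10), math.log2(mic))
```

In this toy series the interpolated log2(HC10) falls between two tested dilutions, whereas log2(MIC) is always an integer step of the dilution series.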
The total number of descriptors was large compared to the number of training sequences, creating a risk of overfitting and, consequently, of predictions that are not robust. Therefore, the number of descriptors used during model training in each prediction round was reduced in two steps. First, all descriptors that had constant numerical values for all sequences in the training set were excluded because they cannot distinguish between properties of different sequences (Fig. S2†). Second, we performed 10-fold LASSO cross validation (CV) (left plot in Fig. 2c) to minimize the average root-mean-square error (RMSE) of model predictions compared to the measured HC10 and MIC values (further details in Section S2 and Fig. S3†). This approach quantifies the trade-off between the number of descriptors and prediction accuracy to identify a reduced set of descriptors that yields acceptable prediction accuracy. This procedure resulted in between 13 and 40 descriptors used for model training in each round. Using this reduced descriptor set per round (Tables S1 and S2†), the GPR model with the highest accuracy (R²) from a hyperparameter search (right plot in Fig. 2c) was then selected (see Table S4†) for HC10 and MIC predictions of novel α/β-peptide sequences. We note that separate models, using separate sets of descriptors, were trained to independently predict HC10 and MIC values. GPR model selection criteria are described in Section S3.† For all regression steps, HC10 and MIC values were log2 scaled (Fig. S6†) to simplify comparisons to experimental values of hemolysis and antifungal activity, which are typically assessed, by convention, using 2-fold serial dilutions of peptide concentrations.
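The first reduction step, removing descriptors that are constant across the training set, can be sketched as follows. The data layout (a dictionary mapping descriptor names to per-sequence values) and the numbers are hypothetical; the LASSO CV step is not shown because it depends on the fitting library used.

```python
def drop_constant_descriptors(table):
    """Keep only descriptors that take more than one value.

    `table` maps a descriptor name to its values over all training
    sequences. Values are compared exactly, so near-constant columns
    that differ only by rounding noise are kept.
    """
    return {name: vals for name, vals in table.items() if len(set(vals)) > 1}

# Hypothetical descriptor values for three training sequences.
descriptors = {
    "MolLogP": [-1.2, -0.4, 0.3],
    "TPSA": [610.5, 598.1, 640.0],
    "NumRadicalElectrons": [0, 0, 0],  # constant, carries no information
}
kept = drop_constant_descriptors(descriptors)
print(sorted(kept))
```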
The two trained GPR models (for HC10 and MIC) were then used to assess peptides from the design space of 168000 possible sequences (for a single ‘template sequence’), which we refer to as ‘test sequences.’ The ‘template sequences’ were the three highest SI sequences in the initial training set (Fig. 2a) for each backbone (specifically ααβ and ααβαααβ, depicted in orange in Fig. 3c and d) that were selected to provide a well-defined sequence space to explore in the step shown in Fig. 2d using the iterative GPR approach (details on design space generation are explained in depth in the next section, ‘properties of training sequences and design space generation’). Test sequences were converted to SMILES representations and corresponding descriptors were calculated. Due to the large range of descriptor values for the test sequences (up to six times larger than the upper bound of descriptor values for training sequences, see Fig. S7 and S8†), we only used the GPR models to predict HC10 and MIC values for test sequences for which all descriptor values were within the bounds of the values computed for training sequences to prevent large prediction inaccuracies due to extrapolation (Table S5†). We then utilized the ability of GPR to quantify prediction uncertainty through standard deviations (σ) to inform the selection of test sequences for experimental synthesis. To compare prediction uncertainties across prediction rounds by accounting for the changes in standard deviation distributions that resulted from updating the descriptors (Section S2†) and the GPR models (Section S3†) every round, we define the normalized standard deviation (NSD), which varies from 0 to 1, in eqn (2):
$$\mathrm{NSD} = \frac{\sigma - \sigma_{\min}}{\sigma_{\max} - \sigma_{\min}} \qquad (2)$$
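A minimal sketch of one normalization that maps σ onto the 0-to-1 range described above, assuming min–max scaling over the test sequences of a single round. The scaling rule, the function names, and the cut-offs used to separate low-uncertainty (validation) from high-uncertainty (exploration) candidates are assumptions for illustration only.

```python
def normalized_std(sigmas):
    """Min-max scale predicted standard deviations onto [0, 1]."""
    lo, hi = min(sigmas), max(sigmas)
    if hi == lo:
        return [0.0 for _ in sigmas]
    return [(s - lo) / (hi - lo) for s in sigmas]

def split_candidates(names, sigmas, low_cut=0.25, high_cut=0.75):
    """Separate candidates by normalized uncertainty (hypothetical cut-offs)."""
    nsd = normalized_std(sigmas)
    validate = [n for n, v in zip(names, nsd) if v <= low_cut]
    explore = [n for n, v in zip(names, nsd) if v >= high_cut]
    return validate, explore

# Hypothetical predicted standard deviations for four test sequences.
names = ["seq-a", "seq-b", "seq-c", "seq-d"]
sigmas = [0.5, 1.0, 2.5, 3.0]
nsd = normalized_std(sigmas)
validate, explore = split_candidates(names, sigmas)
print(nsd, validate, explore)
```

Scaling within each round makes uncertainties comparable even when the descriptor set or the fitted model changes between rounds, at the cost of losing the absolute magnitude of σ.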
Fig. 3 Overview of α/β-peptides and training design space for the Gaussian process regression workflow. (a) Helical wheel representation of aurein 1.2, which was used as the starting sequence to generate a library of 147 α/β-peptide sequences (used as the initial training set in our study) in previous rational design studies. Amino acid types are color-coded as follows: nonpolar (orange), polar (dark blue), basic (cyan), and acidic (red). “β” refers to amino acids with an additional backbone carbon to increase stability compared to traditional α-amino acids and R denotes side chain. (b) Number of training sequences (out of 147, denoted with dashed line) that contain at least one of each α- (top) and β- (bottom) amino acid. Amino acids (α and β) absent from the training peptides but utilized in this study are boxed at lower right. Amino acids are color coded in the same manner as in the previous panel. (c and d) Distribution of training sequences for the (c) ααβ and (d) ααβαααβ backbones according to their hemolysis (log2(HC10)) and antifungal activity (log2(MIC)). Grey lines indicate constant SI values equal to 10, 20, and 30. The top three SI template peptides for each backbone are shown as orange points. The corresponding amino-acid template sequence is shown with positions corresponding to varying amino acids in red. “—NH2” refers to an amidated C-terminus, whereas the N-terminus is +1-charged (these properties are shared for all peptide sequences). All other training sequences are shown as blue points. log2(HC10) vs. log2(MIC) distributions for the other two backbones used for model training (αααβ and αβαβααβ) are shown in Fig. S9,† and the combined activity distribution for all 4 backbone types is plotted in Fig. S10.† |
Fig. 3b visualizes the occurrence of each α- and β-amino acid in the initial 147 sequence training set (dashed horizontal line) in terms of their polarity: nonpolar (orange), polar (dark blue), charged basic (cyan), and charged acidic (red). Only 13 of the 20 possible canonical α-amino acids were present in the 147-peptide training dataset. Additionally, only five α-amino acids (phenylalanine ‘F’, lysine ‘K’, isoleucine ‘I’, glycine ‘G’, leucine ‘L’) and three β-amino acids (β-homoisoleucine ‘βI’, ACPC ‘X’, β-homophenylalanine ‘βF’) were represented in at least half of the training sequences. Therefore, a large sequence design space remains unexplored, even for short 13-amino-acid peptides. To reduce this design space, we focused on templating and improving on high selectivity (high-SI) peptides in the training set based on amino-acid positions that vary amongst them (red amino acids in Fig. 3c and d), where the higher-SI peptides trend towards the top left of the plots in Fig. 3c and d.
The three highest-SI peptides of the 147 training sequences consisted of the ααβ backbone (indices #241, #239, #231) with SI values ranging from 11.6 to 24.1; we selected these three peptides as template peptides for the ααβ backbone. The sequences of these peptides varied in amino-acid positions 5, 10, and 12, and we additionally chose to vary amino-acid position 1 during the iterative GPR workflow to investigate the effects of N-terminal substitutions on peptide hemolysis and antifungal activity. Previous experimental structure–activity studies have found that substituting the N-terminal ‘G’ in aurein 1.2 with ‘A’ significantly decreased activity against C. albicans,56 whereas a ‘G’ to ‘A’ substitution increased broad-spectrum activity against bacteria.57 The three template ααβ-peptide sequences are shown as orange points in Fig. 3c along with the other 32 training sequences on the ααβ backbone (dark blue points) for reference. The fourth- to sixth-highest-SI peptides in the training set (#133, #029, #131) consisted of the ααβαααβ backbone with SI values ranging from 9.1 to 10.6 (orange points in Fig. 3d); we selected these three peptides as template peptides for the ααβαααβ backbone. We chose to consider amino acid variations in these three peptides (at positions 1, 6, 10, 12) given their high antifungal activity, albeit with lower selectivity compared to the ααβ-peptides. Fig. 3c and d show the two resulting template sequences (for the ααβ and ααβαααβ backbones, respectively) with the four amino-acid positions in each sequence that were varied during the GPR approach indicated in red.
For α positions in the template sequences, we considered the 13 amino acids included in the training data (top histogram in Fig. 3b) plus five natural amino acids (methionine ‘M’, tryptophan ‘W’, tyrosine ‘Y’, asparagine ‘N’, glutamine ‘Q’) and two hydrophobic synthetic amino acids (norvaline ‘Nva’, norleucine ‘Nle’) for a total of 20 possible substitutions. Nva and Nle are extensions of the α-aminobutyric acid (Abu) amino acid present in training peptides and were included based on previous studies suggesting that these amino acids increase membrane binding by promoting positive membrane curvature58 and improve antimicrobial activity by increasing peptide hydrophobicity.59 Three natural amino acids were not considered: proline ‘P’ because it destabilizes AMP folding by inducing kinks in helices,60,61 cysteine ‘C’ because it can form disulfide bridges,62 and histidine ‘H’ because its charge is sensitive to physiological pH.63 For β positions in the template sequences, the β version of the 20 α-amino acids described above plus ACPC ‘X’ were considered for a total of 21 possible substitutions. Given the three α- and one β-amino acid positions varied for each backbone type (red amino acids in Fig. 3c and d), two different design spaces of 168000 (20³ × 21) possible sequences were considered for this study for a total of 336000 sequences. As described in the previous section, ‘iterative GPR workflow implementation’, these design spaces were reduced each round (Fig. 2d, S18 and Table S5†), only considering test sequences for which all descriptor values were within the bounds of training descriptor values (Fig. S7 and S8†).
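The size of each design space follows directly from the combinatorics described above and can be checked by enumeration. The sketch below uses template #241 with α positions 1, 5, and 10 and β position 12; the 13 training α-residues are not enumerated in the text, so the first part of the `ALPHA` list is an assumed placeholder, and the residue labels for β-analogues are hypothetical.

```python
from itertools import product

# First 13 entries: ASSUMED placeholder for the training residues (not listed
# in the text). Remaining 7: the added residues named in the text.
ALPHA = ["A", "D", "E", "F", "G", "I", "K", "L", "R", "S", "T", "V", "Abu",
         "M", "W", "Y", "N", "Q", "Nva", "Nle"]
BETA = ["β" + aa for aa in ALPHA] + ["X"]  # β-analogues plus ACPC ('X')

def enumerate_design_space(template, alpha_positions, beta_position):
    """Yield every sequence obtained by varying the given 1-indexed positions."""
    residues = template.split("-")
    pools = [ALPHA] * len(alpha_positions) + [BETA]
    for choice in product(*pools):
        *alphas, beta = choice
        seq = list(residues)
        for pos, aa in zip(alpha_positions, alphas):
            seq[pos - 1] = aa
        seq[beta_position - 1] = beta
        yield "-".join(seq) + "-NH2"

template_241 = "G-L-X-K-L-βA-K-K-X-I-K-βS-F"
space = set(enumerate_design_space(template_241, [1, 5, 10], 12))
print(len(space), 2 * len(space))
```

Enumerating a space of this size is cheap; the expensive steps are computing descriptors for each candidate and, above all, synthesizing and testing the few that are selected.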
Fig. 5 (a) Summary of GPR prediction results. Amino-acid substitutions relative to any training sequence are in red. The ‘Dif’ column quantifies the number of these substitutions, and the ‘New AA’ column denotes if any substitution is an amino acid not present in the training data for that round. Experimentally measured (Act.), GPR-predicted (Pred.), and their difference are shown for both log2(HC10) and log2(MIC). NSD refers to the normalized GPR standard deviation and ranges from 0 (dark green; low uncertainty) to 1 (dark red; high uncertainty). SI computed from the experimentally measured HC10 and MIC is also shown. (b and c) log2(HC10) vs. log2(MIC) distributions comparing original training (dark blue dots) vs. new peptides added during the iterative workflow (stars) for prediction rounds (b) 1 to 4 and (c) 5 to 6. Grey diagonal lines denote SI bands with values of 10, 20, 30, 40, and 50. All peptides discovered with larger SI values than the original highest-SI peptide from the initial training set (SI = 24.1) are indicated in bold. The horizontal dotted line at log2(HC10) = 12 denotes the upper limit of the hemolysis assay, corresponding to a peptide concentration of 4096 μg mL−1. The test design space and corresponding standard deviation ranges for all GPR prediction results are visualized in Fig. S18.† |
Fig. 4b and c show parity plots for log2(HC10) and log2(MIC) predictions for the final GPR model used for round 6, which not only visualize the final prediction accuracy, but also highlight the novel sequences introduced (red triangles) and their distributions of log2(HC10) and log2(MIC) values compared to the original 147 training sequences (blue open circles). As part of our evaluation of model robustness across prediction rounds (Fig. S14–S16†), we also probed the predictive power of the GPR models through y-randomization (Fig. S16†),69,70 which compares models built on randomly shuffled labels with the model built on the true labels. Overall, these results verified that the predictions made by the model did not arise by chance, thereby validating the rigor of our descriptor and GPR model selection approach.
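The idea behind y-randomization can be sketched with a deliberately simple stand-in model. Here ordinary least squares on a single synthetic descriptor replaces GPR, purely to keep the example self-contained; the data, the number of shuffles, and the function names are hypothetical.

```python
import random

def r_squared(x, y):
    """R² of a one-variable least-squares fit of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    ss_res = sum((b - (my + slope * (a - mx))) ** 2 for a, b in zip(x, y))
    return 1.0 - ss_res / syy

def y_randomization(x, y, n_shuffles=50, seed=0):
    """Return R² on the true labels and R² values on shuffled labels."""
    rng = random.Random(seed)
    shuffled_scores = []
    for _ in range(n_shuffles):
        shuffled = list(y)
        rng.shuffle(shuffled)
        shuffled_scores.append(r_squared(x, shuffled))
    return r_squared(x, y), shuffled_scores

# Synthetic data: a linear trend plus a small deterministic perturbation.
x = [float(i) for i in range(30)]
y = [2.0 * xi + 0.3 * (((7 * i) % 5) - 2) for i, xi in enumerate(x)]
true_r2, shuffled_r2 = y_randomization(x, y)
print(round(true_r2, 4), round(max(shuffled_r2), 4))
```

A model that has learned a real relationship scores far better on the true labels than on any shuffled copy; comparable scores would indicate that the apparent accuracy is an artifact.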
To further support the ability of the GPR model to capture relationships between input physicochemical properties of new amino acids and target activity measurements, we calculated the NSD distribution of each newly introduced amino acid in rounds 1 to 3 since the set of descriptors for HC10 and MIC predictions remained the same for these prediction rounds (Tables S1 and S2†). Bar and whisker plots of these NSD distributions are plotted in Fig. S17† for the Nle, ‘Q’, Nva, ‘βY,’ and ‘W’ amino acids, which show that, in general, there was a large decrease in the average NSD (therefore, a decrease in prediction uncertainty) for these amino acids in prediction rounds following their introduction. Interestingly, Nva (Fig. S17c†) largely followed the same trends as Nle (Fig. S17a†), despite not being introduced prior to the changes observed in NSD. This observation supports the model's ability to transfer learned information from amino acids present in the training sequences to new amino acids that have similar chemical structures, and highlights the advantage of 2D molecular descriptors over more traditional data representations for peptide sequences such as one-hot or orthogonal encoding,37,71 which would distinguish Nva and Nle as separate amino acids with no intuition for their chemical similarity.
In rounds 5 and 6, the more antifungal ααβαααβ backbone was explored as the template (Fig. 3d) using the same workflow as in rounds 1 to 4. Switching the design space in this way served both to evaluate the transferability of our approach and to obtain higher-SI peptides. This change enabled us to use the large amount of low hemolysis (high log2(HC10)) training data from rounds 1 to 4 (Fig. 5a) to probe whether the model could be applied to a different peptide backbone with inherently higher antifungal activities (lower log2(MIC)) to find new high-SI peptides. Although there was an initial increase in the overall prediction RMSE in round 5 with the switch in backbone template from ααβ to ααβαααβ and selection of 2 test sequences with moderate uncertainty (5-2 and 5-3, Fig. 5a), the selection of 4 low uncertainty sequences to probe new high-SI peptides in round 6 led to the lowest prediction RMSE across all prediction rounds (Fig. S13†). Interestingly, this approach was successful in discovering four additional high-SI peptides (5-4, 6-1, 6-3, 6-4; 27.8 < SI < 57.1) with high prediction accuracy and low uncertainty (low NSD) while drawing on learned trends from previous prediction rounds on the ααβ backbone (Fig. 5c). For instance, as seen in peptides 4-3 and 4-4, 1st position cationic (K, R) amino acids tend to produce peptides with low hemolysis (high log2(HC10)) and good selectivity (high SI), which was reflected by many high-SI ααβαααβ candidate sequences in rounds 5 and 6 (5-4, 6-2, 6-3, 6-4) with a 1st position ‘K.’ Additionally, drawing on the ability of peptide 4-1 to increase selectivity with a cationic amino-acid substitution on the hydrophobic face of the ααβ backbone, an arginine (R) substitution on the hydrophobic face of a high-SI ααβαααβ template (#131) was predicted and experimentally validated, resulting in an over 6-fold improvement in selectivity (6-4 in Fig. 6). With an SI of 57.1, peptide 6-4 was the highest-SI peptide identified in this study and showcases a substantial increase in antifungal selectivity compared to the training data. Overall, this approach was able to discover 13 novel high-SI peptides (11.0 < SI < 57.1) with comparable or higher SIs than the template peptides (9.08 < SI < 24.1) (Fig. 5a). We further note that our SIs are higher than, or compare favorably with, the SIs of other antifungal AMP mimics reported in the literature (often around 16 for hRBC HC10 over C. albicans MIC),26,73–75 as well as certain clinically used small molecules such as amphotericin B (which has an SI of 1 (ref. 76 and 77)). Additional studies in animal models of fungal infection will be necessary to determine the extent to which these substantial increases in SI may also translate to therapeutic significance in different pre-clinical contexts.
The number of test sequences considered per round (between 8764 and 17238 sequences depending on prediction round, see Table S5†) is sufficient to explore large variations in amino acids and motifs as demonstrated in Fig. 5a, given that this design space is approximately 2 orders of magnitude larger than the initial peptide training set (147 sequences). We also note that the number of α/β-peptides that has been characterized experimentally is significantly smaller than that of α-peptides, such that this design space represents a substantial number of novel sequences. Regarding how the design space could be enlarged further, we note that the decrease in the number of considered test peptide sequences (Table S5,† ‘total’ column) is largely a result of the increasing number of descriptors kept with LASSO CV from rounds 1–6 (Fig. S3†) because we require each potential test sequence to have all descriptor values fall within the lower and upper bounds of the training set per round. Therefore, although it is out of the scope of this study, a potential avenue for increasing the possible number of test sequences would be to truncate the number of descriptors kept with LASSO CV (e.g., keeping only the top 20 descriptors by LASSO coefficient weights), particularly since the average RMSE for 10-fold LASSO CV only slightly increases when fewer descriptors are used (Fig. S3†). For instance, truncating the descriptor set to the top 20 descriptors increases the design space for round 6 for HC10 predictions from 10111 to 34419 sequences and for MIC predictions from 19057 to 43650 sequences while maintaining similar cross-validation prediction accuracies compared to the full descriptor set. Because our present approach successfully identified new peptides with improved selectivity indices that are relevant in potential therapeutic contexts (as discussed above), we do not foresee this approach as necessary, but one could utilize this method (or similar methods to relax the constraints on the number of descriptors) to expand the design space in future work or other related applications.
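The interaction between the bounds requirement and the size of the descriptor set can be sketched as follows. A candidate passes only if every retained descriptor lies within the training range, so keeping fewer descriptors can only admit more candidates. The descriptor names, coefficients, and values are hypothetical, and ranking by absolute coefficient is one assumed reading of "top descriptors by LASSO coefficient weights".

```python
def training_bounds(rows):
    """Per-descriptor (min, max) over the training rows (list of dicts)."""
    return {n: (min(r[n] for r in rows), max(r[n] for r in rows)) for n in rows[0]}

def within_bounds(candidate, bounds, names):
    """True if every listed descriptor lies inside the training range."""
    return all(bounds[n][0] <= candidate[n] <= bounds[n][1] for n in names)

def top_k_descriptors(coefs, k):
    """Names of the k descriptors with the largest absolute coefficient."""
    return sorted(coefs, key=lambda n: abs(coefs[n]), reverse=True)[:k]

# Hypothetical training rows, coefficients, and test candidates.
train = [{"d1": 1.0, "d2": 10.0, "d3": 0.0}, {"d1": 3.0, "d2": 20.0, "d3": 1.0}]
coefs = {"d1": 0.9, "d2": -0.5, "d3": 0.01}
tests = [
    {"d1": 2.0, "d2": 15.0, "d3": 0.5},  # inside every range
    {"d1": 2.0, "d2": 15.0, "d3": 4.0},  # outside only on the weak descriptor d3
    {"d1": 5.0, "d2": 15.0, "d3": 0.5},  # outside on the strong descriptor d1
]
bounds = training_bounds(train)
n_full = sum(within_bounds(t, bounds, list(coefs)) for t in tests)
n_top2 = sum(within_bounds(t, bounds, top_k_descriptors(coefs, 2)) for t in tests)
print(n_full, n_top2)
```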
Overall, these results demonstrate that an iterative ML approach can successfully discover high-SI peptides through substitutions that would not have been rationally anticipated. The advantage of the SMILES representation for peptide sequences is that it is agnostic to the backbone type (combination of α and β amino acids), thereby promoting the transferability of the model across backbones in this study and, potentially, to other synthetic motifs (e.g., γ-amino acids with two additional backbone carbons compared to α-amino acids78) in the future.
Peptide type | Peptide Idx | Sequence | αβ motif | Retention time (min ± SD) | Charge | [θ] 100% TFE ± SD | [θ] 15% TFE ± SD | Helical rigidity (% ± SD) | Measured molecular weight (Da) |
---|---|---|---|---|---|---|---|---|---|
Wild type | Aurein 1.2 | G-L-F-D-I-I-K-K-I-A-E-S-F-NH2 | α | 23.93 ± 0.01 | 1 | −16.8 ± 0.4 | −17.1 ± 0.1 | 102 ± 2 | 1479.79 |
Template peptides | #241 | G-L-X-K-L-βA-K-K-X-I-K-βS-F-NH2 | ααβ | 15.45 ± 0.01 | 5 | −27.2 ± 1.7 | −9.1 ± 1.1 | 34 ± 5 | 1481.94 |
Template peptides | #239 | G-L-X-K-L-βA-K-K-X-L-K-βS-F-NH2 | ααβ | 16.11 ± 0.01 | 5 | −27.4 ± 2.9 | −10.5 ± 0.2 | 38 ± 4 | 1481.94 |
Template peptides | #231 | G-L-X-K-I-βA-K-K-X-I-K-βA-F-NH2 | ααβ | 14.95 ± 0.01 | 5 | −20.5 ± 1.3 | −12.3 ± 1.6 | 60 ± 9 | 1465.94 |
Template peptides | #133 | G-X-F-K-I-βA-K-K-X-A-K-L-βF-NH2 | ααβαααβ | 16.61 ± 0.01 | 5 | −30.9 ± 1.1 | −14.2 ± 2.0 | 46 ± 7 | 1499.96 |
Template peptides | #131 | K-X-F-K-I-βA-K-K-X-V-K-L-βF-NH2 | ααβαααβ | 15.55 ± 0.01 | 6 | −26.2 ± 0.8 | −11.6 ± 1.4 | 44 ± 5 | 1599.13 |
Template peptides | #29 | G-X-F-K-I-βI-K-K-X-A-K-S-βF-NH2 | ααβαααβ | 18.17 ± 0.01 | 5 | −35.2 ± 1.6 | −13.1 ± 0.9 | 37 ± 3 | 1515.96 |
Newly discovered peptides | 4-1 | G-L-X-K-V-βA-K-K-X-I-K-βK-F-NH2 | ααβ | 13.50 ± 0.02 | 6 | −23.1 ± 0.8 | −7.2 ± 0.6 | 31 ± 3 | 1509.01 |
Newly discovered peptides | 5-4 | K-X-F-K-I-βA-K-K-X-V-K-T-βF-NH2 | ααβαααβ | 14.02 ± 0.04 | 6 | −36.0 ± 0.6 | −8.5 ± 1.6 | 24 ± 4 | 1587.06 |
Newly discovered peptides | 6-1 | G-X-F-K-I-βA-K-K-X-V-K-S-βF-NH2 | ααβαααβ | 15.38 ± 0.08 | 5 | −33.7 ± 1.0 | −8.5 ± 0.9 | 25 ± 3 | 1501.96 |
Newly discovered peptides | 6-4 | K-X-F-K-I-βA-K-K-X-V-K-R-βF-NH2 | ααβαααβ | 13.33 ± 0.04 | 7 | −27.6 ± 2.2 | −7.3 ± 1.7 | 27 ± 6 | 1642.10 |
Generally, decreases in hydrophobicity are correlated with decreased mammalian cell toxicity,80–82 which, in turn, leads to higher selectivity. Supporting these trends, newly discovered high-SI peptides exhibited significantly decreased hydrophobicity in all cases compared to their respective high-SI templates (i.e., RT of 13–15 min compared to 15–18 min for ααβαααβ backbones). When compared to aurein 1.2, the differences were even greater, with a reduction in RT of at least 8.5 min, illustrating the importance of low hydrophobicity for antifungal selectivity. Similarly, the helical rigidities of newly discovered high-SI peptides were lower than those of the template peptides and aurein 1.2 (Table 1 and Fig. 7b). Helical rigidity is quantified as the ratio of the molar ellipticity of a peptide (minimum amplitude [θ]) in 15% trifluoroethanol (TFE) to that in 100% TFE, as measured by circular dichroism (CD); it describes the ability of a peptide to remain helical in aqueous conditions in comparison to its maximum helical conformation, which TFE is known to induce.83 The amplitude at the minimum wavelength was utilized as a measurement of helicity to remain consistent with prior structural studies coupling crystallography and NMR with CD.17,84 TFE is often used to mimic the microbial cell membrane environment,26,79,85 while 15% TFE in water has been used to obtain quantifiable differences in low-helicity peptides in aqueous conditions.9,19 As with high hydrophobicity, high helical rigidity is associated with toxicity against human cells.19,26,86–93 For example, aurein 1.2 presented classic α-helical CD spectra in both aqueous and organic solvent conditions (15% and 100% TFE, Fig. 7b), with a high helical rigidity of 102 ± 2% (Table 1).
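As a brief illustration of the helical rigidity definition above, the following minimal Python sketch recomputes the ratio [θ]15/[θ]100 from the mean ellipticities listed in Table 1 for a few peptides. The function name `helical_rigidity` and the dictionary layout are illustrative assumptions, not part of the original analysis. Ratios of the tabulated means can differ by about one percentage point from the reported rigidities, presumably because the reported values were averaged over replicates (an assumption on our part).

```python
# Mean minimum molar ellipticities transcribed from Table 1:
# name -> ([θ] in 100% TFE, [θ] in 15% TFE)
TABLE_1 = {
    "Aurein 1.2": (-16.8, -17.1),
    "#241": (-27.2, -9.1),
    "4-1": (-23.1, -7.2),
    "6-4": (-27.6, -7.3),
}


def helical_rigidity(theta_100, theta_15):
    """Return helical rigidity (%) as the ratio [θ]15 / [θ]100.

    Hypothetical helper: it only restates the ratio defined in the text.
    """
    if theta_100 == 0:
        raise ValueError("[θ] in 100% TFE must be non-zero")
    return 100.0 * theta_15 / theta_100


rigidities = {
    name: helical_rigidity(t100, t15) for name, (t100, t15) in TABLE_1.items()
}

for name, pct in rigidities.items():
    print(f"{name}: {pct:.0f}% helical rigidity")
```

A value near 100% indicates a peptide that is almost as helical in 15% TFE as in 100% TFE, whereas lower values indicate a peptide that is largely unfolded in mostly aqueous conditions and becomes helical only in a helix-inducing environment.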
Both template and newly discovered high-SI peptides exhibited CD curves characteristic of helical α/β-peptides,9,13,17 with lower helicity in aqueous conditions (red line, [θ]15) compared to those in organic conditions (black line, [θ]100) (Table 1, Fig. 7b and S24†). While the presence of an additional carbon in the β-amino acids of α/β-peptides can alter the arrangement of the amide chromophores and cause deviations from the classic α-helical CD spectra, high-resolution structural data coupled with CD analysis of previously studied helical α/β-peptides17,84 suggest that the newly discovered peptides in this study also likely fold in a manner similar to aurein 1.2, but with significantly reduced helical rigidities. These results support the use of α/β-peptides as a strategy to enhance the antimicrobial selectivity of naturally sourced AMPs without compromising their side-chain presentation pattern.
Newly discovered peptides from our approach were also less helical than the template peptides regardless of backbone type. For example, peptide 4-1 (ααβ-motif) exhibited around 31% helical rigidity while ααβ template peptides had 34–60%; similarly, peptides 5-4, 6-1, and 6-4 (ααβαααβ-motif) exhibited helical rigidities between 24 and 27%, while ααβαααβ template peptides had 37–46%. This decrease in helical rigidity can be attributed largely to the significant reduction in the helicity of peptides in aqueous conditions ([θ]15) rather than changes in helicity in organic conditions ([θ]100) (Fig. 7b and S25†). Similar trends have been observed for homogeneous α-amino acid-containing antifungal26 and antibacterial AMPs,86–89 where the ability of the peptides to undergo a non-helical-to-helical transition upon exposure to helix-inducing conditions (such as at the cell membrane interface, or in TFE) was a critical contributor to selectivity, predominantly driven by the high correlation between helicity and hemolysis.90–93 Overall, these results suggest that the iterative GPR approach discovers higher selectivity peptides largely by decreasing their hydrophobicity and increasing their helical flexibility.
Finally, we sought to quantify more precisely the antifungal activities of lead peptides against C. albicans, the original target of our model, to better understand the extent of improvement in selectivity achieved using our approach. Here, we note that, in all the determinations of MIC in the studies above, we used two-fold serial dilutions to determine the lowest concentration inhibiting over 90% of microbial growth. While conventional, this method resolves activity only to the nearest tested concentration, so the true antimicrobial activity may lie between two of the discrete concentrations that are actually tested. This loss of resolution is exacerbated at higher concentrations (e.g., the interval between MICs of 64 and 128 μg mL−1 conceals much larger differences in activity than the interval between 1 μg mL−1 and 2 μg mL−1). Because MIC is used to calculate SI, reported SI values may underestimate the improvement in selectivity resulting from our approach. Therefore, we measured the MICs of aurein 1.2 and high-SI test peptides at smaller intervals of 5 μg mL−1 (sMIC), as opposed to two-fold serial dilutions (Fig. 7d and S27†), and calculated corresponding ‘smaller interval SIs’ (sSI = HC10/sMIC). As expected, all peptides tested, including aurein 1.2, exhibited higher sSI than SI, with 4-1 showing the largest increase. Compared to aurein 1.2, which had an sSI of 1.2, all four newly discovered high-SI peptides demonstrated significantly improved selectivities, ranging from 43.0 to 60.9, indicating up to a 52-fold enhancement in selectivity.
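The effect of the dilution scheme on the reported selectivity can be illustrated with a minimal Python sketch. It assumes an idealized assay in which growth is inhibited at every tested concentration at or above a "true" MIC. The numerical values of `hc10` and `true_mic` are hypothetical and chosen only for illustration, and the function names are ours, not part of the original workflow.

```python
import math


def mic_twofold(true_mic, lowest=1.0):
    """Lowest concentration on a two-fold series (lowest, 2*lowest, ...)
    that is at or above the true MIC (idealized assay assumption)."""
    conc = lowest
    while conc < true_mic:
        conc *= 2
    return conc


def mic_fixed_step(true_mic, step=5.0):
    """Lowest multiple of `step` at or above the true MIC,
    mirroring the 5 μg/mL intervals used for sMIC."""
    return step * math.ceil(true_mic / step)


def selectivity_index(hc10, mic):
    """Selectivity index as HC10 divided by the (s)MIC."""
    return hc10 / mic


# Hypothetical example values in μg/mL (not measured data).
hc10 = 1000.0
true_mic = 70.0

mic = mic_twofold(true_mic)       # reported MIC on the two-fold series
smic = mic_fixed_step(true_mic)   # reported sMIC on the 5 μg/mL grid

si = selectivity_index(hc10, mic)
ssi = selectivity_index(hc10, smic)

print(f"MIC = {mic:g}, sMIC = {smic:g}")
print(f"SI = {si:.1f}, sSI = {ssi:.1f}")
```

In this hypothetical case the two-fold series rounds a true MIC of 70 μg mL−1 up to 128 μg mL−1, whereas the 5 μg mL−1 grid reports 70 μg mL−1, so the sSI is nearly twice the SI even though the underlying activity is unchanged.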
Peptides 4-1 and 6-4, which contain cationic amino acids on their hydrophobic faces, demonstrated the highest sSI despite different backbones (ααβ for 4-1, ααβαααβ for 6-4). These results further support the idea that the unconventional cationic amino acid substitution in the hydrophobic face of α/β-peptides could be a productive strategy to enhance their antifungal selectivity. While unintuitive from a rational design standpoint, similar substitutions, denoted as “specificity determinants” by Hodges and coworkers,94–96 have been reported previously to increase the selectivity of AMPs29 and D-enantiomer analogues94–96 for antibacterial applications, where (1) increases in charge, (2) decreases in hydrophobicity, and (3) decreases in helical rigidity led to a drastic reduction of hemolysis and, in turn, enhanced antibacterial selectivity.94–96 These studies mainly attributed the enhanced selectivity of these AMPs to their altered membrane-active action, citing differences between prokaryotic and eukaryotic membranes, such as net charge, in the context of reduced helical rigidity of peptides. We observed in this study that C. albicans selectivity was also enhanced through a comparable amino-acid substitution, even though C. albicans is eukaryotic. A potential explanation for this is that the higher content of anionic phospholipids in C. albicans cell membranes compared to those of red blood cells (around 20% difference)97,98 may have similarly contributed to the observed increase in selectivity. However, considering that C. albicans cells contain many other components different from human red blood cells, such as the presence of a cell wall97 and different sterols in cell membranes (ergosterol in fungi vs. cholesterol in hRBCs),97 further mechanistic studies on interactions between α/β-peptides and cell membranes are needed. 
Nevertheless, these results suggest that substitutions of cationic amino acids into the more hydrophobic face of α/β-peptides can be a strategy to improve their antifungal selectivity. A potential next step to enhance the antifungal selectivity of highly antifungal but hemolytic α/β-peptides could involve exploring cationic-amino acid substitutions at other positions in their hydrophobic faces, which can be greatly accelerated by iterative GPR.
Experimental characterization of newly discovered high-SI peptides demonstrated that they exhibit physicochemical properties typical of other highly selective natural AMPs, such as high cationic charge, low hydrophobicity, and low helical rigidity. Furthermore, the most selective α/β-peptides for both α/β-backbones contain unconventional cationic amino acid substitutions near the hydrophobic face, suggesting that incorporating such substitutions could be a potential strategy to enhance antifungal selectivity. This supports the conclusion that our approach, based on only sequence and activity data, successfully discovers highly selective peptides that would be far less likely to be identified through conventional rational design. Further understanding the impact of these substitutions on interactions with cell membranes to provide mechanistic insight into peptide activity (e.g., by simulating AMP-induced membrane disruption with computational modelling techniques99) will be a subject of future work. Given the challenges in physically interpreting the descriptors used in this work (Section S9†), physicochemical descriptors extracted from molecular dynamics simulations of membrane binding and pore formation100 by model α/β-AMPs (including descriptors related to changes in hydrophobicity and helicity), for instance, could directly support experimental HC10 and MIC measurements in this study.
Overall, our approach circumvents the need for large sets of low-throughput and costly experimental physicochemical data while maintaining good prediction accuracy. Additionally, since the SMILES data representation captures the structural complexities of both backbone and side chain elements in a compound-agnostic manner, such a workflow is transferable to other classes of antimicrobial peptides and mimetics, including, but not limited to, peptides containing γ-amino acids78 or hydrocarbon staples.101 We anticipate that our approach, which enabled the discovery of highly selective α/β-peptides with potential for antifungal therapy, can similarly stimulate the discovery of other highly selective peptidomimetics.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4sc06689h
‡ These authors contributed equally.
This journal is © The Royal Society of Chemistry 2025