Samuel Renaud and
Rachael A. Mansbach
*
Physics Department, Concordia University, 7141 Sherbrooke Street West, Montreal, Quebec H4B1R6, Canada. E-mail: re.mansbach@concordia.ca
First published on 16th February 2023
Current antibacterial treatments cannot overcome the rapidly growing resistance of bacteria to antibiotic drugs, and novel treatment methods are required. One option is the development of new antimicrobial peptides (AMPs), to which bacterial resistance build-up is comparatively slow. Deep generative models have recently emerged as a powerful method for generating novel therapeutic candidates from existing datasets; however, there has been less research focused on evaluating the search spaces associated with these generators from which they sample their new datapoints. In this research we employ five deep learning model architectures for de novo generation of antimicrobial peptide sequences and assess the properties of their associated latent spaces. We train an RNN, an RNN with attention, a WAE, an AAE, and a Transformer model and compare their abilities to construct desirable latent spaces in 32, 64, and 128 dimensions. We assess reconstruction accuracy, generative capability, and model interpretability and demonstrate that while most models are able to partition their latent spaces into regions of low and high AMP sampling probability, they do so in different manners and by appealing to different underlying physicochemical properties. In this way we demonstrate several benchmarks that must be considered for such models and suggest that for optimization of search space properties, an ensemble methodology is most appropriate for design of new AMPs. We design an AMP discovery pipeline and present candidate sequences and properties from three models that achieved high benchmark scores. Overall, by tuning models and their accompanying latent spaces properly, targeted sampling of new antimicrobial peptides with ideal characteristics is achievable.
One tool of growing importance in the arsenal of the biomolecular designer is generative deep learning, a subset of deep learning.14 Deep learning, in which a computer optimizes the parameterization of a neural network model in a data-rich regime, has become a powerful player in recent years by leveraging the increasing amounts of scientific data and the phenomenal growth in computational power over the last decades. Neural networks are formed by connecting simple functional units known as “perceptrons” through their inputs and outputs to form complex function approximators. Non-linear activation functions between the linear operations increase model representational capacity at the cost of having to solve a non-convex optimisation task.15 Generative deep learning specifically refers to the use of a model trained from a large body of data to sample new datapoints from an underlying probability distribution intended to mimic as closely as possible that of the training data. One particular variety of generative deep learning model demonstrating competitive performance in the sphere of biomolecular design is the variational auto-encoder (VAE) and its variants,16,17 in which data points are explicitly embedded in a smooth, continuous “latent space” that can function as a search space for optimization.18
In the past five years, there has been an explosion of research in the field of deep generative models for small molecule drug design.18–22 Most work in this area has, until now, focused on the design of small molecules, with only a few studies on the design of short peptides,23–25 but in the last year or two research in this area has begun to grow.23 For example, Das et al.23 constructed a Wasserstein Autoencoder (WAE) model that formed a complete pipeline for de novo protein design and identified two sequences with potent antimicrobial activity against E. coli. In another paper by the same team,24 they created a tool (IVELS) for analysing VAE models throughout the training process, with which they could select the best-performing models. Two other relevant recent studies have reported on the use of a Long Short-Term Memory (LSTM) model to design peptides with experimentally-verified activity against multi-drug resistant pathogens, and the use of generative Recurrent Neural Networks (RNNs) to design non-hemolytic AMPs.26,27
One primary point of interest that is beginning to attract attention is the quality of the latent space itself. Because in data-rich regimes, VAEs and other generative models can be extremely powerful, the focus in past work has been primarily on improving performance.19,28–32 However, construction of an interpretable and well-organized latent space not only tends to make targeted feature sample generation easier, it also significantly improves the ability of the user to apply domain expertise to the problem at hand. Therefore, although we do not ignore model performance, a major focus of this article is on interpretability, trustworthiness, and organization of the latent spaces themselves.
In work by Gómez-Bombarelli et al.,18,31,33 the authors demonstrated that training a property predictor on the latent space and forcing the model to consider the quality of its predictions as part of its overall goal leads to an ordering of the latent space, which is desirable for generating candidates with particular features. Because latent spaces tend towards dimensionalities on the order of 30 or above, this orderedness has been visually presented through projection onto a two-dimensional space, often by using principal components analysis (PCA) to identify the most relevant variables. Since PCA is a linear projection, it remains unclear how well this visual orderedness is preserved in the high-dimensional space, and therefore we make a point to address the trustworthiness of this simple, rapid, and useful analytic technique.
While previous methods have focused on the longer proteins that represent a larger fraction of the known proteome, this research focuses on small antimicrobial peptides (AMPs). Much of the previous work discovering protein embeddings with deep neural networks has used large latent space representations34–36 to maximize data throughput, or graph-based representations, which require the use of graph neural networks to process the protein graphs. In this work we emphasize small latent representations and model interpretability in order to construct interpretable search spaces for AMP design.
More generally, in this article, we focus on the question of latent space quality and interpretability for the relevant test case of generation of novel antimicrobial peptide sequences. We train five different deep generative models with VAE-like latent spaces and assess and compare their different behaviors in terms of reconstruction accuracy, generative capacity, and interpretability. We demonstrate that, as expected, a property predictor leads to an ordering in the latent space. We quantify this ordering and assess how well a PCA projection captures the properties of the original latent space. We argue that the better the PCA projection, the more interpretable the latent space, because we can apply a ranking to linear combinations of the latent space dimensions, allowing us to more easily identify bridge variables that tell us what the model considers important. We also show that the models are capable of generating unique and diverse sequences that grow more AMP-like in specific regions of the latent space.
Out of the total 45 120 peptides in the StarPep database, 13 437 are labelled as having antimicrobial properties (Fig. S2†). StarPep aggregates bioactive peptides from over 40 existing datasets featuring peptides with sequence lengths between 2 and 100 amino acids. After removing from the dataset the 15% of the peptides featured in StarPep with sequence lengths greater than 50 amino acids, as well as peptides with non-standard amino acids, we retained 35 806 of the original 45 120 sequences, of which 10 841 are labelled as having antimicrobial properties. The remaining 24 965 peptides from StarPep did not carry the antimicrobial label.
Although certain deep learning algorithms can perform well when trained on “smaller” datasets of tens of thousands of datapoints,26,27 these are typically classifiers rather than generators. When we trained our models on the 35 806 datapoints from the StarPep database alone, we found that the models demonstrated poor reconstruction accuracy, low robustness, and high error. One method for solving this is to pretrain on a large corpus and fine-tune on a smaller one; however, we chose instead to train all at once on a larger corpus, as has previously been done for natural language models and chemical language models.18–22
Since preliminary analysis on the StarPep database sequences indicated that 35 806 was not a large enough dataset to properly train our generative VAE-based models, we expanded the dataset by adding negative examples from Uniprot. After executing the query [(length:[2 TO 50]) AND (keyword:(KW-0929))], we found 714 unreviewed peptide sequences with the label “antimicrobial” and, placing “NOT” before the keyword, found 3 713 736 unreviewed peptide sequences without the label “antimicrobial”. We also found 1108 reviewed peptide sequences with the label “antimicrobial” and 11 941 reviewed peptides without the label “antimicrobial”. Since we would not have been able to significantly expand our dataset by using solely reviewed sequences, we chose to include some non-reviewed sequences. We recognize that in the worst case this may include a potentially substantial number of AMPs that have not been identified as such. We considered the possibility of utilizing an existing sequence predictor for AMPs to identify these possibly unlabeled AMPs; however, we decided against it to avoid introducing unknown assumptions into the data at this stage. Instead we consider a different goal for ordering the space. We employ a property predictor that classifies as hits peptides with experimentally-verified antimicrobial properties (which we label with integer 1) and as misses peptides without experimentally-verified antimicrobial properties (which we label with integer 0) and demonstrate that training such a predictor enforces an ordering of the space that allows generation of AMP-like sequences.
To avoid having a prohibitively small percentage of experimentally-verified AMPs in the dataset and to match the sizes of datasets used in previous experiments with similar architectures, we subsampled 10% of the 3 million unreviewed datapoints from Uniprot and selected 268 195 to form an additional set of datapoints. Subsampling was done at random, subject to the constraint of retaining a roughly equal number of peptide sequences of each length to ameliorate the extent to which the models focused on this aspect, though there are fewer very short sequences due to combinatorial constraints (Fig. S9†).
Together the 268 195 peptides from Uniprot and the 35 806 from StarPep form our full dataset of 304 001 peptides, of which 10 841 were labeled “verified AMPs” (all from StarPep) and 293 160 were labeled “non-verified-AMPs” (a mixture of Uniprot and StarPep sequences). We note here again that we only labeled as AMPs those sequences that have been experimentally verified as such. Although this could introduce a bias into the property predictor from the unreviewed sequences, we believe that this choice is justified because our goal is not to classify sequences but to order the space in a sensible manner. After ensuring there was no sequence redundancy between the merged datasets, we inspected various physicochemical properties of the “verified AMP” and “non-verified-AMP” labeled data sets to ensure they represented a broad distribution over peptide sequence space and had largely similar distributions of various physicochemical properties (Fig. S10–S19†).
We investigated five different deep learning architectures applied to the task of generating new peptide sequences from a learned distribution, all of which incorporate a variational autoencoder (VAE) element, along with a VAE-like latent space.
The variational autoencoder is a model introduced in Kingma and Welling16 that combines an “encoder” and a “decoder” to form a variational inference model. Specifically, given a dataset with a marginal likelihood pθ(x), the objective of the VAE is to learn the parameter set {θ} that most closely reproduces the data's distribution p(x).17,40,41 VAEs operate under an evidence lower bound (ELBO) maximization objective that leads to a joint maximisation of the marginal likelihood pθ(x) and a minimization of the Kullback–Leibler (KL) divergence of the encoder-approximated posterior qϕ(z|x) from the true posterior pθ(z|x), where z represents the embedding variables of the latent space and pθ(z) is the latent space prior, which for the standard VAE is a Gaussian distribution.
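In its standard form, the ELBO referred to above can be written as

$$\log p_\theta(x) \;\geq\; \mathcal{L}(\theta,\phi;x) \;=\; \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right),$$

where the first term rewards faithful reconstruction of the input and the second term regularizes the approximate posterior toward the prior.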
There are multiple types of generative models; the two most well-known and often-used are VAEs and Generative Adversarial Networks (GANs). We choose to employ VAE-based models because VAEs boast a number of properties of particular relevance for our specific case of developing smooth, interpretable latent spaces with potential use as search spaces for computer-aided biomolecular design. In comparison to GANs, VAEs have a more natural formulation of an associated latent space, with a rigorous mathematical derivation from Bayesian probability theory.17 The assumption of a Gaussian prior enforces a certain level of smoothness and continuity that is desirable in a design space, and allows for the sampling of Gaussian distributions “near” any defined point in the space. Finally, on a practical note, they have demonstrated utility in the field of de novo small molecule design,18 and we wanted to determine their utility for sequence-based AMP design as well. VAEs also feature an encoder that can directly map new samples of interest to their respective latent vectors, a feature not present out of the box in GAN architectures.
The architectures in this research all feature a latent space comparable to that of a VAE, which in theory should constitute the minimal explanation of the data. It is also desirable for the latent space to be ordered in such a way that it is understandable or interpretable for a subject-matter expert. The latent space provides a window into the model's mapping of inputs onto the prior distribution, and it allows for visualisation of embeddings and feature clustering of proteins.
In what follows, we introduce the VAE models and their unique differences. Briefly, the RNN, RNN-Attention and Trans-VAE models all employ the typical VAE bottleneck with mean (μ) and log-variance (σ) network layers, while the AAE and WAE employ distinct loss functions which regularize the latent space to match a set distribution.
We started with three publicly available models modified from work by Dollar et al.,28 in which the authors compared a recurrent neural network (RNN), an RNN with an attention layer, and a Transformer to assess their respective capabilities for generating SMILES strings describing novel molecular compounds for drug design. The models all made use of a VAE architecture that generated smooth latent spaces. The results demonstrated the benefits of adding self-attention to deep generative models, and the authors concluded that there is a trade-off in variational models between their ability to properly reconstruct input data and their ability to embed complex information about the inputs into a continuous latent space. We modified these three models to take as input and output sequences of up to fifty amino acids rather than SMILES strings. In addition, we derived two more models, an adversarial autoencoder (AAE)42 and a Wasserstein autoencoder (WAE),43 by modifying the latent spaces and architectures of the original three.
The RNN model (Fig. 1A) is a traditional VAE featuring a Gated Recurrent Unit (GRU)-based Recurrent Neural Network (RNN) as both the encoder and the decoder.44 The RNN model first embeds the input sequences into fixed-length vectors of length 128. The embedded vectors are then passed to the GRU, with N = 3 layers, which generates a hidden layer vector of length 128. The hidden layer is layer-normalized and sent to two separate linear layers, the mean layer (μ) and the log-variance layer (σ). The linear layers act as the “bottleneck” and change, where necessary, the GRU output shape to the appropriate latent space vector shape of either 32, 64 or 128 dimensions, depending on which architecture is being trained. The mean and log-variance outputs are then combined according to the reparametrization trick and Gaussian noise is added.16 A ReLU unit followed by a layernorm is then applied to the noisy memories. The resulting vector is sent to the decoder GRU. The decoder outputs a 128-dimensional hidden vector that passes through a layernorm. Finally, this hidden vector is sent through a linear layer that generates token predictions. All models except the Transformer implement a GRU with teacher forcing during training. Teacher forcing passes the true sequence to the decoder along with the latent vector and improves training stability by removing the requirement of learning entire sequences, instead preventing the accumulation of errors on individual tokens.
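To make the bottleneck described above concrete, the following is a minimal PyTorch-style sketch of the mean/log-variance layers and the reparametrization trick; the layer names and sizes here are illustrative and are not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class VAEBottleneck(nn.Module):
    """Illustrative mean/log-variance bottleneck with the reparametrization trick."""
    def __init__(self, hidden_dim=128, latent_dim=64):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, latent_dim)       # mean layer
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)   # log-variance layer

    def forward(self, h):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)  # reparametrization: z = mu + sigma * eps
        return z, mu, logvar
```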
The Adversarial Autoencoder (AAE) model (Fig. 1B) uses the same GRU model as the RNN but exchanges the reparameterized latent space proposed in Kingma and Welling16 for the adversarial counterpart proposed in Makhzani et al.42 As proposed in ref. 42, this model regularizes the latent space with an attached neural network, named the discriminator, that matches the aggregated posterior q(z) to the desired prior p(z), which in this model is a Gaussian prior. The discriminator is a fully connected network with 256 hidden layers, each with a fixed hidden vector of size 640. The encoder network acts as the embedder for the decoder and as the “generator” for the discriminator. The decoder learns to reconstruct the original data, as in the RNN model, while the discriminator distinguishes between an ideal prior, modeled by a vector filled with normally distributed noise as a proxy for a Gaussian prior, and the current latent vector. This model is hypothesised to perform well on reconstruction tasks given the relative success of adversarial models on the task of image generation.45,46
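The adversarial regularization step can be sketched as below; this is our own minimal illustration of the standard AAE objective, with the `discriminator` handle standing in for the fully connected network described above.

```python
import torch
import torch.nn.functional as F

def aae_regularization_step(encoder_z, discriminator):
    """One adversarial regularization step: 'real' = prior noise, 'fake' = encoded latents."""
    prior_z = torch.randn_like(encoder_z)          # samples from the Gaussian prior
    d_real = discriminator(prior_z)
    d_fake = discriminator(encoder_z.detach())
    # The discriminator learns to tell prior samples (label 1) from encoded latents (label 0).
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    # The encoder ("generator") is trained to fool the discriminator into labeling latents as prior.
    g_loss = F.binary_cross_entropy_with_logits(discriminator(encoder_z),
                                                torch.ones_like(d_real))
    return d_loss, g_loss
```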
The WAE model proposed in Tolstikhin et al.43 introduced a new family of regularized autoencoders, named Wasserstein Autoencoders, that minimize the optimal transport cost Wc(p(x), p′(x)), where p(x) is the true data distribution, p′(x) is the generative model data distribution, and c is a cost function. The WAE's regularizer, computed as part of the overall loss function, forces the continuous mixture of posteriors to match the prior p(z), instead of forcing each individual posterior qϕ(z|x = xi) to match the prior as is done in the VAE. The WAE constructed in this work (Fig. 1C) uses the MMD penalty for prior regularization.47 The WAE implemented here uses the same model as the RNN but exchanges the split latent space for a single linear layer. The regularization is performed in the loss function, where the inverse multiquadratic kernel basis function is used and the MMD is computed between the latent vector and a vector with values sampled from a normal distribution.
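For illustration, a minimal sketch of an MMD penalty with an inverse multiquadratic kernel is given below; the kernel scale `c` and the unbiased-estimator form are assumptions of this sketch rather than details taken from the authors' implementation.

```python
import torch

def imq_kernel(x, y, c=1.0):
    """Inverse multiquadratic kernel k(x, y) = c / (c + ||x - y||^2)."""
    sq_dist = torch.cdist(x, y, p=2) ** 2
    return c / (c + sq_dist)

def mmd_loss(z, c=1.0):
    """MMD between a batch of encoded latents z and samples from a standard normal prior."""
    prior = torch.randn_like(z)
    k_zz, k_pp, k_zp = imq_kernel(z, z, c), imq_kernel(prior, prior, c), imq_kernel(z, prior, c)
    n = z.shape[0]
    # Unbiased estimate: exclude the diagonal terms of the within-sample kernel matrices.
    return ((k_zz.sum() - k_zz.diag().sum()) / (n * (n - 1))
            + (k_pp.sum() - k_pp.diag().sum()) / (n * (n - 1))
            - 2 * k_zp.mean())
```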
The RNN with attention (RNN-Attention) model (Fig. 1D) attaches a single self-attention head after the GRU of the RNN model to leverage the attention mechanism.48 The attention mechanism allows a model to store sequence-wide connection weights between tokens in a sequence. Each token in the sequence is made to “attend” to, and give a weight to, all other tokens in the sequence, thus creating an attention map over all the tokens in the sequence. A single attention layer has been shown to increase interpretability and improve reconstructions over longer sequences.28
The VAE-Transformer (Transformer) model (Fig. 1E) integrates a VAE reparameterization into a full Transformer model as proposed in Vaswani et al.49 and implemented in Dollar et al.28 Transformers have taken the world of sequence-to-sequence learning by storm and have demonstrated their ability to create extended relationships between sequence tokens that can be readily interpreted through attention head visualisations.50–52 The Transformer no longer makes use of the GRU; instead, it first sends the inputs to a self-attention head layer that outputs attention weights that are matrix-multiplied onto the output vector. The output is then sent to a convolutional bottleneck that filters over the joined attention weights and embedded sequence before sending the output to the mean and log-variance linear bottleneck layers, which reduce the dimensionality of the latent vector to one of 32, 64 or 128 dimensions. Once the VAE section is passed and noise has been added to the latent vector, the result is sent back to a convolutional layer followed by the masked attention and source attention layers as originally outlined in Vaswani et al.49
All models feature an associated property predictor in the form of a binary classifier that predicts whether the latent representation of a given peptide corresponds to a verified AMP or not. This classifier is a two-layer fully connected neural network trained with a binary cross-entropy loss that is joined to the KL-divergence loss and the peptide reconstruction loss of each model.
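A minimal sketch of how such a predictor head and joint loss could be assembled is shown below; the layer widths, the loss weighting, and the use of β as the KL weight (cf. the annealed β values in Table 1) are illustrative assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PropertyPredictor(nn.Module):
    """Two-layer classifier operating on latent vectors (sizes are illustrative)."""
    def __init__(self, latent_dim=64, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, z):
        return self.net(z).squeeze(-1)  # logit for "verified AMP"

def total_loss(logits_seq, targets_seq, mu, logvar, amp_logit, amp_label, beta=0.05):
    # Token-level reconstruction: logits_seq is (batch, length, vocab), targets_seq is (batch, length).
    recon = F.cross_entropy(logits_seq.transpose(1, 2), targets_seq)
    # KL divergence to N(0, I), averaged over batch and latent dimensions.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Binary cross-entropy for the verified-AMP / non-verified-AMP label.
    prop = F.binary_cross_entropy_with_logits(amp_logit, amp_label.float())
    return recon + beta * kl + prop
```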
Table 1 Model training hyperparameters

| Hyperparameter | Value |
|---|---|
| Batch size | 200 |
| β final | 5 × 10−2 |
| β initial | 1 × 10−8 |
| Adam learning rate | 3 × 10−4 |
| Epochs | 300 |
Model training hyperparameters are outlined in Table 1.
In Fig. 2, we plot the average position-based reconstruction accuracy 〈Racc,i〉 versus amino acid index for the five models and for three different latent space sizes, and, in Fig. 3, we plot the entire-sequence accuracy. The position-based reconstruction accuracy measures the model's ability to predict the correct amino acid at a given index in the sequence and is defined as,
$$\langle R_{\mathrm{acc},i} \rangle = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathbb{1}\!\left[\hat{x}_{j,i} = x_{j,i}\right], \qquad (1)$$

where $N_i$ is the number of test sequences containing at least $i$ residues, $x_{j,i}$ is the true amino acid at position $i$ of sequence $j$, $\hat{x}_{j,i}$ is the model's reconstruction of that amino acid, and $\mathbb{1}[\cdot]$ is the indicator function.
The mean overall sequence accuracy 〈Racc〉 is a measure of how many full sequences the model was able to correctly reconstruct in their entirety,
$$\langle R_{\mathrm{acc}} \rangle = \frac{1}{N} \sum_{j=1}^{N} \prod_{i=1}^{L_j} \mathbb{1}\!\left[\hat{x}_{j,i} = x_{j,i}\right], \qquad (2)$$

where $N$ is the total number of test sequences and $L_j$ is the length of sequence $j$.
Models of 64 dimensions or greater generally display moderate to high performance on the reconstruction task. They exhibit an accuracy of at least 60% for tokens up to position twenty (Fig. 2), which is of interest because most AMPs have sequence lengths of under twenty amino acids. The entire-sequence reconstruction accuracy (Fig. 3) is between 50% and 70% for the 128-dimensional AAE, RNN, RNN-Attention, and WAE, and about 97% for the Transformer, although reconstruction accuracy diminishes with decreasing latent space dimensionality.
It is of interest to note that in all models except the 128-dimensional Transformer, we observe an almost monotonic decrease in reconstruction accuracy for tokens later in the sequence. All 32-dimensional models except the Transformer approach the accuracy of a random guess (5% for any one of twenty possible amino acids) near the 50th position. This clear length-dependent effect on accuracy is expected behavior arising from the increasing difficulty of the predictive task for each successive token position: the prediction depends primarily on the previous tokens in the sequence, leading to a compounding error effect. The difficulty of this task is also exacerbated by model training with teacher forcing, which corrects mistakes earlier in the sequence during training but not during testing. The 128-dimensional Transformer clearly has the capacity to represent and retain entire sequences in a holistic manner, as it was intended to do.
Indeed, the Transformer model achieves by far the highest accuracy (97%, 128 dimensions) but is also most affected by a diminishing latent dimensionality: the mean accuracy drops precipitously from 97% to 26% when going from 128 dimensions to 64 dimensions, whereas the AAE and the RNN both achieve nearly 50% reconstruction accuracy with 64 dimensions. All models display increasing accuracy with increasing dimensionality. All attention-based models display a greater improvement when increasing from a dimensionality of 64 to a dimensionality of 128 than when increasing from a dimensionality of 32 to a dimensionality of 64.
We employ a simple binary classifier trained to predict whether a given sequence falls within the experimentally validated AMP category or the unvalidated category. Although such a predictor can be of use in terms of assessing the specific properties of some given sequence of interest, in this case we use it rather as a tool to order the space such that we may expect interpolation between points in the latent space to generate sequences with desirable properties. In other words, for our case the property predictor primarily functions to ensure that the neighboring points of an experimentally-verified AMP in the latent space will display similar properties. We can consider this to be a way of enforcing a meaningful distance metric and quantifying how well we have done so. In Sec. 3.4, we will also address the question of how meaningful it is for us to identify this “ordering” in a linear projection rather than in the full latent space.
After model training with the binary classifier, we perform principal components analysis on the associated latent space of each model (32-dimensional, 64-dimensional, or 128-dimensional, respectively) and project into the top five highest-variance components of the PCA decomposition. Based on previous studies that employed PCA for two-dimensional projections, we expected to observe latent space ordering – that is, a partitioning of the data points according to the label predicted by the property predictor – in the top one or two principal components of each model. To our surprise, this was not the case (Fig. 5). Further investigation revealed at least some ordering in one or two of the top five principal components for four of the five models. The fifth model, the RNN-Attention, performs so poorly that we do not expect to observe ordering no matter how many PCs we analyze. This indicates that despite being explicitly “instructed” – via the loss function – of the importance of the verified-AMP/non-verified-AMP labeling, most models do not “pay attention” to this to the exclusion of other characteristics of the sequences. We may quantify this idea further by investigating the fractional variance explained by different PCs as a proxy for the amount of information contained. When the extent to which these different components capture the overall latent space variance is assessed (Fig. S3†), we observe that the models differ significantly in how much fractional variance explained (FVE) is contained within the top five PCs, ranging from only about 10% in the Transformer-128 model to over 40% in the AAE-128 model. The more variance is contained within the top five PCs, the more it seems to be concentrated in the first PC: overall, we observe that the AAE and RNN demonstrate a much more significant drop from PC1 to PC2 than the other three models. Trends are largely independent of latent space dimensionality. In general, the Transformer and WAE models of various sizes contain the least FVE in their top five components, whereas the AAE models contain the most. We observe that in all models other than the Transformer, PC1 is correlated with molecular weight or length (see Sec. 3.4), with all other variances being of roughly equivalent magnitudes. Overall, we may interpret this as telling us that the models place greatest importance on length, while other descriptive variables, even those we attempt to emphasize a priori, are assigned similar levels of importance.
We quantitatively assess the extent to which the clusters of AMPs and non-AMPs are separated from one another in the model latent spaces and in pairs of the top five principal components (Fig. 4). We do so through the use of the Silhouette score,53 a well-known metric for assessing cluster distinctness, applied to the known labeling of the dataset peptides.
For a point $i$, the Silhouette score is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad (3)$$

where $a(i)$ is the mean distance from point $i$ to the other points sharing its label (verified-AMP or non-verified-AMP), $b(i)$ is the mean distance from point $i$ to the points with the opposite label, and the reported score $SS$ is the mean of $s(i)$ over all points considered.
Fig. 4 The maximal Silhouette scores, SS, for the five models with corresponding three latent sizes. Silhouette scores are shown for the principal component pairs featuring the largest score of the top five PCs calculated with PCA (see Fig. S4† for Silhouette scores of all pairs of top five PCs). The PC combinations from left to right are, AAE: [2,3], [1,5], [3,4], RNN: [4,5], [1,2], [1,2], RNN-Attention: [3,4], [3,4], [2,3], Transformer: [1,5], [2,4], [1,3], WAE: [2,5], [2,4], [2,5]. The error bars are generated by bootstrapped sampling of the latent space and calculating a 95% confidence interval computed with the assumption of Bernoulli sampling. The flier points indicate outliers from the interquartile range of the whiskers.
We use the Silhouette score as our metric because its boundedness gives it a natural interpretation in this context. Scores that are too close to zero indicate that no clustering has occurred, whereas scores too close to one indicate a disconnected latent space, which might mean the model has failed at satisfying the Gaussian prior. Ideally, therefore, we would observe moderate Silhouette scores, and indeed we do. The Silhouette score for each PC pair is calculated by taking subsamples of PC pair vectors and computing their Silhouette score relative to the known verified-AMP/non-verified-AMP labels. Once this has been computed for all pairs of PCs [1–5], we then find the pair with the highest Silhouette score.
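A sketch of this PC-pair search using scikit-learn is given below; the subsample size and the single-subsample simplification (the paper reports bootstrapped confidence intervals) are assumptions of this illustration.

```python
from itertools import combinations

import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def max_silhouette_over_pc_pairs(latent, labels, n_pcs=5, n_sub=2000, seed=0):
    """Return the best (PC pair, Silhouette score) among all pairs of the top PCs."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pcs = PCA(n_components=n_pcs).fit_transform(latent)
    idx = rng.choice(len(labels), size=min(n_sub, len(labels)), replace=False)
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(n_pcs), 2):
        score = silhouette_score(pcs[idx][:, [i, j]], labels[idx])
        if score > best_score:
            best_pair, best_score = (i + 1, j + 1), score  # report 1-indexed PC labels
    return best_pair, best_score
```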
If we rank the models by maximum Silhouette score over all pairs of the top five PCs, the Transformer and AAE perform the best, reaching maxima of SS = 0.51 ± 0.02 for the Transformer at 128 dimensions and SS = 0.3 ± 0.03 for the AAE at 128 dimensions, in PCs [1,3] and PCs [2,3] respectively; both demonstrate moderate maximal clustering ability and an increase in clustering ability with latent space dimensionality. The WAE demonstrates poorer performance, displaying a maximum of SS = 0.24 ± 0.02 in PCs [2,4] at 64 dimensions; this performance is largely independent of latent space dimensionality. The RNN and RNN-Attention demonstrate the worst abilities as measured by the Silhouette score, with maxima of SS = 0.1 ± 0.02 in any of the top five PCs for the RNN and SS = 0.02 ± 0.02 for the RNN-Attention.
To better visualize the latent space, in Fig. 5, we plot the test set of sequences embedded in the reduced latent representation for each model and color the scatter points according to whether they are verified-AMPs (orange) or non-verified-AMPs (blue). We observe an agreement between the Silhouette scores previously discussed and the visual clustering in the scatter plots. The AAE-128 and Transformer-128 models form more distinct clusters than the other models, though all clusters still overlap as desired.
Fig. 5 Scatter plot of maximally AMP-separating PCs presented as a visualization of the different latent spaces. We embed the test sequences into the latent space, perform PCA, and illustrate the PC pair corresponding to the largest Silhouette score (cf. Fig. 4). The PC combinations are as follows, AAE: [2,3], [1,5], [3,4], RNN: [4,5], [1,2], [1,2], RNN-Attention: [3,4], [3,4], [2,3], Transformer: [1,5], [2,4], [1,3], WAE: [2,5], [2,4], [2,5]. Orange points denote verified-AMP sequences, and blue points denote known non-verified-AMP sequences. Different models correspond to different visible separations in the latent space, though most models demonstrate at least some separation of AMPs and non-AMPs.
In summary, the latent spaces of the models were found to exhibit ordering. We quantified the ordering with the Silhouette score. Surprisingly, we found that AMP correlations are not always present in the first or second PC pairs and as such it is important to investigate downstream PCs before ruling out potential correlations in the model latent space. Once a correlation has been identified, the variance explained per PC can serve as an approximate weight factor indicating the importance attributed to this ordering by the model. For the 128-dimensional AAE, summing the variance of the two top PCs, this is about 10%; for the 128-dimensional Transformer, it is about 4%.
In contrast to the previous section, where we focused on PC pairs, in this section, for ease of calculation and visualization, we employ only a single variable per model variant. We choose the principal component (PC) having the largest Silhouette score with verified-AMP/non-verified-AMP labeling (see Fig. 9A and further discussion of PC identification in Sec. 3.4). We identify the “width” of the latent space by calculating the mean and standard deviation over the embedded sequences. Then we sample from ten evenly spaced regions that form a “line across” the latent space, limiting the min and max values of the region centers to ±4 times the standard deviation. The random sampling performed in each region follows a normal distribution with the mean determined by the region center and the standard deviation set to 20% of the standard deviation of the latent space. For external validation on a large number of generated sequences – difficult to perform experimentally – we assess in each neighborhood the probability of generating AMPs ρAMP (Fig. 6) by employing a previously published QSAR model. AMPlify is a deep neural network AMP classifier that has been shown to achieve 93.71% accuracy on the task of AMP classification.54
Fig. 6 Traversing the PCs with highest AMP correlation from the smallest embedded value of each PC (min) to the largest embedded value (max), scaled to the same axis for comparison. The AMPlify QSAR classifier54 is a binary classifier that returns true or false, and an average AMP probability is taken over the 100 local samples. All 5 models and their respective latent dimension variants are plotted. The error bars are generated by calculating a 95% confidence interval on the AMP probabilities computed with the assumption of Bernoulli sampling.
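The traversal procedure described above can be sketched as follows. This is one plausible reading of the procedure (sampling in PCA coordinates, then inverting back to the latent space); the `pca` object and the `decode` wrapper are hypothetical handles standing in for a fitted PCA and a trained decoder.

```python
import numpy as np

def traverse_pc(pca, embeddings, decode, pc_index, n_regions=10, n_samples=100, seed=0):
    """Sample sequences in evenly spaced neighborhoods along one principal component."""
    rng = np.random.default_rng(seed)
    pcs = pca.transform(embeddings)
    mean, std = pcs.mean(axis=0), pcs.std(axis=0)
    # Ten region centers spanning +/- 4 standard deviations along the chosen PC.
    centers = np.linspace(mean[pc_index] - 4 * std[pc_index],
                          mean[pc_index] + 4 * std[pc_index], n_regions)
    batches = []
    for c in centers:
        # Gaussian noise around the region center with 20% of the latent-space spread.
        samples = rng.normal(mean, 0.2 * std, size=(n_samples, pcs.shape[1]))
        samples[:, pc_index] = rng.normal(c, 0.2 * std[pc_index], size=n_samples)
        z = pca.inverse_transform(samples)          # back to the model latent space
        batches.append([decode(v) for v in z])      # decode() is a hypothetical model wrapper
    return centers, batches
```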
As in the previous section, where we demonstrated clustering of the test set embedded in the latent space, we expected to observe partitioning of each PC into areas of low and high AMP probability, and indeed in most cases we did. Because the task given to the models by the property predictor was essentially one of partitioning but there was no constraint on the manner in which to perform the partitioning, a partitioning of the latent space in any direction is equally expected; that is, higher probability of AMPs could be localized in various different areas across the PC.
We observe little or no (ρAMP < 0.4 for all tested values) partitioning for the 32- and 64-dimensional AAE and RNN-Attention. We observe a semi-linear partitioning – as the PC is increased or decreased, an area of zero AMP probability is followed by a monotonic, almost linear rise in probability – for all other models except the 128-dimensional RNN-Attention. Interestingly, the 64-dimensional RNN-Attention model shows a non-linear partitioning of its space with a region of high AMP probability, even though the PCA visualisation and the Silhouette score for this model did not demonstrate a clustering of AMPs in its latent space. Thus, we show that predicting experimentally-verified antimicrobial properties, even though many of the non-hits may also be AMP-like in nature, is in many cases sufficient to partition the space into regions of high and low antimicrobial peptide probability in general.
We next evaluate the local properties of the latent space as we interpolate across the individual PCs. In addition to demonstrating generative partitioning in the latent space, we assess the following quantities in each neighborhood and take a global average across the neighborhoods to gain a general understanding of latent space properties in the vicinity of the AMPs: (i) sequence similarity ρsim, and (ii) Jaccard similarity of 2-mers (J2) and 3-mers (J3) in the sequence.
We assess the ability of the model to generate sequences that are dissimilar to one another through the uniqueness,19 〈μ〉, which we compute as an average probability by generating a random sample si of 100 sequences and counting the number that are distinct from one another,
$$\langle \mu \rangle = \frac{\left|\,\mathrm{set}(\{s_1, \ldots, s_n\})\,\right|}{n}, \qquad n = 100, \qquad (4)$$

where $\mathrm{set}(\cdot)$ denotes the collection of distinct sequences in the sample.
We compute the pairwise similarity between all generated sequences in the neighborhood using the pairwise2 command from the Biopython55 package, with the Blosum62 (ref. 56) substitution matrix as the scoring function. Pairwise similarity is measured by identifying an optimal sequence alignment from bioinformatics arguments and then computing the scores of amino acids that exhibit matching physicochemical classes.
The Jaccard similarity of two sets is defined as follows,
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}. \qquad (5)$$
We assess the Jaccard similarity between all pairs of peptides in the local neighborhood by considering the sets of overlapping k-mers that describe the sequence, where a k-mer of a peptide is a subsequence of length k, and the spectrum of k-mers has previously been demonstrated to contain significant information about AMP properties.57 We consider k-mers of lengths two and three in this analysis. 0 ≤ J(A,B) ≤ 1 by design, and a Jaccard similarity of zero implies that two sequences share no subsequences of length k whereas a Jaccard similarity of one implies that two sequences are composed of identical subsequences.
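A minimal implementation of the k-mer Jaccard similarity of eqn (5) is shown below; the example sequences are arbitrary illustrations, not dataset entries.

```python
def kmers(seq, k):
    """Set of all length-k subsequences of a peptide sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_kmer(seq_a, seq_b, k=2):
    """Jaccard similarity between the k-mer sets of two sequences (eqn (5))."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Example: J2 and J3 between two arbitrary peptide strings.
print(jaccard_kmer("GKIIKLKASLKLL", "GKIIKASLKLLAK", k=2))
print(jaccard_kmer("GKIIKLKASLKLL", "GKIIKASLKLLAK", k=3))
```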
The sequence similarity score is a relative evaluation based on the Blosum62 matrix scores, whereas the Jaccard similarity score is an absolute measure of the similarity between two sequences where identical sequences return a score of one. It should be noted, however, that the number of matches grows more quickly than the size of the set – i.e. the numerator of the Jaccard score grows more quickly than the denominator. This means that a comparatively low Jaccard score can still indicate a reasonable amount of similarity. Very roughly, for the 2-mers, a Jaccard score of 0.06 is the expected value for two random sequences of length 50, and it is significantly lower than that for the 3-mers.
The size of the latent space is not seen to have a drastic impact on the sequence similarity of peptides found in a local neighbourhood for any of the models. An interesting dynamic is displayed for some models, in which the samples become slightly more diverse as the center of the latent space is approached. This is seen in the “U” shape visible especially in the RNN and WAE models, and is likely due to the Gaussian nature of the space packing more peptides near the center, thus increasing sampling diversity.
The Jaccard similarities for the 2-mers and 3-mers are rather low, but still significantly higher than would be predicted by random chance (Fig. 7). Due to the nature of the Jaccard score, this indicates a balance between shared and dissimilar fragments, particularly for the Transformer model. The 128-dimensional Transformer has significantly higher 2-mer and 3-mer similarity scores, ranging from 0.24 to 0.34 for 2-mers and from 0.11 to 0.26 for 3-mers, than all other models (<0.22). This is a good indication that local points in the space share similar constituent sequence fragments. Interestingly, we observe heightened similarity for the RNN and WAE models in the region corresponding to heightened AMP probability (i.e. towards the PC maximum value), whereas for the Transformer-128 the trend is reversed: there is a region of heightened similarity near the minimum, corresponding to a low probability of AMP-like sequence generation, and a region of somewhat lower similarity near the maximum, corresponding approximately to the region of heightened probability of AMP-like sequence generation.
Overall, most models perform similarly when reconstructing peptides in a local neighbourhood and this applies to both entire sequences and local k-mer segments. In Sec. 6.1 in the ESI,† we also show that the models generate disparate peptides not found in the testing or training sets, as desired.
We measure PCA distortion of the latent space with four metrics: trustworthiness ($\mathcal{T}$), continuity ($\mathcal{C}$), steadiness, and cohesiveness.58 Trustworthiness is a measure of the introduction of false neighbors of a point when it is projected from high to low dimensions,
$$\mathcal{T}(k) = 1 - \frac{2}{nk(2n - 3k - 1)} \sum_{i=1}^{n} \sum_{j \in U_k(i)} \left( r(i,j) - k \right), \qquad (6)$$

where $n$ is the number of points, $U_k(i)$ is the set of points that are among the $k$ nearest neighbors of point $i$ in the projected space but not in the original space, and $r(i,j)$ is the rank of point $j$ among the neighbors of $i$ in the original space.
Continuity is a measure of the loss of true neighbors of a point when it is projected from high to low dimensions,
$$\mathcal{C}(k) = 1 - \frac{2}{nk(2n - 3k - 1)} \sum_{i=1}^{n} \sum_{j \in V_k(i)} \left( \hat{r}(i,j) - k \right), \qquad (7)$$

where $V_k(i)$ is the set of points that are among the $k$ nearest neighbors of point $i$ in the original space but not in the projected space, and $\hat{r}(i,j)$ is the rank of point $j$ among the neighbors of $i$ in the projected space.
Steadiness is a measure of the loss of existing groups, and cohesiveness is a measure of the introduction of false groups. Briefly, the steps to calculate the steadiness and cohesiveness are outlined below; we direct the interested reader to Jeon et al.58 for further details, including the explicit aggregation formulas (eqns (8) and (9)). First, one computes the shared-nearest-neighbor distance between points in the original space and in the projected space and constructs a dissimilarity matrix identifying compression and stretching of point pairs. Then, average partial distortions are computed by randomly extracting clusters from one space and evaluating their dispersion in the opposite space. Once such partial distortions are known, one can aggregate the results into steadiness and cohesiveness.
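As a practical note, the first two metrics can be estimated directly with scikit-learn. The sketch below is our own illustration (not the authors' code): it uses the standard identity that continuity equals trustworthiness with the roles of the two spaces exchanged, and it leaves out steadiness and cohesiveness, which require the clustering-based procedure of Jeon et al.58

```python
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

def projection_quality(latent, n_components=5, n_neighbors=10):
    """Trustworthiness and continuity of a PCA projection of the latent space."""
    projected = PCA(n_components=n_components).fit_transform(latent)
    trust = trustworthiness(latent, projected, n_neighbors=n_neighbors)
    # Continuity: trustworthiness with the original and projected spaces swapped.
    cont = trustworthiness(projected, latent, n_neighbors=n_neighbors)
    return trust, cont
```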
For all models except the Transformer, scores are >0.5 for all four metrics, and the scores tend to hover around 0.75, with some exceptions (Fig. 8). While the 128-dimensional Transformer performs best according to the steadiness score (∼0.8), the Transformer model in general performs worse on all other metrics than the other models, particularly cohesiveness, for which all three Transformer models perform worse than any other model, and in particular the 128-dimensional Transformer displays an exceedingly poor performance of just over 0.25.
Overall we observe it is possible to construct linear projections of model latent spaces with comparatively low overall distortions; however, the model that has thus far performed particularly highly on other metrics (the 128-dimensional Transformer) has the greatest distortion, particularly in terms of cohesiveness, meaning it potentially introduces a significant number of false groupings. The Transformer's comparatively high distortion overall underscores one of the traditional trade-offs of machine learning: with greater power comes lessened interpretability.
Now that we have an idea of how confident we may be in the representation of the PCs for the model latent space, we investigate the identification of “bridge variables” for further interpretation of the latent space. Bridge variables are known quantities of relevance in a scientific problem that show correlations with unknown variables, allowing heightened interpretability of many nonlinear problems.59 We consider a series of physicochemical properties of peptides that are measurable from sequence alone: aliphatic index, Boman index, isoelectric point, charge at pH = 3, charge at pH = 7, charge at pH = 11, hydrophobicity, instability index, and molecular weight. We measure each of these properties for the test set sequences using the peptides python package.60 We employ the Silhouette score in one dimension to measure correlation between individual PCs and verified-AMP/non-verified-AMP labeling, and we use the Pearson correlation coefficient (r) to measure correlation between individual PCs and the continuous physicochemical properties.
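A sketch of this bridge-variable correlation analysis is shown below. It assumes the interface of the peptides package (method and scale names may differ between versions) and uses a plain Pearson correlation via numpy; it is an illustration rather than the authors' analysis script.

```python
import numpy as np
import peptides

def pc_property_correlations(sequences, pcs, prop="hydrophobicity"):
    """Pearson r between each of the top PCs and one physicochemical property."""
    getters = {
        "hydrophobicity": lambda p: p.hydrophobicity(scale="KyteDoolittle"),
        "boman": lambda p: p.boman(),
        "charge_ph7": lambda p: p.charge(pH=7.0),
        "isoelectric_point": lambda p: p.isoelectric_point(),
        "molecular_weight": lambda p: p.molecular_weight(),
    }
    values = np.array([getters[prop](peptides.Peptide(s)) for s in sequences])
    # One correlation coefficient per principal component column of pcs.
    return [np.corrcoef(pcs[:, i], values)[0, 1] for i in range(pcs.shape[1])]
```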
The aliphatic index61 is defined as the relative volume occupied by the aliphatic side chains,
$$\text{Aliphatic index} = X(\text{Ala}) + a\,X(\text{Val}) + b\,\bigl(X(\text{Ile}) + X(\text{Leu})\bigr), \qquad (10)$$

where X(Ala), X(Val), X(Ile), and X(Leu) are the mole percentages of alanine, valine, isoleucine, and leucine in the sequence, and a = 2.9 and b = 3.9 are the volumes of the valine and leucine/isoleucine side chains relative to the alanine side chain.
The hydrophobicity is a measure of the degree to which the peptide is hydrophobic, calculated by averaging the hydrophobicity values of each residue by using the scale proposed in Kyte and Doolittle.64 The instability index predicts whether a peptide will be stable in a test tube as presented in Guruprasad et al.65 The molecular weight66 is the average molecular weight of the peptide found by summing the individual masses of its amino acids and is directly correlated with sequence length.
In Fig. 9, we identify the single PC with the highest correlation to each of the potential bridge variables and illustrate the value of that maximal correlation. We note that although the comparatively low Silhouette score in one dimension indicates that the score in two dimensions is a more appropriate quantification of ordering in the system, we may use the one-dimensional score to indicate which PC each model considers “most” relevant for verified-AMP ordering, and thus to identify whether those PCs simultaneously correlate with physicochemical properties. For example, the 32-dimensional AAE, the 64-dimensional WAE, RNN and Transformer, and the 128-dimensional AAE and RNN models each employ one of the top five PCs most strongly for AMP ordering (Fig. 9A), and the same PC is also correlated with charge at pH = 7 (Fig. 9B) and isoelectric point (Fig. 9F) for these models (Fig. 9H).
For all models but the Transformer, the first principal component is highly correlated with molecular weight (Fig. 9I), which makes sense for the RNN, AAE, and WAE, as all are length-dependent models. That the RNN-Attention model also exhibits this behavior demonstrates the need to commit fully to a Transformer model to avoid a significant component of the model's variance being devoted to sequence length. The comparative performance of the AAE with the Transformer shows, however, that it is not necessary to remove the length dependence to enforce ordering capability in the model, as long as more than the first and second PC are considered. In a more general sense, any model with a high-variance component built in as part of the architecture will likely demonstrate this behavior.
In these plots we find that the models distribute the correlations across all of the first five PCs. Generally, the Transformer model is observed to make use of the first PC more often, likely because it does not feature a length-based correlation in the first PC.
Using these correlation coefficients it is possible to further characterise the ordering of the latent space and define directions along which interpolation should occur such that desirable characteristics will emerge from generated peptides. Each PC can be interpreted as a linear combination of the respective model's latent dimensions (32, 64 or 128). From this interpretation it is possible to construct a direct relation between the latent dimensions of the model and the physicochemical properties investigated through the PCA mapping; we have made the PCA mappings themselves available on the project GitHub repository, https://github.com/bcgsc/AMPlify.
The pipeline takes an AMP of interest as input. The AMP is passed through the encoders of the different models and transformed into a model-specific latent representation. The AMP latent vector is appended to the embedded training data and is transformed into linear representations with PCA (Fig. 10).
We then proceed to choose variables of interest to hold constant in the latent space and perturb the others. This allows us to find new peptides with the properties of interest maintained. We choose to maintain the verified-AMP-correlated PC and the hydrophobicity-correlated PC, to produce sequences more likely to share experimental AMP-likeness and hydrophobicity – an important AMP property – with a chosen sequence not in the training set. We note that this procedure could be performed with any identified property near any input point of interest. We study a synthetic β-sheet-forming peptide called GL13K.67 We use the previously mentioned correlation coefficients (Sec. 3.4) to find the verified-AMP-correlating PC and the top hydrophobicity-correlating PC. Having kept track of the PCA vector for our AMP of interest, we perform sampling by keeping our two PCs of interest (AMP- and hydrophobicity-correlated) fixed and adding Gaussian noise to all other PCs. The added noise variance must be tuned individually for each model, as the latent space organization is different for each model. The variance tuning is performed by identifying the mean and standard deviation of all training dataset points in PCA space and then sampling Gaussian noise centered at the mean with 1/5 the standard deviation. This noise vector is then summed with the PCA vector of our AMP of interest, thus shifting the sampling location to be near the AMP of interest. Once we have the noisy nearby PCA samples, we perform an inverse PCA transformation returning to our latent space vector representation, demonstrating one particularly valuable property of the PCA approach. We then pass the noisy latent samples to the decoder, which generates the desired candidate peptide sequences.
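A compact sketch of this constrained sampling pipeline is given below; it is one plausible reading of the procedure, and the `encode`, `decode`, and `pca` objects are hypothetical handles for a trained encoder, decoder, and fitted PCA (PC indices are 0-based in the code, in contrast to the 1-based labels in the text).

```python
import numpy as np

def sample_near_amp(seq, encode, decode, pca, train_pcs, fixed_pcs, n_samples=5, seed=0):
    """Generate candidates near an input AMP while holding selected PCs fixed."""
    rng = np.random.default_rng(seed)
    anchor = pca.transform(encode(seq).reshape(1, -1))[0]  # PCA coordinates of the input AMP
    std = train_pcs.std(axis=0)                            # spread of the embedded training data
    candidates = []
    for _ in range(n_samples):
        noise = rng.normal(0.0, 0.2 * std)                 # perturbation with 1/5 of the training std
        noise[list(fixed_pcs)] = 0.0                       # keep AMP- and hydrophobicity-correlated PCs fixed
        z = pca.inverse_transform(anchor + noise)          # back to the model latent space
        candidates.append(decode(z))
    return candidates

# Example usage with hypothetical PC indices for the fixed components:
# new_seqs = sample_near_amp(input_seq, encode, decode, pca, train_pcs, fixed_pcs=[2, 4])
```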
We demonstrate the use of the pipeline on GL13K. For each model, we generate five different sequences and display their properties (Fig. S6–S8†). In the ideal case we should observe little change in the hydrophobicity of the samples while the other physicochemical properties vary, and in general this is what we observe. We observe that certain properties of random sequences generated in the neighborhoods tend to remain more constant irrespective of model (aliphatic index, charge at pH = 3, molecular weight, isoelectric point), certain properties tend to be more variable (Boman index, charge at pH = 11, instability index), and certain properties are more dependent on the model. Hydrophobicity varies the most in the AAE model and the least in the WAE model. One benefit of employing multiple models is the ability to sample different local neighborhoods of the input point, with potentially different properties; another benefit is including both more interpretable and more high-performing models to generate samples.
The results from the pipeline for the AAE-128, Transformer-128 and WAE-128 presented in the ESI figures (Fig. S6–S8†) show that by locking PCs of interest and sampling nearby points along other PCs we can generate novel peptides that have similar properties to the original GL13K. While some generated samples feature drastic changes in certain properties, such as the fourth peptide from the AAE-128 with a charge of 1 at pH 11 or the generated peptides with −1 charges at pH 11 from both the Transformer-128 and WAE-128 models, most properties fall near the original GL13K sequence properties.
This general pipeline can be extended to any peptide function of interest by identifying the top PCs correlating with the desired function and keeping those fixed while noise is added to the remaining PCs, in order to sample within the relevant functional neighbourhood.
We have investigated the use of the principal components with the highest clustering of verified-AMP properties for de novo AMP generation and have shown that our models generate highly diverse and unique sequences, with comparatively low sequence similarity in local neighborhoods. Despite the low similarity and the use of a predictor trained on experimentally-verified AMP properties rather than direct knowledge of AMP-likeness, all models but the 32- and 128-dimensional RNN-Attention and 64-dimensional AAE are capable of successfully partitioning a single coordinate in the latent space into regions that generate AMP-like sequences with high probability and regions that generate non-AMP-like sequences with high probability. We observe that the capacity of the model to reconstruct input sequences is not clearly linked to its ability to partition the space, and we add our voices to a number of cautions against the over-use of reconstruction accuracy as a metric for generative models.
We have evaluated the extent to which models order their latent spaces according to standard peptide physicochemical properties that they are not trained on and identify the principal components most strongly correlated to given properties. We find that the models will order any of the first five principal components investigated according to physicochemical properties but when a model needs to assign a larger proportion of the variance to learning peptide length the first component is usually correlated with length/molecular weight. Indeed, only the 128-dimensional Transformer eschews length as a consideration almost entirely (no length dependence observed in the reconstruction accuracy, small correlation between any top five PC and molecular weight). The Transformer in general clearly has the capacity to function independent of the length but demonstrates a more rapid drop in performance as the latent space dimensionality is decreased than the RNN, WAE, or AAE. We speculate that this is due to the nature of the task: discriminating between lengths can actually discourage models from overfitting; that is, from simply “memorizing” the answers, and may encourage a more meaningful lower-dimensional representation, although we also note that the 128-dimensional Transformer shows by far the most heightened local fragment-based similarity in its verified-AMP-relevant PC. This could suggest a model relying on fundamentally different information from the others.
In terms of other relevant physicochemical variables, we observe moderate correlations (0.35 ≲ rPearson ≲ 0.65) between at least one PC and isoelectric point/charge at pH 7 in the AAE-64, RNN-32, RNN-64, Transformer-64 and all WAE models and moderate correlation between at least one PC and hydrophobicity in all but the RNN-Attention-32. As these are traditional hallmarks employed for AMP design, this is desirable behavior, and in particular aids in the interpretability of the models through a linear mapping of the latent space variables to straightforward “bridge variables” for most models. It also shows that the models are capable of identifying relevant properties from sequences alone, despite being trained only with a binary property predictor.
We may further employ the bridge variables in conjunction with the probability of AMP generation in a single PC (Fig. 6) to aid in the interpretability of the models. In our case models demonstrating a monotonic linear increase in probability, especially those reaching a prediction probability of >0.6 (128-dimensional AAE, 64-dimensional RNN, 64-dimensional RNN-Attention, 64 and 128-dimensional Transformer and 32 and 64-dimensional WAE models) are arguably the easiest to interpret, since we now essentially have a single linear mapping from the latent space to AMP probability, which for the RNNs and WAEs are comparatively non-distorted from the original space (cf. Sec. 3.4).
We demonstrate a trade-off between model complexity and model interpretability under this paradigm and suggest that for optimized design of AMPs in a continuous latent space, it may be appropriate to perform the optimization in multiple different latent spaces, using a similar philosophy to that of ensemble voting. We do a short case study to show one way this might be implemented, and indeed, in future work we plan to use this as a starting point for an active learning procedure to traverse these spaces and perform multi-scale molecular dynamics simulations upon relevant points.
In the future, we plan to investigate a phenomenon termed selective latent memories due to Kullback–Leibler divergence constraints. This effect is observed during training and causes a drop in the entropy of certain latent dimensions when the KLD is minimized. In addition, we will generalize from a binary AMP/non-AMP classifier to a multi-class predictor capable of grouping sequences by expected mechanism of action. Finally, as a prelude to the previously mentioned active learning traversal of the space, we plan to investigate the incorporation of structural data into the models, perhaps leveraging the recent success of AlphaFold2 (ref. 14) and similar structural prediction algorithms.
The models developed in this research used deep learning to discover embeddings for sequences of amino acids, but future work should investigate other peptide representations, such as protein structure distance graphs, which can embed structural information, and the SMILES strings used in the world of small-molecule drug design. SMILES strings encode chemical information into a sequence of characters, thus allowing the models to learn chemical distinctions between the amino acids.
Overall, we have performed a thorough qualitative and quantitative analysis of the latent spaces of five different generative models for AMP design, identifying strengths and weaknesses of each, as well as developing a suite of analysis methods for latent space design and sampling in the context of generative deep learning of AMP sequences. We provide a much-needed set of benchmarking protocols in this nascent area of research.
Footnote
† Electronic supplementary information (ESI) available: With supplementary figures and tables. See DOI: https://doi.org/10.1039/d2dd00091a