Liwei
Chang
a,
Alberto
Perez
*ab and
Ramón Alain
Miranda-Quintana
*ab
aDepartment of Chemistry, University of Florida, Gainesville, FL 32611, USA. E-mail: quintana@chem.ufl.edu; perez@chem.ufl.edu
bQuantum Theory Project, University of Florida, Gainesville, FL 32611, USA
First published on 29th November 2021
We present new algorithms to classify structural ensembles of macromolecules based on the recently proposed extended similarity measures. Molecular dynamics provides a wealth of structural information on systems of biological interest. As computer power increases, we capture larger ensembles and larger conformational transitions between states. Typically, structural clustering provides the statistical mechanics treatment of the system to identify relevant biological states. The key advantage of our approach is that the newly introduced extended similarity indices reduce the computational complexity of assessing the similarity of a set of structures from O(N2) to O(N). Here we take advantage of this favorable cost to develop several highly efficient techniques, including a linear-scaling algorithm to determine the medoid of a set (which we effectively use to select the most representative structure of a cluster). Moreover, we use our extended similarity indices as a linkage criterion in a novel hierarchical agglomerative clustering algorithm. We apply these new metrics to analyze the ensembles of several systems of biological interest such as folding and binding of macromolecules (peptide, protein, DNA–protein). In particular, we design a new workflow that is capable of identifying the most important conformations contributing to the protein folding process. We show excellent performance in the resulting clusters (surpassing traditional linkage criteria), along with faster performance and an efficient cost-function to identify when to merge clusters.
Thanks to the improved supercomputer architecture and software development, these tools can now routinely simulate events well into microseconds,2,3 bringing insight into a plethora of biological processes. Events that require longer timescales, such as binding or folding, rely on advanced sampling techniques that preserve the thermodynamics and in some cases the kinetics as well.4–9 Simulations produce ensembles of structures which require a statistical mechanics treatment to identify relevant states of interest from the computational results.
Most often, we are interested in identifying the metastable states in the system, and the microstates that belong to them. Traditionally, this is done by using some similarity measure between conformations (often a 2D-RMSD calculation).10 An alternative approach relies on projecting trajectories onto a set of features, and then using some dimensionality reduction approaches (PCA or tICA).11 In both cases, a clustering method classifies structures into different states—with the largest population state corresponding to the lowest free energy basin. In the case of folding or binding predictions, the centroid of the most populated cluster is compared against experimental structures. However, consideration of larger time- and length-scales inevitably means an explosive growth in the number of structures/conformations that must be analysed. This means that developing more efficient data analysis techniques is paramount to study more realistic biological systems.
Here we propose a novel take on this problem by applying our recently developed extended similarity indices12,13 to the study of conformational ensembles. These indices generalize many of the most popular similarity measures commonly used in cheminformatic studies, but they have a key difference: while standard indices can only quantify the similarity between two objects (e.g., they are binary functions),14,15 our indices can calculate the similarity between an arbitrary number of objects at the same time (e.g., they are n-ary functions). A key advantage of this approach is that we can perform the similarity calculations with unprecedented efficiency:16 calculating the similarity of N objects using binary indices scales as O(N2), while the n-ary indices scale as O(N). Moreover, the truly global nature of our extended indices provides a more natural description of the similarity of a group of objects, which results in better estimates of set compactness. We recently showed how to compute the n-ary indices for sets of vectors formed by an arbitrary number of categorical characters17; however, here we will apply the original formalism that takes binary vectors (bitstrings) as input. This not only leads to more efficient calculations, but also provides a more convenient way to encode the 3D structure of the molecules studied via contact maps (e.g., we can represent a conformation as a vector of 0s and 1s, where a contact between two residues can be considered a 1 and its absence a 0).
We showcase the performance of our framework in several datasets relevant to an array of computational biology problems ranging from protein folding to binding events between proteins and other macromolecules. We show the accuracy and reliability of the method as well as the agreement of predicted states with previous studies. The analysis of the protein folding landscape is particularly remarkable, since we are able to determine the most representative structures contributing to this process with minimum computational cost. Additionally, it is important to highlight that many of the tools that we develop here can have far-reaching applications in data analysis. For instance, we present a linear scaling algorithm to find the medoid of a set. We show how the extended similarity indices can be used as novel linkage criteria in hierarchical clustering algorithms, with results that are consistently more robust than those obtained with other linkages. It is also noteworthy that extended similarity-based clustering provides a natural and efficient way to estimate the number of clusters present in a dataset, an important hyperparameter that is often quite difficult to determine. Overall, our approach shows great promise at the time of analysing large and complex datasets, which can be easily generalized towards on-the-fly similarity comparisons due to the trivial cost of increasing the ensemble size.
System | Selected residue | Cutoff (nm) | Ensemble size | #Features |
---|---|---|---|---|
NTL9 (2HBA) | 1–39* | 1.0 (Cα) | 5000 | 380 |
α3D (2A3D) | 1–73* | 1.0 (Cα) | 5000 | 1332 |
NuG2 (1MI0) | 1–56* | 1.0 (Cα) | 5000 | 784 |
PDIQ-MDM2 | 1–85*:86–97 | 0.7 (heavy) | 5000 | 516 |
P53-MDM2 | 1–85*:86–100 | 0.7 (heavy) | 5000 | 645 |
ATSP7041-MDM2 | 1–85*:86–101 | 0.7 (heavy) | 5000 | 688 |
1NS1 (homodimer) | 1–742:75–146 | 0.8 (heavy) | 3500 | 2664 |
2MMA (heterodimer) | 1–107:108–187* | 0.8 (heavy) | 3500 | 4280 |
Protein G (3GB1) | 1–56 | 0.6 (heavy) | 1923 | 1540 |
Protein-DNA (1DH3) | 1–27,57–81:111–152 | 0.6 (heavy) | 1516 | 2132 |
Extended similarity calculations: The extended similarity indices were calculated using the fractional weight function, with minimum possible coincidence threshold in each case (e.g., for N datapoints, the minimum coincidence threshold is N mod 2). The code to perform the extended similarity calculations can be found in: https://github.com/ramirandaq/MultipleComparisons. A short example illustrating the calculation of the extended Russell–Rao index can be found in Appendix I.
The rationale behind this approach is particularly apparent in our present case when we are representing the different protein conformations via contact maps. First of all, this provides a natural definition for Sij, which we can take as counting the number of coinciding “on” bits in the binary vectors of i and j. It is important to remark that (contrary to some similarity indices commonly used in cheminformatics) we must ignore the coincidence of 0s. While the coincidence of 1s means that a given contact is present in two frames, the coincidence of 0s does not mean that two frames are similar, because in one structure the two residues could be at 10 Å and in another at 20 Å. Notice then that the bitstring of the folded state (by virtue of being compact) should have the maximum number of “on” bits. This, together with the fact that we expect to have many conformations that are close to the native structure, explains why Gi should be a maximum for this conformation. We can test this hypothesis by seeing how the group similarity is correlated with the RMSD with respect to the native structure. In Fig. 1, we show the corresponding plots for two types of systems, showing that indeed Gi can be used to determine the bound state. (The corresponding figures for other systems are shown in the ESI.†)
However, despite the promise of this approach, the fact that it hinges on binary similarities means that it has a fundamental drawback: if we have N structures, determining the native configuration will scale as O(N2), since we need to compute the full similarity matrix [Sij]. This is once again in line with the medoid calculation problem, for which most algorithms scale quadratically.
An alternative route to bypass this issue would be to use n-ary similarity indices. For instance, given a set of N conformations: C = {C1, C2,…,CN}, we can calculate the complementary similarity of any element, (Ci), as the extended similarity, Se, of the original set without the corresponding element. That is:
(Ci) = Se(C/{Ci}) | (1) |
At this point, we can realize that the structure with the highest group similarity also has to correspond to the one with the lowest complementary similarity. The relationship between these two indicators can be seen in Fig. 2, where we show how Gi is almost perfectly correlated (e.g., is consistent)24,25 with the (Ci) calculated using the extended (non-weighted) Russell–Rao (RR) index.
Fig. 2 Correlation between group similarity and extended similarity using the non-weighted Russell–Rao index for NTL9 (left) and A3D (right). |
In other words, we can also use (Ci) to locate the most populated state. The (very) attractive advantage of using this method, however, is that it only requires O(N) operations. The key insight is that given a set of bitstrings (like the contact maps that we are working with), calculating their n-ary similarity only requires the column-wise sum of the matrix formed by these binary vectors. With this, we can propose the following algorithm:
(1) Given a set C = {C1, C2,…,CN}, calculate the column-wise sum of all the contact maps: .
(2) For all conformations, subtract the corresponding bitstring from the total column-wise sum: .
(3) Use these vectors to calculate the complementary similarity: (Ci).
(4) Select the structure with the minimum value of (Ci).
The vector calculated in the 2nd step is all we need to calculate any n-ary index over the set C/{Ci}. Finally, note how the time-consuming steps (1st and 2nd) scale as O(N), resulting in an unprecedentedly efficient (linear) algorithm to estimate the medoid of a set of binary vectors.
(1) We begin by assigning all the objects to separate, disjoint, clusters.
(2) If in the kth step we have clusters , we proceed to the (k + 1)th iteration by combining the clusters a and b that give the maximum value of the extended similarity .
This is nothing but a natural alternative to the classical HAC algorithms, where we use the n-ary indices as a linkage criterion. However, there are some important differences with previous methods. For instance, popular single- and complete-linkage criteria measure the relationship between clusters in a simple way, using a single point-to-point distance to determine the degree of similarity between clusters. While intuitive, this is a manifestly local approach, in the sense that much of the clusters’ structure is ultimately ignored. On the other hand, by using n-ary indices we take into account all the elements at the same time, so this has the potential to provide a better measure of inter-cluster distance.
We tested this hypothesis on two different datasets on protein-DNA binding and protein G folding. For each system, we generate six distinct clusters from MELD simulations and one noisy set to evaluate the robustness of each linkage criterion. Simulation details are shown in Fig. S2 (ESI†) along with pairwise RMSD plots and representative structures for each cluster. The quality of clustering is obtained using the V-measure, in order to assess both cluster completeness and homogeneity (Fig. 3).26
As expected, using n-ary indices provides a more robust clustering criterion than the local measures. This is particularly evident for the single linkage, where not only we observe the lowest value of the V-measure, but also in the system protein G we can see a great dispersion of the possible values of this indicator. Noticeably, the average linkage also results in intermediate-to-poor V-measure values. This is interesting because it shows the limitations of the pairwise similarity metrics, because even when we try to approximate a “global” description by averaging all the pairwise distances, this is still not enough to fully capture the overall relationship between the clusters. On the other hand, the n-ary indices succeed on this task. We can corroborate this by seeing how the extended RR index consistently provides V values with small dispersion that are close to 1, similar to those obtained using the Ward criterion. This implies that the effect of maximizing the n-ary indices is similar to minimizing the variance of the data. This is in line with previous observations noting that our indices provide a robust description of set compactness. Notice, however, that the good results obtained through the Ward linkage are related to the way in which we constructed the dataset, since we used pre-selected clusters with small variance, which is directly related to Ward's objective function. A more realistic example, using a more realistic dataset with high variance, will be discussed in the next section.
Finally, another crucial advantage of the extended similarity-based HAC is that it gives us a convenient way to estimate the optimal number clusters in the data. The key here is that we can define the “cost function”, ΔS, to evaluate the cost of merging two clusters based on the change in similarity. For example, if we combine clusters a and b in the kth step, the associated cost will be given by:
(2) |
This is a very convenient expression, since we have to calculate the extended similarities anyways, so we do not need to devote any extra computational power to this task (as opposed to other methods to compute the number of clusters, like X-means clustering and information theoretic approaches).27–29 Then, whenever we see a steep variation in ΔS, we go from one iteration to the next. This means that we should not have combined those clusters. We illustrate this behaviour in Fig. 4, where, unsurprisingly, we see that in the last stages of the clustering process the cost increases significantly.
(1) Given a set of structures, find the one that is closest to the reference state. (e.g., by finding the medoid, as detailed above).
(2) For all the remaining conformations, calculate their similarity with respect to the reference state.
(3) Separate all the structures in the subsets according to their similarity with respect to the reference state.
(4) Cluster the structures in each of the previous subsets.
Overall, this novel approach of the analysis of protein folding is extremely efficient. As we have already shown, finding the native structure scales as O(N), which is also the scaling of the 2nd step. Finally, notice that we greatly reduce the burden of a traditional clustering algorithm by effectively dividing the data into “bins”, which are then studied individually. In this regard, step 3 effectively serves as a data-reduction technique. The idea of selecting a reduced set of structures (in this case, one) as a reference to compare with others in order to describe a larger set is reminiscent of the chemical satellite approach used to visualize the chemical space.30,31 However, in our case, instead of selecting an outlier, we use the medoid of the set (e.g., its most representative point) as a reference to “navigate” through the rest of the conformations.
We applied this workflow to the NuG2 dataset (see Fig. 5), where first we selected 5000 random samples from the full dataset, before proceeding through steps 1–4. In this particular case, we separated the data into four subsets in the 3rd step: folded (0.06 > csim), intermediate (0.12 > csim ≥ 0.06), pre-folded (0.16 > csim ≥ 0.12), and random coil (csim ≥ 0.16), where csim stands for the complementary similarity of each conformation. Given the homogeneity of the folded subset and, conversely, the extremely disorganized nature of the random coil conformations, we only performed clustering on the intermediate and pre-folded states.
Fig. 5 Hierarchical protein folding landscape of NuG2. After calculating the complementary similarity and binary similarity (see Fig. S5, ESI†), the data set is divided into four subsets: folded (blue), intermediate (orange), pre-folded (green) and random coil (red). Hierarchical clustering with the extended similarity linkage is performed on pre-folded and intermediate states with representatives in highly populated clusters shown at the bottom. |
In the clustering of biological ensemble datasets (and other data science applications), besides the evaluation of the centroid, a stricter assessment of the clustering process is the calculation of the relative population of different clusters. This is particularly relevant for the most populated states, which should manifest the dominating conformations of interest (e.g., in this case, intermediates in the folding process). Clustering using extended similarity indices is also superior to the standard linkage criteria in this regard when we perform clustering over the intermediate state datasets (e.g., step 4 in the previous workflow). As shown in Fig. 6, and similarly to the discussion in the previous section, single and average linkages completely failed to identify different clusters in this state, as expected. On the other hand, the complete and Ward linkages tend to create one big cluster with high variance. In particular, as the result of its objective function, the Ward linkage prefers to make the high-variance cluster more populated, likely due to the penalty to merge with the small cluster with little variance (see Fig. S4, ESI†), which leads to an erroneous representation of the ensemble's landscape. By comparing with the latest analysis of the same NuG2 folding dataset, it is remarkable that all clusters derived here correspond well with essential states along two distinct folding pathways from relaxation mode analysis (see Fig. S6, ESI†),32 using both structural and time information: after the early formation of N-terminal β-sheets, intermediate states I(a) and I(c) are identified as the most populated clusters before finally folding to the native state. In addition, I(d) was identified that corresponds to a near-native state containing a two-residue register shift in the N-terminal hairpin, which was studied further by Schwantes et al.33 In this way, the unfolded clusters have few stabilizing interactions and are internally quite diverse, the folded states are characterized by many stable internal interactions (e.g. hydrogen bonding) leading to a narrow ensemble of structures (lower entropy). In a nutshell, our strategy of pre-classifying all conformations along a density-like similarity coordinate and then performing further clustering with the extended similarity index in the regions of interest provides a robust and highly efficient way to analyse challenging biological ensembles.
(3) |
The first step is to calculate the sum of each column:
(4) |
Based on these sum values, we decide whether each of the columns can be identified as a similarity counter given by the coincidence of ‘on’ bits. To do this, we need to define a coincidence threshold, γ, which in this work we take to be equal to:
γ = nmod2 | (5) |
We then apply the following criterion: every column such that its sum, σk, satisfies the inequality:
2σk − n > γ | (6) |
Finally, the extended non-weighted Russell–Rao index, SeRR, will be given by:
(7) |
(8) |
(9) |
All of these complementary vectors correspond to the comparison of n = 4 fingerprints, so in this case, γ = nmod2 = 0. This is all we need to obtain the complementary similarity values:
(10) |
The fingerprint with the lowest complementarity similarity, F4, is the medoid of the set.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d1cp04019g |
This journal is © the Owner Societies 2022 |