Evan
Seitz
ab,
Joachim
Frank
*ab and
Peter
Schwander
*c
aDepartment of Biochemistry and Molecular Biophysics, Columbia University Medical Center, New York, NY 10032, USA. E-mail: ees2192@columbia.edu; jf2192@cumc.columbia.edu
bDepartment of Biological Sciences, Columbia University, New York, NY 10027, USA
cDepartment of Physics, University of Wisconsin-Milwaukee, Milwaukee, WI 53211, USA. E-mail: pschwan@uwm.edu
First published on 6th April 2023
ManifoldEM is an established method of geometric machine learning developed to extract information on conformational motions of molecules from their projections obtained by cryogenic electron microscopy (cryo-EM). In a previous work, in-depth analysis of the properties of manifolds obtained for simulated ground-truth data from molecules exhibiting domain motions has led to improvements of this method, as demonstrated in selected applications of single-particle cryo-EM. In the present work this analysis has been extended to investigate the properties of manifolds constructed by embedding data from synthetic models represented by atomic coordinates in motion, or three-dimensional density maps from biophysical experiments other than single-particle cryo-EM, with extensions to cryo-electron tomography and single-particle imaging with a X-ray free-electron laser. Our theoretical analysis revealed interesting relationships between all these manifolds, which can be exploited in future work.
From the numbers of sightings of the observed states in this low-dimensional space, a free-energy landscape may be obtained following a fundamental relationship of statistical mechanics.4,6,9 Together this provides a complete mapping of the system's state space while articulating its energetics topographically in the form of sprawling hills and valleys.10,11 From such a landscape, a minimum-energy path can be derived representing the most probable sequence of transitions taken by the molecular machine between any two states.12,13 And along this path, a sequence of 3D structures can be extracted for biophysical analysis, allowing the basis for molecular function to be elucidated. The ability to experimentally determine energy landscapes, in conjunction with 3D conformational movies of a molecular machine along distinct low-energy paths, opens a new horizon in structural biology and, by extension, in molecular medicine.
In general, obtaining this desired information is a difficult task, and numerous methods have been developed over the last decade to tackle this problem.5,14–16 The ManifoldEM method5,17 we have adopted is based on an unsupervised geometric machine learning approach using manifold embedding to recover the distribution and occupancies‡ of states. Its viability has been demonstrated in its application to four experimental systems—the ribosome,5 ryanodine receptor (RyR1),17 vacuolar ATPase,18 and SARS-CoV-2 spike protein.19 In each of these studies, conformational states of the respective macromolecular complex were characterized by different spatial constellations of its relatively rigid domains, and organized in a state space according to the continuous motions of each domain along a unique coordinate.
Recently, we devised a protocol using ensembles of simulated macromolecular complexes to interpret the conformational manifolds learned by two prominent dimensionality-reduction methods: principal component analysis20 (PCA) and diffusion maps21 (DM, used in ManifoldEM), with a strong focus on the latter. With the knowledge gained from this analysis, we were able to introduce several advancements to the ManifoldEM approach in a novel method called ESPER (Embedded Subspace Partitioning and Eigenfunction Realignment), which we validated using experimental data of the ribosome and RyR1 obtained via cryo-EM.22 Altogether, ESPER constitutes a substantial improvement of the original ManifoldEM framework that advances the ability to accurately and efficiently construct the free-energy landscape of macromolecular complexes from cryo-EM data.22 Our post hoc interpretation strategy, however, was only briefly summarized in that work.
The current article brings our interpretation strategy center stage to detail the analysis that proved instrumental in edifying our understanding of these complex manifold geometries. In contrast to the scope of our companion article,22 which deals solely with the properties of 2D projections of macromolecular conformers, we also now focus on their 3D structures represented either by electron density maps or sets of atomic-coordinate structures. Ultimately, we embark on an in-depth analysis of different types of synthetic-continuum datasets and their embeddings, with each “datatype” modeled so as to emulate the output of a different experiment from the same conformations of a molecular machine.
Importantly, the insights gained from our analysis of these datatypes are not limited to the field of cryo-EM, but also apply to other biophysical methods supplying information on conformational continua. In order to put the properties of these datatypes into the context of established theories, we further frame our results in comparison to a simple approximative mathematical model and derive useful explicit relationships. In the process, many striking similarities emerge between the results from various datatypes, and when differences do emerge, the information is revelatory.
Through this analysis, we are able to demonstrate the effect of different choices of continuum datatypes on the performance of geometric machine learning, which has significant implications for the reconstruction of complex information obtained from a range of widely used experimental methods. Given these results, we are able to highlight the power of manifold embedding in elucidating important biophysical properties—provided that minimum requirements are met in the data—while describing several key challenges and limitations that must be considered when analyzing datasets with multiple types of statistical uncertainties. Finally, informed by these implications, we provide a general outlook on the future of these methods within the scope of cryo-EM and beyond.
These Mn structures were then transformed into simulated 3D electron density|| maps (EDM) with a resolution of 3 Å, where each atom is represented by a 3D Gaussian with radius defined by the corresponding number of electrons.25 Using parallel line integrals along the direction of the electron beam, 2D projections of each EDM were obtained to simulate weak-phase contrast images as generated in a transmission electron microscope (TEM) operated in bright-field mode.3 These projections were generated for evenly-spaced viewing angles across the entire angular space, so that a full set of images depicting Mn conformational states of the molecule was available for each projection direction (PD). Each projection was further modified by application of the contrast transfer function (CTF), duplication to create multiple sightings per conformational state, and addition of noise to simulate realistic images of a cryo-EM experiment.
Selected ensembles of such simulated single-particle cryo-EM images were then independently embedded in a low-dimensional space—forming a manifold of intrinsic dimension n—and the resulting spectral properties26 of each embedding were analyzed and compared. Using first pristine (i.e., noiseless and aberration-free) synthetic data, we observed surprising patterns in the geometries of these embeddings. For instance, each PD-manifold followed a clearly-defined high-dimensional parabolic surface, but also included eigenfunction harmonics and other unexpected aberrations. The same geometric features were present for datasets with additive noise applied to the images, with the fidelity of these features to the pristine case dependent on the signal-to-noise ratio (SNR), the number of times τ images were generated for each state Mn (totaling Nn = τMn), and introduction of CTF.
But it was only through a much deeper investigation of the manifolds of atomic-coordinate structures and 3D electron density maps of the same conformer—which we aim to present in this article—that a complete understanding of these manifolds was obtained. Using this knowledge, we were able to determine the exact form of the eigenfunctions in each PD-manifold, including identification of each set of harmonics and high-dimensional manifold rotations.
Guided by results of these ground-truth studies on both pristine and noisy data encompassing up to three degrees of freedom, the compendium of our heuristic findings provided new insights into the origin of longstanding ManifoldEM problems, leading to the development of the ESPER method for correcting them. Along with several novel operations and refinements to the preexisting approach, the ESPER method provides a thorough rationale for properly handling the n-dimensional PD-manifold embeddings in the presence of experimentally-relevant noise, CTF and several degrees of freedom to accurately generate the free-energy landscape of a molecular machine as well as 3D movies depicting its function.
As an outline of our exposition, in Section 2.1, we use the DM method to investigate the known eigenfunctions of the LBO on the interval and rectangular domains and compare these results to the manifolds formed by a quasi-continuum of atomic-coordinate structures. Following this analysis, we detail how the structure of manifolds Ω obtained from a conformational state space transforms as the data type is changed successively from atomic-coordinate structures (ΩACS) in Section 2.2, to 3D density maps (ΩEDM) in Section 2.3, and finally to 2D projections (ΩPD) of those maps in Section 2.4. A schematic of this framework is provided in Fig. 1, showing how the Hsp90 ground-truth data are innately represented by each of these distinct datatypes.
Fig. 1 Outline of our framework for generating continuum data composed in three distinct datatypes: 3D atomic-coordinate structures (top); 3D electron density maps (middle); and 2D projections of 3D electron density maps (bottom). In the left-hand column are the respective appearance of these datatypes for the Hsp90 continuum models, showing the type and general range of motion for two conformational motions (CM1 and CM2). On the right is a corresponding schematic, showing how the continuum models are represented by the respective datatype. Specifically, a collection of 3D coordinates (top) defines the position of each atom, while a set of 3D voxels (middle)—or pixels (bottom)—defines the electron density|| within each cubic—or its projection in a square—region in space (to note, our schematic showing the projection operation along a principal coordinate is strictly for conceptual aid; in application, projections occur along any direction). |
d(a, b) = ‖a − b‖ = [(a1 − b1)2 + ⋯ (an − bn)2]1/2 | (1) |
At first glance, such a latent space seems to have little relation to the relatively complex synthetic continuum dataset of Hsp90 conformers. However, this abstraction is motivated by the representation of our ground-truth state space of atomic models, where the relationship between equispaced coordinates in the latent space matches the relationship between equiangular molecular-domain rotations. By embedding the data of the atomic-coordinate structures of Hsp90 conformers occupying SSn and comparing the resulting eigenvectors to those obtained from the embedding of the data of the latent space Xn, we will show in Section 2.2 that the results are nearly identical. We will then demonstrate in Section 2.3 and 2.4 that for the embeddings of data from the 3D electron density maps as well as their 2D projections, the relation relative to the latent space becomes distorted, which can be explained by a change of the metric in the process of switching from one data type to the other.
For the regime of small Gaussian bandwidths, a cosine series emerged for all eigenvectors (Fig. 2A), in very good agreement with the Laplacian eigenfunctions on a 1D Euclidean interval with Neumann boundary conditions. Specifically, we anticipate and retrieve canonical eigenfunctions28 of the form . As the Gaussian bandwidth was incrementally increased from ε↓ to ε↑, this cosine series smoothly transformed into a different complete, orthogonal set: the Legendre polynomials30 (Fig. 2B). However, we note that these polynomials only occur for boundary conditions of hyperrectangles, which are n-dimensional Cartesian products of orthogonal intervals.
(2) |
It is important to take notice that, alongside leading eigenfunctions in the first two rows of each subplot in Fig. 2, the leading composites of these eigenfunctions {Ψi × Ψj | i < j} have also been plotted in the rows that remain, with each composite forming a unique 2D subspace. Mathematically, each such mapping to a 2D subspace is the restriction to the N-dimensional embedding of the projection of onto ; given by {Ψ1 × Ψ2 × … × ΨN} → {Ψi × Ψj} (for expediency, we will use the term subspace to specifically refer to a subspace of an embedded manifold). Of interest, among the available subspaces, a leading parabolic trajectory exists for each degree of freedom present; for example, {Ψ1 × Ψ4} and {Ψ2 × Ψ7} in Fig. 2C, which correspond to the sequence of states along Y1 and X1, respectively. While less significant for the scope of the current section, the study and use of these 2D subspaces will be crucial in the sections to come, especially when dealing with experimental data, which we describe below.
Specifically, as the points in an experimental dataset naturally arrive in unordered sequence, one would have to properly sort the dataset indices to recognize the sinusoids shown in Fig. 2A and C; here, for example, there would be 50! sequences to consider. In the application, even if the ground-truth ordering was obtained, then in the presence of duplicate states (which we anticipate in an experiment), each sinusoid would be irregularly stretched along the horizontal axis where those duplicate states occurred, forming an unwieldy distorted sinusoidal form. However, as the points in each eigenvector are always scrambled in the same order for all eigenvectors, the composite of any two will always appear in a readily identifiable form. For example, as seen in Fig. 2A, a subset of the canonical Lissajous curves31 emerges across the 2D subspaces of each ΩPD, with the curves in this set having the form
Lp,q = {cos(pπx) × cos(qπx)|0 ≤ x ≤ 1} | (3) |
For these composites, we found that information pertaining to the given degree of freedom is portrayed most simply (without overlap) along a specific subset of L, here as seen across the set of 2D subspaces defined in pairwise combination with the leading eigenvector; e.g., {(Ψ1 × Ψ2), (Ψ1 × Ψ3), …, (Ψ1 × Ψz)}, where z is the index of the smallest non-zero eigenvalue. Specifically, this subset Tk ∈ L corresponds to the known Chebyshev polynomials of the first kind,30 of which we observed that the parabolic form is the lowest-order member present in each ΩPD embedding.
Given their significance, these 2D subspaces have several important properties worth highlighting for their eventual use (or avoidance) during interpretation. First, note that for each sinusoidal subplot in Fig. 2A, points are equispaced along the horizontal axis in correspondence with the equidistance between points in the corresponding latent space. However, as a result of taking the Cartesian product of sinusoidal eigenfunctions, only non-uniform spatial relationships exist between neighbouring states in each Lp,q; a relationship described by a non-isometric mapping29 where distances in a domain are not preserved in its codomain. As shown in Fig. 3, the spacing between points in L1,2, which is the composite of two such sinusoids, has an intrinsically nonuniform distribution. The density of points is similarly arranged as seen in the corresponding point clouds. For general reference, we denote this aspect with the term nonuniform rates of change.
Next for consideration, as seen for example in Fig. 2A, there exist several parabolic trajectories scattered throughout the 2D subspaces. However, only the first of these parabolas (here, {Ψ1 × Ψ2}) describes the full extent of the variational information present monotonically, while all other trailing parabolas (such as in the {Ψ2 × Ψ4} subspace shown) display a non-monotonic signal. As an analytical demonstration, Fig. 3 shows that the first three such parabolas can be generated via L1,2, L2,4 and L3,6. The latter two repeat the conformational information twice and three times, respectively, within one span of the parabolic trajectory.
As a consequence, only the mapping from the sinusoids to the first parabola in this set is bijective (injective and surjective),29 while all other mappings to higher-order parabolas are non-injective surjections. Importantly, since the Cartesian product of continuous functions is continuous, and so are projections from product spaces, this bijection further meets the requirements of a homeomorphism: a bijective correspondence that preserves the topological structures involved.29 We denote the higher-order parabolas (formed via the non-injective surjections) as parabolic harmonics, which do not preserve topological structure and must be avoided when determining a given degree of freedom; a problem that becomes more challenging as more degrees of freedom are added to the system.
In general, the presence of these patterns extends into latent n-spaces with n > 1. For every degree of freedom present in a state space SSn, there exists a corresponding set of Lissajous curves interspersed across specific {Ψi × Ψj} projections of the corresponding embedding. Specifically, in the case of Fig. 2C, independently projecting the data for SS2 onto the planes spanned by its {Ψ1 × Ψi} and {Ψ2 × Ψj} combinations (where i > 1; j > 2) reveal a unique set of Chebyshev polynomials, with the sequence of points along these trajectories corresponding to CM1 and CM2. For convenience, a set of Chebyshev polynomials30 corresponding to a given CM will be referred to as its conformational modes). Thus, even though the knowledge required to view these sinusoids is unavailable outside of ground-truth studies, one can rely on their existence—via the composites of carefully chosen eigenvectors—to elucidate conformational type and order.
With these details set aside, we next return to the smaller of the two Gaussian bandwidth regimes, in order to compare the previous nondegenerate rectangular results to those from a degenerate square domain, with N = 50 × 50 points equispaced identically along X and Y (Fig. 4A). Due to the presence of degenerate eigenvalues, which can arise for domains with a rational ratio , we encounter pairs of eigenfunctions that appear different from the nondegenerate case of the rectangle.28 This can be seen, for example, by eigenvector pairs {Ψ1, Ψ2} and {Ψ4, Ψ5} in Fig. 4A. In Fig. 4C, we illustrate that these eigenfunctions are just rotated within their degenerate space, exactly as expected. We note that an eigenfunction associated with a degenerate eigenvalue is a linear combination of the degenerate eigenfunctions,28 where the normalization of the eigenfunctions restricts this linear transformation to a rotation and reflection (i.e., the group of orthogonal transformations). For example, the {Ψ1, Ψ2} eigenvector pair is of form Ψ′ = RTΨ such that
(4) |
Fig. 4 DM eigenvectors of the degenerate latent 2-space. DM eigenvectors of the N = 50 × 50 square domain for small Gaussian bandwidth (ε↓ = 5 × 10−5) are shown in (a) and (b) before and after high-dimensional rotations, respectively. It can be seen here that pairs of eigenfunctions exist that contain relationships aberrant to the canonical eigenfunction form seen in Fig. 2C. Two such pairs have been highlighted in red and blue, respectively, with the members of each pair always rotated 90° apart. To note, as any rotation can happen in the presence of degeneracy, this initial rotation is an arbitrary one. We demonstrate this property via the schematic in (c), which shows the angular relationship between two analytically-generated functions (cos(πx) and cos(πy), each displayed in the reference frame of states in X1) as they are jointly rotated 90°. By applying rotation operators R1,2(θ) = 45° and R4,5(θ) = 45° independently to two such aberrant pairs in (a), the canonical eigenfunction form begins to recover in (b), and more so as additional operators are appropriately applied. |
As seen for eigenvector Ψ6 in Fig. 4A, these summands can also have the form of two products: Ψ6 = β1cos(πx) cos(2πy) + β2cos(2πx) cos(πy), with any β1 and β2 such that β12 + β22 ≠ 0. Hence, it can be seen that these aberrant eigenfunction pairs are defined by an admixture of cosines in a higher-dimensional space, with form
Ψi = β1cos(vπx) cos(wπy) + β2cos(wπx) cos(vπy) = β1ψvw + β2ψwv | (5) |
(6) |
While degeneracy is a coincidence which can be directly identified from the eigenvalue spectrum, a similarly rotated appearance (i.e., eigenfunction misalignment) will later turn up during our investigation of PD manifolds. Pairs of misaligned eigenfunctions, at least approximately, can also appear when domains have undergone certain elementary geometric transformations. For example, by performing an affine transformation on a rectangle ΩR to form a parallelogram ΩP, we observed a rotation of the first two eigenvectors, as similarly seen in Fig. 4A. Recall that an affine mapping preserves collinearity and ratios of distances, but in general not distances and angles.32 In Section 2.3, we will explore the possibility of other classes of transformations.
As a final point in this section, we illustrate our method for retrieving the canonical eigenfunctions buried within an embedding, which has been used in Fig. 2 and 4, and extensively throughout the remainder of this work. Fig. 5 provides a schematic using the known analytical eigenfunctions chosen so as to match the results from DM on the square (degenerate) ΩR. As shown, a given sequence corresponding to the ground-truth arrangement of points along each degree of freedom (here X1 or Y1) captures the eigenfunction on a projected plane in the n-dimensional space where it resides.
Fig. 5 Intuition for sequential ordering of eigenfunctions based on ground truth. Subplots are analytically generated so as to match the appearance of Ψ1 in Fig. 4A. For this presentation, the equation Ψ1 = cos(θ) cos(πx) + sin(θ) cos(πy) was used with θ = 45°. As seen in the largest subplot, the eigenfunction exists in an n-dimensional space defined by the n degrees of freedom of the system. By displaying points in sequence corresponding to a known degree of freedom, here X or Y, we are effectively viewing each eigenfunction on a projected plane in its n-dimensional space. |
(7) |
The resulting DM embeddings for the small and large Gaussian bandwidth regimes, which are shown in Fig. 6A and 6B, respectively, share a strong resemblance with those found for the latent space (Fig. 2). Again, we note the presence of cosine eigenfunctions for the small Gaussian bandwidth regime, and a nearly-perfect linear form (via leading Legendre polynomials) in the large Gaussian bandwidth regime; i.e., {Ψ1, Ψ2} in Fig. 6B. For the latter, we will show that such a linear form cannot be obtained in the other data types, 3D EDMs and 2D PDs, to be explored. In the small Gaussian bandwidth regime, we can identify both CM1 and CM2 parabolas residing in the subspaces {Ψ1 × Ψ3} and {Ψ2 × Ψ8}, respectively. Similar results—albeit for different dimensions—were found for the ΩACS embeddings from SS1 and SS3.
Briefly, we note that the results of PCA on this same dataset most closely resembles the geometries observed for DM in the large Gaussian bandwidth regime (in this instance, Legendre-like), with a slightly less uniform distribution. This suggests that DM better approximates intrinsic relationships in the data, as expected. A general similarity between the geometries observed using PCA and DM in the ε↑ regime will continue to manifest throughout the remaining sections of this article.
The striking similarity between the eigenfunctions of the latent space and the eigenfunctions of the atomic models can be rationalized as follows. Provided that the range of a single-body rotation is moderate (≲30°), the distance Dij between any two states i and j within this range is to a good approximation , where rk is the distance of atom k away from the rotation axis, m the number of atoms of the body, and the angular difference between the states. Therefore, Dij is to a good approximation directly proportional to . If there are multiple independent body rotations (i.e., CMs) present, the individual distances add in quadrature as in a Euclidean space. While not further discussed in this article, the linearity also holds for rigid body translations, where the distance is precisely proportional to the magnitude of the translation. Thus, the agreement between the eigenfunctions of the latent space and the ones of the atomic models is a direct consequence of the approximately linear relationship between distance and extents of multi-body motions (i.e., moderate rotations and translations of any magnitude).
However, there are a few attributes to consider that distinguish the manifolds obtained for EDMs from those retrieved for the previous data types. First, the difference between small and large Gaussian bandwidth regimes is much less drastic, such that only the cosine eigenfunctions appeared in both regimes. For small Gaussian bandwidth regimes (i.e., a few orders of magnitude below the optimal value determined by the bandwidth estimation method), we found that the leading CM2 eigenfunctions are buried deeply in low-ranking eigenvectors (e.g., Ψ8 and higher), with numerous CM1 eigenfunctions occupying the eigenvectors in between. In addition, eigenvectors with cross terms Ψi = {ψvw | v,w ≠ 0} were scattered mostly in mid-range positions (e.g., Ψ12 and higher).
In contrast, for larger Gaussian bandwidth regimes (i.e., near and significantly above ), eigenvectors with cross-terms are buried in much deeper subspaces (e.g., Ψ34 and deeper), with the majority of leading eigenvectors housing content exclusively for either CM1 (w = 0) or CM2 (v = 0). These CM eigenfunctions also have a near-perfect distribution of points, whereas for the ε↓ regime, the distribution of points has noticeably less precision to the ideal form. Notably, the embeddings obtained above and below these regimes are incoherent in form.
We conclude that the eigenfunctions obtained from the larger Gaussian bandwidth regime would be preferred for several reasons. First, the desired CM1 and CM2 parabolas occupy leading subspaces and are thus easily identifiable. The paucity of leading cross-term eigenfunctions is also convenient, since they provide no useful information for our analysis while also obfuscating our search for desired subspaces. Additionally, the geometric structure of all subspaces obtained via ε↑ consistently appears much closer to the canonical form. In Fig. 7, we display the DM eigenfunctions obtained from this regime for the 20 × 20 = 400 EDMs occupying SS2. Subspaces indexed in the CM1 reference frame (row one on the left in Fig. 7) and the CM2 reference frame (row two, left) are displayed, as well as a set of leading composites of these eigenfunctions—forming 2D subspaces—on the right-hand side of the figure.
Importantly, as there was no Gaussian bandwidth value that could ‘recover’ the preferred Legendre-like form, it appears that this feature is lost upon transformation from atomic models to EDMs, caused by a change in the metric (to note, the curved geometry formed by cosines was also in close agreement with the results of applying PCA on this same dataset). As a main agent for this distinction, the distance measure pertaining to EDMs is fundamentally different from the one for the 3D atomic coordinates. Instead of the 3D coordinate points that stand for the positions of atoms in each structure, the data for each EDM is represented by a 3D array of values, one at each voxel (Fig. 1). A key difference, then, is that in the latter case the displacement of atoms is no longer accounted for individually. Instead, every voxel in the 3D map of one state is now compared to a voxel at the same location in another state, with only changes in the value at each voxel entering the distance measure.
Hence, while the eigenfunctions are similar, the relationship between states in these two data types is fundamentally different. To demonstrate this change, Fig. 8 shows a comparison of the pairwise distances between states as calculated for the rectangular latent space, atomic-coordinate structures, and EDMs. As noted in the caption, by assessment of the close similarity between the distances from the latent space and atomic models, we can infer that these two data types are both confined to the rectangular manifold ΩR (albeit of different sizes). More precisely, the metric space corresponding to ACS conformers in SSn and to elements in Xn are very similar even though the two datatypes are very different. Since only the metric enters the LBO, their eigenfunctions are nearly identical.
In contrast, we see that the distances from the EDMs are starkly different from the rectangular pattern, where neighboring states are spatially arranged via an asymptotic-like trend. From these findings, we must infer that the corresponding data ‘live’ in an altogether different manifold. Although the explicit geometric form of ΩEDM is unknown, we have shown that the leading eigenfunctions of the Laplacian on ΩEDM are essentially preserved via the mapping from the latent space. While detailed knowledge of ΩEDM is certainly of interest, it is inconsequential here since our analysis only requires an understanding of the leading eigenfunctions of a manifold.
The overall difference between eigenfunctions obtained via ε↓ and ε↑ is also much more relevant for PDs than for the EDMs. In the small Gaussian bandwidth regime, CM2 subspaces exhibit a severely suboptimal point distribution, such that in some PDs, identification of the CM2 parabola-housing 2D subspace is completely obstructed. These CM2 subspaces are also buried in trailing eigenvectors and interspersed among those with cross-terms. We also note that the value determined by the bandwidth estimation method () provides similar spectral properties, making it a suboptimal choice for pristine data. In contrast, the large Gaussian bandwidth regime (i.e., one order of magnitude larger than and spanning numerous orders of magnitude above it) proves superior in every sense, with CM1 and CM2 eigenfunctions having ideal point distributions and corresponding subspaces occupying leading eigenvectors. And again, the cross-term eigenfunctions are present only in far trailing eigenvectors (e.g., Ψ31 and higher), and would thus not be obstructive during the analysis.**
Regardless, for either Gaussian bandwidth regime, we found that significant eigenfunction misalignments can emerge—with varying magnitude—depending on the direction of projection. The eigenfunctions of an example PD, determined with an appropriate choice of Gaussian bandwidth, are shown in Fig. 9. By plotting the points in each eigenvector in the specific ground-truth sequence constructed for CM1 against a uniform index (for the M2 states in SS2 indexed via {1, 2, 3, …, 400}), a similar display of sinusoids and grid-like patterns emerges as seen in Fig. 7, albeit now misaligned with the eigenvector basis (to note, PCA returned similar spectral features for all PD datasets explored, as demonstrated extensively in our companion article).22
Since we have previously shown that no such property is apparent in the manifold embeddings generated from the 3D EDMs from which these PDs originate, this suggests that the emergence of these eigenfunction misalignments must be tied to a changed metric in the PD manifolds. Specifically, as different 2D projections are taken of the EDMs via p: ΩEDM → ΩPD, the geometry of ΩEDM can become contorted due to the change of pairwise interatomic distances resulting from foreshortening in projection (Fig. 13 in ESI†), such that the apparent span of one CM to another depends on PD. We next describe these abnormalities in greater detail, before quantifying them exactly.
First, it is unclear what effect these eigenfunction misalignments may have on interpretability, since the frames of reference used so far (e.g., as demonstrated in Fig. 9) are unavailable without a priori knowledge. To understand the effect of these eigenfunction misalignments on the PD manifolds, a collection of ΩPD embeddings are next analyzed in greater detail and introduced in sequence of increasing intrinsic dimensionality. Notably, the explicit expressions derived in the previous sections will be used to account for the geometric structures observed in each PD embedding, which generally describe perturbations of a hypersurface spanned by the quasi-continuous conformations.
For the following analysis, sets of pristine images in five chosen PDs are obtained. The first PD is chosen to be normal to the plane of the CM1 rotation, such that all CM1 motions from that perspective only produce changes in the plane of the projection. A similar choice is made for PD2, which is projected onto the plane of CM2 motions (CM2 only exercised for n > 1), with the remaining three PDs chosen along arbitrary directions. For each state space, a set of these Mn images is generated from one of these five PDs to form an n-manifold ΩPD existing in a high-dimensional space (as a specific example, for SS1, there are five PD manifolds with 20 images each, while for SS2, a separate set of five PD manifolds are formed with 400 images each). We next embed data via the DM method with Euclidean metric and, using the eigenfunctions of the LBO, analytically quantify the trajectories of our simulated conformational changes as embodied by the spectral geometry of each ΩPD.
(8) |
Fig. 10 Analysis of eigenfunctions for PD1 in SS1. On the left are the sinusoidal forms ψk that emerge when points—corresponding to images—in each eigenvector are ordered precisely in the sequence CM1 ∈ [1, 20] in which their ground-truth images were constructed. Regardless of any knowledge of such a sequence, the composites of these eigenvectors will always form well-defined geometries via the Lissajous curves, as shown on the right. In the first row on the right are the Chebyshev polynomials of the first kind, of which the parabola {Ψ1 × Ψ2} is the simplest mapping of the conformational information present. Figure source: Institute of Electrical and Electronics Engineers (IEEE) article22 (with cosmetic alterations) under licence CC BY 4.0. |
We next compared these sets of 2D subspaces among the five PDs, and found only subtle differences in the distribution of their point clouds. It is important to underscore here the natural discrepancies between different PD manifolds that should be expected—due to what we term PD disparity—which will continue to manifest in several significant forms throughout this analysis. Naturally, as each 2D projection provides an incomplete representation of the underlying 3D density map, depending on the type of motion and its component on the PD under investigation, ground truth is preserved to different degrees. This disparity affects all ΩPD characteristics and will become more relevant as we investigate the embeddings of datasets generated from structures with multiple degrees of freedom.
As a demonstration of this property, we next reordered the indices of points within all eigenvectors to instead correspond to the specific ground-truth sequence constructed for CM2 (i.e., {1, 21, 41, …, 381}, …, {20, 40, 60, …, 400}). The output of this operation can be seen in Fig. 9 on the right, which manifests a new subset of interspersed sinusoids, with increasing independent from the previous subset; and inhabiting only those eigenvectors in the complement of the CM1 subset. By induction—based on these observations and those in Section 2.1—we conclude that for n degrees of freedom in a given ΩPD, there are n independent sets of sinusoids ψk. Each set , denoted by an index γ per degree of freedom, is interspersed throughout the collection of available eigenvectors {Ψi | i ∈ N}.
Also, in agreement with Section 2.1, we found that for every CM present in the state space, there exists a corresponding set of Chebyshev polynomials. Specifically, in the case of PD1, independently projecting the data for SS2 onto the planes spanned by its {Ψ1 × Ψi} and {Ψ2 × Ψj} combinations (where i > 1; j > 2) revealed a unique set of these polynomials, with the sequence of points along these trajectories corresponding to CM1 and CM2 (Fig. 11). With this knowledge in hand, we next compared the subset of eigenfunctions as obtained in either the reference frame of CM1 or CM2 (Fig. 9, left or right-hand side, respectively) with the Chebyshev polynomials in Fig. 11.
Fig. 11 A subset of the space of 2D subspaces for PD1 in SS2. As demarcated in red and blue boxes, a set of conformational modes exists for both CM1 ∈ [1,400] (red boxes, {Ψ1 × Ψi}) and CM2 ∈ [1,400] (blue boxes, {Ψ2 × Ψj}); where i > 1 and j > 2), interspersed throughout each row. Additionally, note the occurrence of the first parabolic harmonic for CM2 located at {Ψ3 × Ψ6}. Similar plots for the remaining four PDs are provided in (Fig. 14 in ESI†). Figure source: Institute of Electrical and Electronics Engineers (IEEE) article22 (with cosmetic alterations) under licence CC BY 4.0. |
Indeed, each of the Chebyshev polynomials mapping CM1 information in Fig. 11 (visualized with subplots enclosed by blue boxes) corresponds to the subset of sinusoidal eigenfunctions that emerged in the reference frame of CM1 in Fig. 9A; with similar relations holding for CM2 in Fig. 9B. Combining this empirically-obtained knowledge with our a priori understanding of the eigenfunctions of the LBO on known domains as understood throughout Section 2, we are able to express the analytical form of these ΩPD eigenfunctions. In close approximation, we found that the leading ΩPD eigenfunctions appear in the form
Ψi = cos(θ) cos(vπx) + sin(θ) cos(wπy) = β1ψv + β2ψw, | (9) |
Fig. 12 Comparison of analytically-generated functions with heuristic results previously obtained for PD1. For each pair of subplots, values for θ were approximated by eye. Our approximations in (a) and (b) share a remarkable similarity with heuristic results shown in Fig. 9 and 11, respectively, and are able to account for geometric minutiae previously unaccounted for, as well as larger-scale rotations seen in the composite of eigenfunctions. For example, discrepancies can be seen in the slightly tilted appearance of the geometry in Ψ3 of Fig. 9, and corresponding curling at the edges of {Ψ2, Ψ3} in Fig. 11. These differences can be understood as additional, small-scale perturbations which are currently unaccounted for in our general expression. |
As also demonstrated in our analysis of eigenfunction misalignments in Section 2.1, the sum of these squared coefficients is conserved across pairs of eigenvectors, such that the base functions Ψ′i = ψv and Ψ′j = ψw can be expressed as a rotation Ψ = RT Ψ′, having the form
(10) |
This analytical expression suggests that conformational information—pertaining to each of the system's degrees of freedom—will lie on some linear combination of the embedded manifold's orthogonal eigenvectors. This feature is seen most strikingly in {Ψ3 × Ψ4} of Fig. 14B,† where the parabolic surface described by the Chebyshev polynomial is significantly out of alignment with the plane of the 2D subspace containing it. Similar instances, albeit in more subtle form, also arise for surfaces in the remaining three PDs of Fig. 14† (to note, we have also included a brief analysis of state space 3 with similar findings; see Fig. 15 in ESI†). In Section 2, we demonstrated that the need for eigenfunction realignment is due to the change in apparent interatomic distances dependent on projection direction, as illustrated via Fig. 13 in ESI.† This disparity among PDs is inevitable and poses a fundamental challenge, which we have addressed using ESPER in our companion article.22
However, one can still imagine all sorts of other situations, such as a system where one domain blocks—via steric hindrance—another domain from its full range of motion in a specific region of the state space. Additionally to consider are state spaces with “holes” (i.e., interior boundaries),28 where the occurrence of certain states is forbidden due to energetic restraints. Indeed, the presence of complex physical constraints can drastically change the boundary shape of the manifold's domain, and consequently its eigenfunctions. Less fundamentally restricted, similar aberrations can occur due to lack of coverage—as obtained by experimental measurements—of the state space. For example, obtaining poorly sampled data from a rectangular domain would allow any number of arbitrary shapes to emerge.
To better understand the effects of these boundary challenges within the scope of our heuristic analysis, we have created a 2D state space with an octagonal domain; noting that, in this case, analytical solutions of the Laplacian eigenfunctions are not available. The octagonal boundaries were created by eliminating states at the four corners of our standard rectangular domain, as shown in Fig. 16A in ESI.† To circumvent potential occurrences of eigenfunction misalignments due to PD disparity—which may complicate the interpretation of the boundary influence—we opted to embed the 3D electron density maps instead of 2D projections for this analysis. The corresponding manifold embedding obtained from this octagonal state space is shown in Fig. 16C,† which features a number of deviations from the canonical rectangular eigenbasis (Fig. 7). Manually, we attempted to find a transformation from the octagonally-derived eigenbasis (Fig. 16C†) to the rectangular form (Fig. 7) by assuming a collection of suitable rotation operators.†† Indeed, we are able to show that such a transformation is possible, up to some level of uncertainty (Fig. 16D†).
During this exposition, we have laid out a framework for generating and analyzing ground-truth data across unique datatypes with detailed guidelines on how to extend our current scope to different experimental approaches. For example, these practices could be applied with minimal alterations to studies of conformational continuum tailored to single-particle imaging with an X-ray free-electron laser (XFEL),35 nuclear magnetic resonance spectroscopy (NMR),36 and time-resolved (TR) cryo-EM.37 Generally, we believe that there is a potential for the application of our methodology to a wide area of research, particularly where biological systems exercise multiple degrees of freedom in a continuous manner. Our companion article provides one such direct application of these findings to the study of single-particle cryo-EM data, where we describe a method for navigating the spectral geometry of 2D projection data in the presence of experimentally-relevant SNR and CTF to recover a quasi-continuum of 3D structures and corresponding free-energy landscape.22 Beyond the potential for additional applications, a comprehensive understanding of how the eigenfunction basis of a manifold changes dependent solely on datatype is an area of research with wide implications for machine learning across a diversity of fields.
More technically, we next summarize key assumptions of this framework that pertain to the relevancy and breadth of our heuristic analysis of structural heterogeneity. Of first consideration, we have focused here on situations where structures undergo collective rigid-body motions, which we believe are sufficient for the description of most molecular machines, but may fall short of addressing instances involving more complex situations. Examples are studies of machines entailing the concerted binding and release of ligands, which naturally require a separate state space for each possible combination of the machine with its binding partners. For such a situation, a similar heuristic analysis could be conducted using synthetic models combining two or more state spaces.
In our previous work,22 we further tested the ability of PCA and DM to correctly embed PD manifolds formed from models exercising more complex domain motions using the so-called “mouth-wing” toy model. Notably, the spheres making up the “mouth” domain in this toy model were uniquely positioned (keyframed) for each state so as to gradually clump together (ultimately presenting a higher density towards the fully-open state), while the “wings” were programmed to open in unison at constant angle increment about their hinge and curl inwards at the same time. Compared to the synthetic framework used to generate the Hsp90 dataset, the construction of mouth-wing data provides a radically different situation, and accounts for complex interactions of spheres within each domain motion, while still obeying conservation of mass. The embedding of the mouth-wing images still manifested all essential spectral characteristics detailed for Hsp90. When random positional noise was applied to each sphere across all mouth-wing models, changes to the spectral geometry were effectively no different than from applying additive Gaussian noise on the rendered image of each model. Although, with the addition of the mouth-wings model, we fall short of an exhaustive coverage of possible data models, we believe the correspondence between its outputs and those of the Hsp90 dataset created with entirely different design principles suggests the generality of our discoveries.
Finally, we will discuss the challenges of analyzing systems with complex physical constraints. For the majority of this study, we have been interested in casting a wide enough net so as to capture the dynamics of a large portion of conformer systems, which we surmise operate within rectangular boundaries of an n-dimensional latent space of relatively-rigid multi-body motions. However, as described in Section 4, one can still imagine other situations where the boundaries could admit any arbitrary shape. As an example, we demonstrated how the eigenfunctions change when our ground-truth data is constrained by octagonal boundary conditions (Fig. 16†). Furthermore, there exists a potential for complex relationships that render the manifold noncompact.29
In general, analytically solving the Laplacian for any arbitrary boundary is impossible, since analytical solutions are only available for certain elementary shapes such as rectangles, discs, ellipses and special triangles.28,38 On the other hand, spectral methods such as diffusion maps can efficiently obtain numerical solutions, in principle for any boundary. However, this could present additional challenges, since numerical solutions may be harder to incorporate in the process than analytical forms. Depending on the system under study, the prevalence of non-trivial boundaries inducing eigenfunction dissimilarity may vary, with a more detailed analysis required in each case.
Still, when studying systems described by a compact manifold, it is normally expected that the hypercube will continue to ‘fill in’ as more data are acquired from the experiment (or simulation in the case of MD). As previously addressed in Section 4, we contend that in the case of conformers of molecular machines, non-trivial boundaries are the exception rather than the rule. When available, it is most lucrative then for experimentalists to aspire towards acquiring this most complete information. In this way, they will maximize the ability to track conformational changes in their analysis.
Footnotes |
† Electronic supplementary information (ESI) available: Data and code repository. See DOI: https://doi.org/10.1039/d2dd00128d |
‡ We use the term “occupancy” throughout the manuscript, which corresponds to the number of states in conformational space. It is related to the conformational density of states ρ(C), so that the number of states in a small interval [C ± ΔC] is proportional to ρ(C) × ΔC. |
§ In the literature, there is a wide range of nomenclature used here among fields and, in some instances, works by the same authors. For clarity, the following terms are interchangeable: conformational motions (CMs); conformational coordinates (CCs); reaction coordinates (RCs); collective motion coordinates. |
¶ A tabulated description of symbols and abbreviations used throughout this document is available in ESI.† |
|| In reality, cryo-EM data (under the regime of weak phase contrast in bright-field transmission electron microscopy) are projections of the Coulomb potential distribution, which is distinct from the electron density distribution determined in X-ray crystallography. For the present analysis, however, this distinction is irrelevant. |
** Upon introduction of noise (SNR of 0.1) and τ = 5 noisy duplicates of each state, becomes the most suitable choice (along with numerous orders of magnitude above it), with anything below this range completely inadequate. |
†† We note that both the indices and number of rotation operators required for this transformation deviated from our PD-manifold findings on eigenfunction realignment performed on rectangular state spaces,22 creating a more complex tree of decisions. |
This journal is © The Royal Society of Chemistry 2023 |