Xiaohan Lin‡a,
Zhenyu Chen‡a,
Yanheng Li‡a,
Zicheng Mabc,
Chuanliu Fand,
Ziqiang Caod,
Shihao Feng*b,
Jun Zhang*b and
Yi Qin Gao*ab
aBeijing National Laboratory for Molecular Sciences, College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, China. E-mail: gaoyq@pku.edu.cn
bChangping Laboratory, Beijing 102200, China. E-mail: fengsh@cpl.ac.cn; jzhang@cpl.ac.cn
cAcademy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China
dInstitute of Artificial Intelligence, Soochow University, Suzhou 215006, China
First published on 15th May 2025
Modern protein engineering demands integrated sequence–structure representations to tackle key challenges in designing, modifying, and evolving proteins for specific functions. While sequence-based methods are promising for generating novel proteins, incorporating structure-oriented information improves the success rate and helps target corresponding functions. Therefore, rather than relying solely on sequence or structure-based approaches, a consensus strategy is essential. Here, we introduce ProTokens, machine-learned “amino acids” derived from structural databases via self-supervised learning, providing a compact yet information-rich representation that bridges sequence and structure modalities. Instead of treating sequences and structures separately, we build PT-DiT, a multimodal diffusion transformer-based model that integrates both into a unified representation, enabling protein engineering in a joint sequence–structure space, streamlining the design process and facilitating the efficient encoding of 3D folds, contextual protein design, sampling of metastable states, and directed evolution for diverse objectives. Therefore, as a unified solution for in silico protein engineering, PT-DiT leverages sequence and structure insights to realize functional protein design.
Although structure prediction models like AlphaFold17–19 and RoseTTAFold20 have demonstrated their ability to map sequences to structures, the quality of their folding largely relies on homologous sequences and structural templates. On the other hand, inverse folding models such as ProteinMPNN6 can design sequences that stabilize and accommodate given backbone structures, but their generalizability to rare and novel folds remains to be tested. Consequently, the modality difference of protein representations causes significant divergence in the research paradigms of proteins, particularly in the realm of protein design. Thus, bridging the gap between sequence and structure modalities remains a significant challenge and limits applications such as the generation of diverse and novel de novo proteins.13,21
Aiming to provide a unified perspective on protein sequences and structures, we developed ProTokens. Conceptually, ProTokens represent a novel set of “amino acids” learned and extracted from protein structure databases through self-supervised learning. As machine-learned amino acids, ProTokens offer compact, informative representations of protein structures. They possess the technical advantages of sequence representations, being compact for storage and convenient for use as input/output in computational models. Furthermore, through a carefully designed training strategy, ProTokens are as informative as foldable 3D protein structures. Protein structures are efficiently compressed into sequences of ProTokens, which can then be accurately and reliably “folded” back into their corresponding structures.
By harnessing the computational convenience of sequence-like representations and the functional relevance of structure-awareness, ProTokens naturally bridge the gap between sequence-based and structure-based methodologies in protein engineering. By combining natural amino acids with ProTokens, we train a diffusion transformer22-based model, PT-DiT, to model the joint probability of protein sequences and structures. Leveraging this generative objective, we found PT-DiT to be a versatile tool for protein engineering at both the sequence and structure levels, enabling design with residue-wise protein contexts for tasks such as contextual inverse folding and functional site scaffolding. Similar to protein language models, generative pre-training of PT-DiT yields a powerful latent representation that jointly and faithfully embeds sequences and structures. Utilizing zero-shot or few-shot learning on this representation, PT-DiT's capabilities extend to sampling metastable states in protein dynamics, rediscovering naturally occurring or de novo “evolutionary” intermediates in remote homologs, and directing the evolution of proteins towards specific objectives.
ProTokens are learned via a vector-quantized autoencoding23 framework trained on a backbone reconstruction task using 3D structural data from metastable conformations of foldable proteins (Fig. 1a). Each residue in a given structure is mapped to a discrete ProToken selected from a constrained codebook, capturing residue-level structural features. Notably, the input to the encoder consists solely of backbone atom coordinates (N, CA, C, O), ensuring that the ProToken representation encodes sequence-agnostic geometric information. Unlike naturally occurring amino acids, ProTokens can be readily decoded into atomistic coordinates. During training, ProTokens that capture different local and global structural features remain distinguishable, with the training objective driving ProTokens representing similar conformations closer together in latent space. This strategy yields a more nuanced representation of conformational ensembles. Once trained, ProTokens enable efficient compression, storage, alignment, and comparison of protein structures.
PT-DiT aims to engineer protein sequences, structures, or both from a unified generative perspective, regardless of whether sequence or structural information is initially available (Fig. 1b). Built upon a pretrained twin-tower embedding model, we use ProTokens (structure-aware embeddings) and natural amino acid embeddings as dual-channel representations of proteins. The structure embeddings are derived from protein backbone structures via the ProToken encoder, while the sequence embeddings are computed using a PCA-compressed version of AlphaFold2 (ref. 18) or ESMFold12 embeddings. The ProToken embeddings and sequence embeddings are concatenated into a joint representation for each residue, which serves as the input to train a diffusion-based generative model, PT-DiT (see Methods). Drawing inspiration from image diffusion methods such as RePaint24 and probability flow ordinary differential equations,25 PT-DiT models the joint distribution of protein sequences and structures, enabling a wide range of tasks including de novo design, scaffolding, metastable state sampling, and directed evolution. In contrast to traditional approaches that treat sequences and structures as separate modalities19,26 linked only by computational folding or inverse folding, PT-DiT integrates both into a unified representation. This integration enables protein engineering in a joint sequence–structure space, streamlining the design process and facilitating the derivation of specialized functions.
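The per-residue joint representation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pca_compress` and `joint_representation` are hypothetical, the embeddings are random stand-ins, and the 32 + 8 channel split is an arbitrary choice that merely sums to the 40 dimensions reported for PT-DiT embeddings. In practice the PCA projection would presumably be fitted once on a large corpus rather than on a single protein.

```python
import numpy as np


def pca_compress(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    """Project row vectors onto their top principal components (via SVD)."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def joint_representation(structure_emb: np.ndarray, sequence_emb: np.ndarray) -> np.ndarray:
    """Concatenate structure and sequence channels residue by residue."""
    if structure_emb.shape[0] != sequence_emb.shape[0]:
        raise ValueError("both channels must describe the same residues")
    return np.concatenate([structure_emb, sequence_emb], axis=-1)


rng = np.random.default_rng(0)
n_res = 64
protoken_emb = rng.normal(size=(n_res, 32))   # stand-in for ProToken encoder output
raw_seq_emb = rng.normal(size=(n_res, 384))   # stand-in for folding-model sequence embeddings
seq_emb = pca_compress(raw_seq_emb, n_components=8)
joint = joint_representation(protoken_emb, seq_emb)
print(joint.shape)
```

Keeping both channels in one array per residue is what lets a single diffusion model treat sequence and structure as one object.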
We introduce a probabilistic framework to clarify the methodology underlying structure-informed representations (see the ESI†), establishing the basis for the ProToken and PT-DiT algorithms. To ensure uniqueness, compactness, and sufficiency of the ProToken code, we incorporate alignment and uniformity loss functions27 (Fig. S1†). A test set was curated for the reconstruction task, enabling evaluation of our protein tokenization module's performance and generalizability using structures from CASP1428 and the RCSB database,29 and AFDB30 dark clusters identified by Foldseek.26 TM-score31 and LDDT32 were computed between the reconstructed and original structures (see Methods) to benchmark performance. We subsequently demonstrate PT-DiT's feasibility for traditional tasks such as inverse folding and contextual backbone design. We then comprehensively benchmark PT-DiT's performance across three main applications: (a) metastable conformation sampling, validated by molecular dynamics simulations and experimental data; (b) de novo protein sequence design via latent space interpolation between remote homologs; and (c) directed evolution using a state-of-the-art active learning pipeline, benchmarked with EVOLVEpro.33
Fig. 2 Performance of ProTokens on reconstruction test datasets. (a) The rTM-score distribution of ProToken reconstruction across 513, 87, 45, and 33 …
We further validated ProToken's reconstruction performance on disconnected domain assemblies from CASP14 and CASP15, as well as multimers from the AF2Complex36 benchmark sets. Notably, multi-domain and multimer folds were not included during training. The median rTM-scores for the CASP14 and CASP15 domain assemblies are 0.96 and 0.926, respectively (Fig. S2D†), and the median rTM-score for multimers is 0.891 (Fig. S2E and F†). These findings indicate that ProTokens can robustly generalize and accurately reconstruct a wide variety of protein structures, including challenging domain assemblies and multimers.
To further evaluate ProTokens' capacity to represent diverse protein conformations, we assessed reconstruction performance across proteins of varying shapes (defined based on the radius of gyration) and sequence lengths in four test sets (Fig. 2b). Reconstruction quality remained consistent across different lengths and shapes (Fig. S2A–C†), except in cases of extreme conformations such as highly extended, near-linear forms (Fig. 2c). We also examined ProTokens' reconstruction performance for seven secondary structure classes. The median reconstruction LDDT (rLDDT) exceeded 0.90 for all classes, with ‘alpha helix’ and ‘strand’ exceeding 0.95 (Fig. 2d). These findings indicate that ProTokens do not preferentially favor or discriminate against particular folds or secondary structure features, whether global or local, making them a more generalizable and reliable tool for treating different protein folds or local environments in downstream tasks.
Next, we assessed ProToken's ability to capture alternative conformations by reconstructing proteins with multiple states from 50 PDBFlex37 clusters that span local RMSDs of 2.0 Å to 53 Å. The results indicate a median rTM-score of 0.98 across all conformations, with a minimum of 0.88, demonstrating that the structural differences before and after reconstruction are preserved with high fidelity (Fig. 2e). These results suggest that ProTokens are distance-preserving for protein conformations, and therefore they can distinguish different conformations with high resolution. Although the training set does not contain explicit examples of identical sequences with different metastable structures (details in the ESI†), the ability to distinguish such conformations emerges during training, driven by the structural diversity present in the dataset and the discriminative capacity of the ProToken encoder.
Because distinct conformations of the same protein sequence map to different ProToken sequences, we examined how ProToken representations correspond to protein structures. Specifically, we computed a BLOSUM38 (blocks substitution matrix) for ProTokens based on residue pairs with similar local environments (details in Methods and the ESI†). We then compared these BLOSUM values with the cosine similarity of each pair of ProToken embeddings (Fig. 2f). An average Spearman's correlation coefficient of 0.62 indicates a close connection between the ProToken representation space and local structural environments.
Inspired by Foldseek26 and MMseqs,39 we combined the ProToken-based BLOSUM with the Needleman–Wunsch algorithm40 to derive the PT-score, an alignment score between ProToken sequences. We randomly sampled three dark cluster centers from AFDB. For each center, we gathered all similar structures in the RCSB database using Foldseek, yielding 42 proteins in total. We then computed both TM-scores and ProToken similarity scores (PT-scores) for each structure pair (Fig. 2g). Higher PT-scores correlate with higher TM-scores, indicating that the PT-score not only differentiates similar from dissimilar structures but also captures quantitative differences relative to a reference. Thus, the PT-score serves as a promising metric for structure searching, clustering, and analysis algorithms, particularly in applications such as studying long-range allosteric interactions41 and protein pocket identification.42
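To make the alignment step concrete, the following sketch scores two token sequences by Needleman–Wunsch global alignment against a substitution matrix. It is an illustration only: the toy 4-token matrix, the linear gap penalty of -2, and the normalization by the reference self-alignment score in `pt_score` are all assumptions, since the text does not specify how the PT-score is scaled or how gaps are penalized.

```python
import numpy as np


def needleman_wunsch_score(a, b, sub, gap=-2.0):
    """Global alignment score of token-id sequences a and b.

    sub[i, j] is the substitution score between token ids i and j;
    gap is a linear gap penalty (an assumption of this sketch).
    """
    n, m = len(a), len(b)
    dp = np.zeros((n + 1, m + 1))
    dp[:, 0] = gap * np.arange(n + 1)  # prefix of a aligned to gaps only
    dp[0, :] = gap * np.arange(m + 1)  # prefix of b aligned to gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(
                dp[i - 1, j - 1] + sub[a[i - 1], b[j - 1]],  # match or mismatch
                dp[i - 1, j] + gap,                          # gap in b
                dp[i, j - 1] + gap,                          # gap in a
            )
    return float(dp[n, m])


def pt_score(reference, query, sub, gap=-2.0):
    """Alignment score normalized by the reference self-alignment (hypothetical scaling)."""
    return needleman_wunsch_score(reference, query, sub, gap) / needleman_wunsch_score(
        reference, reference, sub, gap
    )


# Toy substitution matrix over a 4-token vocabulary: +2 on the diagonal, -1 elsewhere.
toy_sub = 3.0 * np.eye(4) - 1.0
ref_tokens = [0, 1, 2, 3]
query_tokens = [0, 1, 3]
print(needleman_wunsch_score(ref_tokens, query_tokens, toy_sub))
print(pt_score(ref_tokens, query_tokens, toy_sub))
```

With a real ProToken vocabulary, `toy_sub` would be replaced by the ProToken BLOSUM and the token lists by encoder output.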
To illustrate PT-DiT's co-generation capabilities, we showcase three representative de novo generation cases featuring distinct geometries: helix bundles (Fig. 3a), β-barrels (Fig. 3b), and other complex folds (Fig. 3c). The TM-scores comparing the ESMFold-predicted structures (derived from the generated sequences) to the generated structures (scTM-scores in Fig. 3a–c) are 0.94, 0.88, and 0.85, respectively. These examples highlight PT-DiT's ability to concurrently design protein sequences and structures with diverse geometries, while preserving self-consistency between each generated sequence–structure pair.
Similar to training-free image editing, PT-DiT can perform “inpainting” on masked ProTokens or sequence embeddings by leveraging the unmasked regions (Fig. 3d). One notable application is contextual backbone design. As a proof of concept, we employed PT-DiT for CDR (complementarity-determining region) grafting (Fig. 3e). Given a specified CDR3 sequence and structure, PT-DiT simultaneously generates both antibody sequences and their corresponding structures. Validation with ESMFold reveals one generated sequence–structure pair that achieves a scTM-score of 0.82 and a sequence identity of 0.84 compared to a known human-derived antibody, while preserving the CDR3 conformation in the newly generated protein. We also showcase several examples of ligand-binding pocket scaffolding (Fig. S3†). These proof-of-concept results indicate that PT-DiT can design functional proteins that retain and integrate crucial functional motifs, underscoring its potential in functional protein engineering.
We benchmarked our interpolated conformations against a long-duration molecular dynamics (MD) simulation of Abelson tyrosine kinase (Abl) binding to the cancer drug imatinib,44 in which imatinib binding triggers a switch in the activation loop (A-loop). We analyzed the 10 ms MD trajectory using Time-lagged Independent Component Analysis (TICA),45,46 reducing the protein's conformational landscape to a two-dimensional space (Fig. 4a). Clustering these trajectories revealed three metastable states: I, II, and III (Fig. 4b).
Fig. 4 PT-DiT performance in metastable state sampling compared with MD simulations and experiments. (a) Free energy surface in TICA coordinates (IC1 and IC2) derived from MD simulations of Abl binding to imatinib. Conformational cluster centers (I, II, and III) are extracted from the trajectories, with scatter points showing the TICA projections of interpolated states: from I to II (yellow), II to III (green), and I to III (red) on the left. On the right, scatter points (green) represent the TICA projections of interpolated states between two experimentally resolved Abl structures (PDB IDs 1OPJ and 2F4J). (b) Protein structure illustration of metastable states during Abl binding to imatinib, sampled by PT-DiT, with the active loop region highlighted in salmon. (c) Closed, intermediate, and open states of MurD. Structures sampled by PT-DiT (colored yellow, salmon, and cyan for closed, intermediate, and open states, respectively) are superimposed with their corresponding crystal structures (gray). (d) Distribution of the collective variable defining the “hinge” motion between MurD domains, as observed in MD trajectories (orange) and interpolated structures (blue). The dashed red, green, and blue lines correspond to the experimentally resolved structures of the closed, intermediate, and open states, respectively. (e) Sampled structures of the inactive state R2 (salmon), and the active states R3 (blue) and R4 (green) of μOR. PT-DiT-generated R2 and R4 structures are superimposed with their corresponding crystal structures (gray). Ensembles of the attached fluorescent molecules (HO-1427) on residues R173 and R182 are displayed. (f) Predicted distribution of distances between fluorescent molecules in R2 (red), R3 (blue), and R4 (green), with average distances shown as dashed lines.
Interpolation between these states was performed using PT-DiT, and the resulting intermediate structures were projected onto the TICA subspace (Fig. 4a, left). Between the three cluster centers, we identified an intermediate state located in a low-free-energy region of the free energy surface (Fig. 4a and b). We also interpolated between the open (PDB ID 1OPJ) and closed (PDB ID 2F4J) X-ray crystal structures of Abl44 (Fig. 4a, right). The pseudo-trajectory in latent space traverses low free energy regions along the principal components defined by TICA, suggesting that the intermediate states sampled by PT-DiT are physically plausible.
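As a rough picture of what a latent-space pseudo-trajectory looks like, the sketch below interpolates between two per-residue latent arrays with spherical linear interpolation (slerp), a common choice for diffusion latents. Whether PT-DiT uses slerp, linear interpolation, or another scheme is not stated in this section, so the interpolation rule is an assumption, and the latents here are random placeholders. Decoding each intermediate latent back to a structure would require the trained model and is not shown.

```python
import numpy as np


def slerp(z_a: np.ndarray, z_b: np.ndarray, w: float) -> np.ndarray:
    """Spherical linear interpolation between two latent arrays, with w in [0, 1]."""
    a, b = z_a.ravel(), z_b.ravel()
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < 1e-6:
        # Nearly parallel latents: fall back to plain linear interpolation.
        return (1.0 - w) * z_a + w * z_b
    return (np.sin((1.0 - w) * omega) * z_a + np.sin(w * omega) * z_b) / np.sin(omega)


def pseudo_trajectory(z_a: np.ndarray, z_b: np.ndarray, n_frames: int) -> list:
    """Latents evenly spaced between two end states, end points included."""
    return [slerp(z_a, z_b, w) for w in np.linspace(0.0, 1.0, n_frames)]


rng = np.random.default_rng(0)
z_state_1 = rng.normal(size=(100, 40))  # placeholder latent of one conformation
z_state_2 = rng.normal(size=(100, 40))  # placeholder latent of another conformation
path = pseudo_trajectory(z_state_1, z_state_2, n_frames=5)
```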
Beyond aligning with the MD simulation trajectory, PT-DiT is designed to capture intermediate states that can be experimentally observed. We illustrate this by examining the transition of UDP-N-acetylmuramoyl-L-alanine:D-glutamate ligase (MurD) from its open conformation (PDB ID 1E0D)47 to its closed conformation (PDB ID 3UAG).48 The pseudo-trajectory generated by PT-DiT reveals domain 3 rotating around its hinge with domain 2 (Fig. 4c and S5†). Notably, one intermediate structure closely resembles the semi-open conformation identified by X-ray crystallography (TM-score = 0.94, PDB ID 5A5E),49 proposed as a stable intermediate state during MurD's ligand-driven conformational change.
To further check the validity of the interpolated MurD conformations, we introduce an angular collective variable, θ (details in Fig. S5†), based on previous MD studies of MurD's conformational dynamics.50 The distribution of θ values for the interpolated structures encompasses most of the range observed in both MD simulations and crystal structures (Fig. 4d). These findings indicate that PT-DiT can reliably generate experimentally validated intermediate conformations, thus offering a rapid alternative for sampling intermediate states of interest for experimental studies.
To further explore PT-DiT's capacity to elucidate hidden states in protein conformational transitions, we examined the μ-opioid receptor (μOR) system. Previous studies indicate that G-proteins bind to an open pocket formed by the outward movement of transmembrane helix 6 (TM6).51 Double electron–electron resonance (DEER) experiments classify the structural ensemble into four populations (R1–R4), distinguished by differing distances between TM6 and TM4. Both active and inactive states of μOR (PDB IDs 6DDF and 4DKL, respectively)52,53 exhibit an approximately 10 Å shift between TM6 and TM4. While R2 and R4 have been validated by X-ray crystallography (Fig. 4e, left and right), R3 remains unresolved. To address this question, we interpolated structures between R2 and R4 using PT-DiT, producing an intermediate conformation, R3′, whose TM4–TM6 distance closely matches the experimentally inferred value for R3 (Fig. 4f, ESI†). R3′ thus serves as a structural model for the unknown active state R3, providing a foundation for further biological analysis and design.
Small ubiquitin-like modifier (SUMO) proteins, which covalently bind to target proteins to modulate their functions, share structural similarities with ubiquitins but exhibit distinct functional outcomes. Despite having nearly identical backbone structures, human ubiquitin (PDB ID 1D3Z)54 and human SUMO-3 C47S (PDB ID 1U4A)55 share only 16% sequence identity (12 out of 76 residues), classifying them as remote homologs. To explore other potential remote homologs that preserve this common backbone, we performed interpolation in PT-DiT's latent space between 1D3Z and 1U4A. Throughout the interpolation pathway, the backbone structure remained largely unchanged while the sequence content evolved. Notably, sequence identity varied in a non-linear manner, forming a “platform” along the interpolation trajectory (Fig. 5A). One intermediate sequence from this platform (Fig. 5B, gray) has only 61% and 20% sequence identity to 1D3Z and 1U4A, respectively, but exhibits a 96% sequence identity to human NEDD8 (PDB ID 3DBH)56—another ubiquitin-like protein that shares the same structural scaffold. These findings underscore PT-DiT's ability to encode the full range of sequence variation that supports a common backbone structure.
Fig. 5 PT-DiT performance in evolution-like protein discovery and directed evolution. (A) Sequence identity of interpolated proteins compared to ubiquitin (PDB ID 1D3Z) and SUMO (PDB ID 1U4A). The sequence from platform1 shows a 94% identity with NEDD8 (PDB ID 3DBH). (B) Structural alignment of the rediscovered ubiquitin-like sequence from platform1 (gray) with structures of 1D3Z (cyan) and 1U4A (green). (C) Sequence identity of interpolated proteins with respect to two carbonic anhydrases (PDB IDs 3DCW and 3JXG) from different species, highlighting platform1 and platform2. (D) Structural alignment of platform1 and platform2 (gray) with structures of 3DCW (cyan) and 3JXG (green). The sequence of platform1 shows 96% identity with another carbonic anhydrase (PDB ID 1AZM), while platform2 represents a de novo sequence modeled on the carbonic anhydrase backbone. (E) Overview of directed evolution with PT-DiT based on an active learning algorithm. (F) Proportion of PT-DiT, ESM2, and their combination achieving the highest top activity and highest median activity in the proposed sequences, with each method using 10 rounds of active learning and 10 variants per round. (G–I) High-activity candidate percentages across 10 rounds of simulated directed evolution for different DMS datasets. Panels (G), (H–I) show the results for PT-DiT, ESM2, and the combination of PT-DiT and ESM2, respectively.
We next investigated whether this interpolation strategy could produce de novo sequences. Carbonic anhydrases (EC 4.2.1.1), which catalyze the reversible conversion of carbon dioxide and water into carbonic acid, play a vital role in regulating pH and the acid–base balance in biological systems. Specifically, we performed interpolation between a human carbonic anhydrase (PDB ID 3DCW)57 and a homolog from Mus musculus (PDB ID 3JXG),58 which share 31% sequence identity. The resulting trajectory showed uneven changes, forming two “stable sequences” (Fig. 5C). The first stable sequence aligns with human carbonic anhydrase I (PDB ID 1J9W)59 at 93% sequence identity. Intriguingly, the second sequence has no close natural homologs in the UniRef100 database60 (maximum sequence identity 64%). Nonetheless, both ESMFold and AlphaFold3 confidently fold this sequence into the same backbone as 3DCW and 3JXG (Fig. S6†), demonstrating its ability to stabilize the carbonic anhydrase scaffold. These findings show that PT-DiT's unified sequence–structure representation not only reveals potential remote homologs but also generates novel sequences, presenting a fresh paradigm for protein design.
Here, we use 12 Deep Mutational Scanning (DMS) datasets demonstrated in EVOLVEpro33 to simulate directed evolution, focusing on how different protein representations influence evolutionary outcomes. Unlike protein language models (PLMs) that rely solely on sequence- or MSA-based representations, we benchmark PT-DiT embeddings—which learn from both the structure and sequence independently—against the ESM2-15B baseline within EVOLVEpro. This integrated approach is expected to incorporate more structural information into the latent vector space. We also concatenate PT-DiT embeddings with ESM2-15B embeddings to form unified embeddings and explore whether these different embeddings complement or conflict with each other.
The results indicate that PT-DiT embeddings perform comparably to ESM2-15B embeddings across all 12 DMS datasets (Fig. 5F–I and S7A–F†). For top-1 activity predictions, the unified embeddings achieve 75% of the best outcomes, while PT-DiT and ESM2-15B yield 17% and 8%, respectively. For median activity, the unified embeddings achieve 58% of the best outcomes, PT-DiT 33%, and ESM2-15B 9%. Notably, PT-DiT excels in systems involving signal transduction (Fig. S7E†), viral replication (Fig. 5G), and enzymatic activities (Fig. 5I), where incorporating structural information is thought to be crucial for capturing the functional impact of mutations.
Although PT-DiT may underperform ESM2-15B in certain high-activity prediction scenarios, combining embeddings from both models does not compromise overall accuracy (Fig. 5H). Instead, this combined strategy improves the proposal rate for high-activity variants (Fig. 5I), enhancing both median and top activities among the predicted variants. Given that ESM2-15B embeddings in EVOLVEpro are 5120-dimensional whereas PT-DiT embeddings are only 40-dimensional, these findings underscore the compactness and efficacy of PT-DiT embeddings, suggesting the latter to be both more tractable and better suited for downstream biophysical applications.
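The simulated directed-evolution protocol (10 rounds, 10 variants per round, on concatenated embeddings) can be illustrated with a toy active-learning loop. This sketch does not reproduce EVOLVEpro: the regressor is a closed-form ridge regression chosen to keep the example dependency-free, selection is a plain greedy top-k, the embeddings and activities are synthetic, and the dimensions (64 and 8) are scaled-down stand-ins for the 5120-dimensional ESM2-15B and 40-dimensional PT-DiT embeddings. All function names are hypothetical.

```python
import numpy as np


def fit_ridge(x: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression; the bias column is regularized too, for simplicity."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    gram = xb.T @ xb + alpha * np.eye(xb.shape[1])
    return np.linalg.solve(gram, xb.T @ y)


def predict(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    return np.hstack([x, np.ones((len(x), 1))]) @ w


def simulate_directed_evolution(embeddings, activity, n_rounds=10, batch=10, seed=0):
    """Round 1 tests random variants; later rounds test the top predicted untested ones."""
    rng = np.random.default_rng(seed)
    tested = rng.choice(len(embeddings), size=batch, replace=False).tolist()
    for _ in range(n_rounds - 1):
        w = fit_ridge(embeddings[tested], activity[tested])
        scores = predict(embeddings, w)
        scores[tested] = -np.inf  # never propose an already-measured variant
        tested.extend(np.argsort(scores)[-batch:].tolist())
    return tested


rng = np.random.default_rng(1)
n_variants = 500
plm_emb = rng.normal(size=(n_variants, 64))  # stand-in for language-model embeddings
pt_emb = rng.normal(size=(n_variants, 8))    # stand-in for PT-DiT embeddings
unified = np.concatenate([plm_emb, pt_emb], axis=1)
# Synthetic "DMS" activities: a noisy linear function of the unified embedding.
activity = unified @ rng.normal(size=unified.shape[1]) + 0.1 * rng.normal(size=n_variants)
tested = simulate_directed_evolution(unified, activity)
print(len(tested), float(np.median(activity[tested])), float(np.median(activity)))
```

Swapping `unified` for `plm_emb` or `pt_emb` alone gives the single-representation baselines that such a benchmark would compare.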
From the perspective of the pretraining model of proteins, sequences and structures are interconnected in two fundamental ways: structures are linearized into sequences of neural amino acids, while from another viewpoint, amino acids represent nature's own strategy for compressing and storing protein structures. Consequently, leveraging ProTokens enables the convenient training of this foundational model, PT-DiT, to capture the joint probability of sequences and structures. Through generative pre-training, PT-DiT not only facilitates the co-design of matching sequences and structures but also supports the tailored design of proteins based on specific contexts, enhancing its versatility as a tool for protein engineering.
Furthermore, PT-DiT generates latent representations of proteins (both sequences and structures) that form a more structured space, facilitating higher-level abstractions of the protein universe. We discovered that perturbing, interpolating, and evolving proteins in this latent space gives rise to numerous applications. For example, interpolating between two conformations of the same protein helps identify potential intermediate states of protein dynamics. These hidden conformational states may offer new insights into the possible mechanisms by which proteins perform their functions. Simultaneously, we observed that interpolating between two sequences sharing a common backbone structure aids in identifying other sequences that can also stabilize it. Rediscovering naturally occurring sequences may reveal evolutionary pathways, while the discovery of de novo sequences offers a new methodology for protein design. Furthermore, the compact and organized latent space proves suitable for evolving proteins to enhance their activities toward specific objectives. Through adaptively learning activity profiles of mutations across multiple rounds of experimental feedback, PT-DiT yields high-activity candidates among both selected and proposed variants (more details in ESI, Fig. S10 and S11†).
The experiments performed in this study showed that unifying sequence and structure modalities of proteins using ProTokens is highly beneficial. From a physical standpoint, we verify the hypothesis that while the spaces of protein structures are inherently large, the metastable states can be considered countable and discretizable. We further developed efficient algorithms to discretize and represent these metastable states as machine-learned neural amino acids. Technically, compared to redundant and complex structural representations such as atomic coordinates or surface meshes, ProTokens provide a more compact and regular format for protein structures, making them more machine-friendly and suitable for computational models in structural biology. Furthermore, this amino-acid-like representation naturally integrates with protein sequences, enabling joint modeling of both the sequence and structure.
By mapping the structure to ProTokens, we have demonstrated that structure comparison and protein design benefit significantly from the compactness and convenience ProTokens provide. Moreover, by utilizing ProTokens in a latent space that organizes proteins to accommodate a generative objective, applications such as directed interpolation and evolution naturally emerge, significantly facilitating protein engineering. However, there are certain limitations to our approach. First, ProTokens, while effective, are far from an optimal code for proteins. In practice, we have observed a “degeneracy” in which highly similar structures may be encoded into different ProToken sequences. This degeneracy can hinder several downstream applications by increasing model confusion and suggests the potential for a more compact representation and further compression. Second, the exploration of PT-DiT's possible applications is still in its early stages. Many promising avenues remain to be explored, such as large-scale structural sampling, evolutionary analysis, functional annotation, and performing Bayesian optimization of protein physicochemical properties based on latent space representations. Future efforts will focus on improving compression rates and expanding the scope of applications for PT-DiT.
The multi-domain reconstruction task is evaluated using the CASP14 and CASP15 multi-domain datasets, which include domain annotations published on the CASP website. This process yields 17 multi-domain samples for CASP14 and 13 for CASP15. The multimer reconstruction task is assessed using the benchmark dataset of AF2Complex. More details of multi-domain and multimer datasets can be found in the ESI.†
For MurD, we used MD trajectories from a previous study in which simulations were performed with Amber ff14SB under NAMD, using a 2 fs timestep for a total of 200 ns. All conformations from this trajectory were included in the variable analysis; during the course of the simulation, MurD transitioned from its closed to open state. The collective variable for MurD was defined based on the centers of mass of three residue selections: residues 120–230, 230–299, and 299–437.
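A hinge-type angular collective variable built from the three residue selections above can be computed as follows. The selections (120–230, 230–299, 299–437) come from the text; everything else is an assumption made for illustration: the angle is taken at the center of the middle selection between the vectors pointing to the other two centers, and unweighted Cα positions are used as a proxy for centers of mass. The authors' exact definition is given in their Fig. S5 and may differ.

```python
import numpy as np

MURD_SELECTIONS = ((120, 230), (230, 299), (299, 437))  # inclusive residue ranges


def hinge_angle(ca_coords: np.ndarray, residue_ids: np.ndarray,
                selections=MURD_SELECTIONS) -> float:
    """Angle (degrees) at the center of the middle selection.

    ca_coords: (N, 3) C-alpha coordinates; residue_ids: (N,) residue numbers.
    Centers are unweighted means of C-alpha positions (an approximation).
    """
    centers = []
    for start, end in selections:
        mask = (residue_ids >= start) & (residue_ids <= end)
        centers.append(ca_coords[mask].mean(axis=0))
    v1 = centers[0] - centers[1]
    v2 = centers[2] - centers[1]
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


# Synthetic geometry: first selection along x, third along y, middle one at the origin.
ids = np.arange(1, 438)
coords = np.zeros((ids.size, 3))
coords[(ids >= 120) & (ids < 230)] = [1.0, 0.0, 0.0]
coords[(ids > 299) & (ids <= 437)] = [0.0, 1.0, 0.0]
theta = hinge_angle(coords, ids)
```

Applying `hinge_angle` to every MD frame and every interpolated structure would give the two distributions being compared.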
The encoder fθ is a parameterized SE(3)-invariant mapping that transforms a protein structure x with Nres residues into a ds-dimensional single representation $s \in \mathbb{R}^{N_{\mathrm{res}} \times d_s}$ and a dp-dimensional pair representation $p \in \mathbb{R}^{N_{\mathrm{res}} \times N_{\mathrm{res}} \times d_p}$ derived from its distance matrix and backbone dihedrals. Inspired by the EvoFormer and structure module in AlphaFold2, we form a scalable “sandwich-like” transformer module comprising 2, 4, and 2 layers of EvoFormer, ResiDual transformer, and EvoFormer, respectively. These layers are designed to update single representations and pair representations, and co-update single and pair representations sequentially. Subsequently, a structure module is utilized to enhance the encoder's structure-awareness by aggregating information from the processed s, p and the raw structure x, and finally output a d-dimensional representation $f_{\theta,r}(x) \in \mathbb{R}^{d}$ for each residue r (1 ≤ r ≤ Nres).
For the tokenizer hθ, we employ vector quantization (VQ)23 techniques commonly used in image tokenization. The VQ module dynamically maintains and updates a “codebook” {ci}, which serves as cluster centers in the vector space. Each input vector fθ,r(x) is assigned to the nearest code ci in the codebook via a nearest neighbor search:
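$$h_\theta\big(f_{\theta,r}(x)\big) = \underset{c_i}{\arg\min}\; \lVert f_{\theta,r}(x) - c_i \rVert_2 \qquad (1)$$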
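The nearest-neighbor assignment can be sketched directly in NumPy. This is a minimal illustration of the lookup step only, assuming Euclidean distance; the codebook here is a toy array, and how the codebook is maintained and updated during training is not shown.

```python
import numpy as np


def quantize(residue_emb: np.ndarray, codebook: np.ndarray):
    """Assign each residue embedding to its nearest code.

    residue_emb: (N_res, d) encoder outputs; codebook: (K, d) code vectors.
    Returns the token ids (N_res,) and the quantized vectors (N_res, d).
    """
    # Squared Euclidean distance between every embedding and every code: (N_res, K).
    sq_dist = ((residue_emb[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = sq_dist.argmin(axis=1)
    return tokens, codebook[tokens]


toy_codebook = np.eye(5)                        # 5 well-separated toy codes in 5-D
toy_emb = toy_codebook[[3, 1, 1, 0]] + 0.05     # slightly perturbed copies of codes 3, 1, 1, 0
tokens, quantized = quantize(toy_emb, toy_codebook)
print(tokens.tolist())
```

The resulting integer sequence is what plays the role of a ProToken string for a structure.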
In most protein engineering scenarios, it is essential to represent protein metastable states defined on a functional timescale, where τobs = τfunc < τfold. To this end, we use structures from the RCSB database as the training set, as they are often regarded as functionally relevant conformations of proteins. Metastability implies that the structural ensemble {x}τ≪τobs should be tokenized into the same set of ProTokens, with irrelevant fluctuations at τ ≪ τobs treated as noise. To mimic the distribution of structural fluctuations, we augment the structural data by applying perturbations (details in the ESI†). Additionally, we apply an alignment loss27 to constrain the embedding vectors of two structures x and x′ belonging to the same metastable state:
![]() | (2) |
To further optimize the representation, a uniformity loss is introduced to encourage the code distribution to be as uniform as possible, ensuring efficient use of the latent space,
![]() | (3) |
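For orientation, the sketch below implements alignment and uniformity losses in the forms commonly used in the contrastive representation learning literature. These forms are an assumption: the authors' exact definitions are those of eqn (2) and (3) and the ESI, and may differ (for example, the uniformity term may act on code usage rather than on pairwise embedding distances). The temperature `t` is likewise an illustrative choice.

```python
import numpy as np

def alignment_loss(z, z_prime):
    """Mean squared distance between embeddings of paired structures, each of shape (n, d)."""
    return float(np.mean(np.sum((z - z_prime) ** 2, axis=-1)))

def uniformity_loss(z, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs of rows of z.

    Lower values indicate embeddings that are spread more evenly.
    """
    sq_dist = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    i, j = np.triu_indices(len(z), k=1)    # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dist[i, j]))))
```

Minimizing the first term pulls perturbed copies of the same metastable state together, while minimizing the second pushes distinct embeddings apart so that the latent space is used more evenly.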
Details on the design rationale of the models and training algorithms are provided in the ESI.†
By virtue of the regularity of the compact vector space p(zx, zs), adapting diffusion models to generate proteins is as straightforward as generating images or videos. We employ standard denoising diffusion probabilistic models (DDPMs),66 where the noise prediction model εθ(zt, t) is trained to reverse a Markov diffusion process which is equivalently defined by the following transition function of probability,
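For reference, the forward transition of a standard DDPM is usually written as below. The noise schedule β_t and the notation are assumptions taken from the general DDPM literature rather than from this text.

```latex
q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t I\right),
\qquad
q(z_t \mid z_0) = \mathcal{N}\!\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1-\bar{\alpha}_t) I\right),
\qquad
\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)
```

The second expression follows from composing the single-step Gaussians and allows z_t to be sampled directly from z_0 at any step t.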
The architecture of εθ(zt, t) follows that of diffusion transformers, utilizing 24 transformer layers with a hidden size of 512 to predict noise from the perturbed protein embedding zt. The following “noise-matching” loss is defined for training,
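A noise-matching objective of this kind can be sketched as follows, assuming the standard DDPM parameterization. The linear schedule values, the function names, and the callable `eps_model(z_t, t)` standing in for εθ are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def make_alpha_bar(n_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear schedule (assumed values)."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)

def noise_matching_loss(eps_model, z0, t, alpha_bar, rng):
    """Mean squared error between the sampled noise and the model's prediction at step t."""
    eps = rng.standard_normal(z0.shape)
    # Sample z_t directly from z_0 using the closed-form forward process.
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return float(np.mean((eps - eps_model(z_t, t)) ** 2))
```

In training, `t` would be drawn at random for each example and the loss averaged over a batch before taking a gradient step.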
After training, the (x, s) pair can be generated via the ancestral sampling scheme of DDPM, which samples the backward diffusion process, defined as:
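The ancestral sampling scheme can be sketched as below, assuming the standard DDPM update with the common variance choice σ_t² = β_t. The callable `eps_model` is a hypothetical stand-in for the trained noise predictor, and the schedule is supplied by the caller.

```python
import numpy as np

def ancestral_step(eps_model, z_t, t, betas, alpha_bar, rng):
    """One reverse step z_t -> z_{t-1}, using sigma_t^2 = beta_t (an assumed choice)."""
    alpha_t = 1.0 - betas[t]
    eps_hat = eps_model(z_t, t)
    mean = (z_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_t)
    if t == 0:
        return mean  # no noise is added at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(z_t.shape)

def sample(eps_model, shape, betas, rng):
    """Run the full reverse chain starting from standard Gaussian noise."""
    alpha_bar = np.cumprod(1.0 - betas)
    z = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        z = ancestral_step(eps_model, z, t, betas, alpha_bar, rng)
    return z
```

The output of `sample` corresponds to a clean embedding z(0), which would then be decoded into a structure and sequence.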
Alternatively, other samplers based on stochastic differential equations or ordinary differential equations can also be applied.25 The model architecture closely follows that of diffusion transformers, state-of-the-art image diffusion models that have been shown to scale effectively in image generation. Because we made minimal modifications to the model architecture and training algorithms, PT-DiT can readily adopt the latest advances in both training and sampling techniques developed for image diffusion models.
Although PT-DiT is trained to model the joint probability distribution of entire protein sequences and structures, the RePaint strategy allows it to be applied directly to many conditional generation tasks in protein engineering without additional training or fine-tuning. Mathematically, PT-DiT combined with RePaint algorithms solves the conditional generation problem p(zx, zs|zcx, zcs), where zcx and zcs represent context ProTokens and amino acid embeddings, respectively. For example, in the inverse folding task, zcx = zx corresponds to the full ProToken sequence derived from input backbone structures, while zcs = ∅. For scaffolding, zcx and zcs encode the structure and sequence information of the functional site that is preserved during generation.
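The core idea of RePaint-style conditioning can be sketched as a masked reverse step: positions covered by the context are re-noised from the clean context, while the remaining positions follow the ordinary reverse update. This is a simplified illustration that omits RePaint's resampling jumps; the function name, the mask convention, and the variance choice σ_t² = β_t are assumptions.

```python
import numpy as np

def repaint_step(eps_model, z_t, t, context, mask, betas, alpha_bar, rng):
    """One RePaint-style reverse step.

    mask is 1 where the context (known tokens) is given and 0 where content is generated.
    """
    # Generated region: ordinary DDPM reverse step.
    eps_hat = eps_model(z_t, t)
    mean = (z_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(1.0 - betas[t])
    noise = rng.standard_normal(z_t.shape) if t > 0 else 0.0
    z_unknown = mean + np.sqrt(betas[t]) * noise
    # Context region: noise the clean context to the level of step t - 1.
    if t > 0:
        ab_prev = alpha_bar[t - 1]
        z_known = (np.sqrt(ab_prev) * context
                   + np.sqrt(1.0 - ab_prev) * rng.standard_normal(z_t.shape))
    else:
        z_known = context
    return mask * z_known + (1.0 - mask) * z_unknown
```

Under this convention, inverse folding would set the mask to 1 on all structure-token channels and 0 on all sequence channels, whereas scaffolding would set it to 1 only at the functional-site positions.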
To demonstrate how PT-DiT with the RePaint algorithm can graft specific functional motifs onto a different scaffold, we use the CDR3 of antibodies as an example. As shown in Fig. 3e, we first crop the CDR3 from an antibody heavy chain with a known structure (PDB ID 5JXE).67 Since loop-like conformations are ubiquitous in protein structures, merely specifying a CDR3 loop as context cannot ensure that the generated structures belong to the antibody family, which exhibits specific structural constraints. Therefore, we need to precondition the contextual ProTokens towards antibody-like structures. Specifically, we first select a human germline structure as the template and then replace its CDR3 region with the to-be-grafted loop by superimposition. The backbone of this artificial grafted structure is encoded into backbone tokens, from which the CDR3 loop is cropped and set as context along with its sequence. The lengths of the flanking FWRs and CDRs can be sampled according to the distribution of the human germlines, although here we set them to the same values as in 5JXE for simplicity. Through inpainting sampling, we obtain ProTokens that can be decoded into all-atom structures containing the target CDR3 loop, as well as the amino acid sequence for the entire chain.
Therefore, a DDPM is also a latent generative model, where a deterministic one-to-one mapping between the input vectors z(t = 0) and the latent vectors z(t = 1) is constructed by solving the initial value problem (IVP) using any black-box ODE solver, such as an Euler or Runge–Kutta solver.
The strategy used to interpolate between two proteins in latent space is straightforward. Protein A and protein B, represented as zA(0) and zB(0), are first mapped into latent vectors zA(1) and zB(1) using the explicit 4th-order Runge–Kutta method. Their intermediate latent vectors are then obtained via linear interpolation:
zλ(1) = (1 − λ)zA(1) + λzB(1)
Subsequently, the intermediate proteins (zλ(0)) are obtained by solving the backward IVP.
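The encode, interpolate, and decode procedure can be sketched with a classical 4th-order Runge–Kutta integrator. In this illustration `toy_velocity` is a hypothetical stand-in for the drift of the learned ODE, and the step count is an arbitrary choice; only the overall structure mirrors the description above.

```python
import numpy as np

def rk4_solve(velocity, z, t_start, t_end, n_steps=100):
    """Integrate dz/dt = velocity(z, t) from t_start to t_end with classical RK4."""
    h = (t_end - t_start) / n_steps   # negative when integrating backward in time
    t = t_start
    for _ in range(n_steps):
        k1 = velocity(z, t)
        k2 = velocity(z + 0.5 * h * k1, t + 0.5 * h)
        k3 = velocity(z + 0.5 * h * k2, t + 0.5 * h)
        k4 = velocity(z + h * k3, t + h)
        z = z + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return z

def interpolate(velocity, z_a0, z_b0, lam):
    """Encode both endpoints to t = 1, mix linearly, and decode back to t = 0."""
    z_a1 = rk4_solve(velocity, z_a0, 0.0, 1.0)
    z_b1 = rk4_solve(velocity, z_b0, 0.0, 1.0)
    z_lam1 = (1.0 - lam) * z_a1 + lam * z_b1
    return rk4_solve(velocity, z_lam1, 1.0, 0.0)

def toy_velocity(z, t):
    """Hypothetical drift used only to make the sketch runnable."""
    return -0.5 * (1.0 + t) * z
```

Sweeping `lam` from 0 to 1 produces a path of intermediate embeddings whose endpoints recover proteins A and B up to the integration error.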
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5sc02055g
‡ These authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2025