Ria Vinod‡a, Pin-Yu Chen*b and Payel Das*b
aDepartment of Computational and Molecular Biology, Brown University, USA. E-mail: ria.vinod@brown.edu
bIBM Research, USA. E-mail: pin-yu.chen@ibm.com; daspa@us.ibm.com
First published on 23rd May 2025
Machine learning-guided solutions for protein learning tasks have made significant headway in recent years. However, success in scientific discovery tasks is limited by the accessibility of well-defined and labeled in-domain data. To tackle the low-data constraint, recent adaptations of deep learning models pretrained on millions of protein sequences have shown promise; however, the construction of such domain-specific large-scale models is computationally expensive. Herein, we propose representation reprogramming via dictionary learning (R2DL), an end-to-end representation learning framework that reprograms deep models trained on alternate-domain tasks so that they perform well on protein property prediction with significantly fewer training samples. R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences by learning a sparse linear mapping between English and protein sequence vocabulary embeddings. Our model can attain better accuracy and significantly improve data efficiency, by up to 10^4 times over the baselines set by pretrained and standard supervised methods. To this end, we reprogram several recent state-of-the-art pretrained English language classification models (BERT, TinyBERT, T5, and roBERTa) and benchmark them on a set of protein physicochemical prediction tasks (secondary structure, stability, homology, and solubility) as well as on a biomedically relevant set of protein function prediction tasks (antimicrobial, toxicity, antibody affinity, and protein–protein interaction).
The technique of pretraining deep learning models was proposed to address this issue. Pretraining methods leverage large amounts of sequence data and can learn to encode features that explain the variance seen in sequences across biological task-specific training samples. In the context of protein sequences, pretraining has enabled meaningful density modeling across protein functions, structures, and families.12 In this work, we reference two types of pretraining methods: (i) unsupervised pretraining, where all data are unlabeled, and (ii) self-supervised pretraining, where training labels are constructed automatically from the unlabeled data itself (e.g., by masking tokens and predicting them). Large models are then pretrained on massive amounts of unlabeled data, specifically biological sequences, which are available at scale. Once pretrained, these foundation models (FMs)13 are fine-tuned on smaller amounts of labeled data corresponding to a specific downstream task. Interestingly, for the large-scale models pretrained on protein sequences, biological structure and function seem to emerge in the learned protein representation, even though such information was not included in model training.10
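To make the distinction concrete, the snippet below is a minimal, illustrative sketch of how self-supervised labels can be derived from unlabeled protein sequences by masking residues and asking a model to recover them; the sequence, mask rate, and mask token are hypothetical and do not correspond to any dataset or model used in this work.

```python
import random

def mask_residues(sequence, mask_rate=0.15, mask_token="<mask>"):
    """Build a self-supervised (input, target) pair from an unlabeled
    protein sequence: randomly masked positions become prediction targets."""
    tokens = list(sequence)
    targets = [None] * len(tokens)      # None marks positions that are not scored
    for i, residue in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = residue        # the label is the original residue
            tokens[i] = mask_token
    return tokens, targets

# Illustrative unlabeled sequence (not taken from any dataset in the paper).
masked_input, targets = mask_residues("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```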
Although highly powerful, training such domain-specific foundation models from scratch is highly resource-intensive.14 For example, one training run of BERT (the language model considered in this work) learns 110 million parameters, costs up to $13,000 USD, takes 64 days (without parallelized computing), and results in 0.7 tons of carbon emissions.15 A single training run of another popular language model, the T5 transformer, learns 11 billion parameters, costs up to $1.3 million USD, takes 20 days, and results in 47 tons of carbon emissions.16,17 Such pretrained language models and their size variants are abundantly available with the advent of model libraries (e.g., Hugging Face18), which host pretrained models and datasets. The scale of data, computing, and financial resources required to train these models is not only available to a limited number of researchers but is also infeasible for applications with limited labeled data. To this end, we propose a lightweight and more accurate method that leverages large-scale pretraining from alternative domains. Specifically, we introduce a method to reprogram open-sourced, high-capacity classification foundation models that are trained on data from a different domain. This setting calls for innovations in cross-domain transfer learning, which remains largely unexplored, particularly in scientific domains.
One known fact is that biological sequences are similar to natural language in that they also contain long-range dependencies and follow Zipf's law.19 These sequences and their associated dependencies are crucial for determining structural and functional properties. Such similarity has motivated the use of deep learning architectures and mechanisms that are widely used in natural language processing (NLP) to build protein sequence models from scratch. Work along this line of inquiry consists of training models jointly on protein and text sequences,20–22 by minimizing a contrastive loss on the representation of text obtained from an English language encoder and the representation of protein sequences obtained from an amino-acid encoder. However, this style of work relies on the availability of high-performance in-domain protein language models. In this work, we remove any reliance on pretrained protein models by exploring an alternative warm-start paradigm, i.e., how to effectively and efficiently reprogram an existing, fully trained large English language model to learn a meaningful (i.e., biomedically relevant) representation of protein sequences. Our proposal can attain better data efficiency because it only requires in-domain task-specific labeled data and spares the need for accessing in-domain unlabeled data for pretraining. The goal is to create a more carbon-friendly, resource-efficient, and broadly accessible framework that motivates different scientific domains toward democratizing the representation power of large AI models. This warm-start paradigm is defined by the framework's ability to match the performance of transformers that are pretrained on billions of tokens in alternative domains, with a lighter-weight training procedure similar to that of a standard supervised classifier trained from scratch. In particular, we consider highly specific biological and biomedical protein sequence datasets (illustrated in Fig. 1) which have far fewer samples than standard supervised language task datasets. Reprogramming thus provides a more data- and resource-efficient approach to developing models with deep representational capacity and strong performance on downstream protein tasks. Reprogramming has previously been explored as a cross-domain transfer learning methodology:26,27 language models have been reprogrammed for alternate text classification tasks,28 acoustic models for time series classification,29 and ImageNet classification models for alternate image classification tasks. However, none of these methods investigated mappings between domains that require a very high representational capacity (from natural language to biological sequences), which is the setting we require in the protein sequence domain.
Fig. 1 (Left) Descriptions of the considered predictive tasks. We select the set of physicochemical property prediction tasks from the well-studied domains in ref. 11 and the biomedical function prediction tasks from studies with biomedically relevant small-sized labeled datasets.9,23,24 (Center) Comparison of R2DL to in-domain pretraining and standard supervised training methods. We refer to supervised methods as standard supervised classifiers that are trained from scratch on labeled data alone. Depending on how labeled and unlabeled data are used in pretraining, we consider in-domain pretraining to constitute unsupervised/supervised pretraining schemes using in-domain data. (Right) A comparative table showing the broad adaptability of the R2DL framework. In comparison to the existing gold standard methods, R2DL has a broader utility across different domains, sizes of training datasets, and data efficiency. We categorize supervised methods as cross-domain adaptable through various domain adaptation and transfer learning techniques.25
Toward this goal, we introduce R2DL (representation reprogramming via dictionary learning), a novel cross-domain transfer learning framework that reprograms an existing pretrained large-scale deep-learning model of the English language, such as BERT,15 to learn and predict physicochemical and biomedical properties of protein sequences. The success of pretraining in language models (LMs)15,17,30,31 is drawn from the principles of fine-tuning and in-context learning. With models of a large enough size, several studies have demonstrated emergent properties in language models.32 That is, knowledge about the language task emerges in the learned representations of the models, even when there are very few in-context examples.30 Protein language models (PLMs) have demonstrated similar emergent properties, where learned representations exhibit knowledge about the structure and function of proteins, even when this information was not included in the training data. The success of LMs in the protein sequence domain can be attributed to structural similarities between the English and protein sequence data modalities. Recent work demonstrates that similarities appear between English grammar and protein domain grammar.24 Just as language follows linguistic rules, referred to as “grammar”, only a subset of protein domains is viable under evolution. These domains are determined by protein sequences, which are thus governed by evolutionary rules, or a “proteome grammar”. Recent work formalizes this concept, showing through a linguistic probe analysis that English (text sequences) and amino acid (protein) sequences both follow a Zipfian distribution, confirming the presence of a quasi-universal grammar.24 Under this paradigm, R2DL emerges as an intuitive method for cross-modal learning between the language and protein domains.
In step 3, the system maps the source task labels (e.g., positive/negative sentiments) to target task labels (e.g., toxic/non-toxic proteins) and optimizes the embedding mapping parameters based on the task-specific loss evaluated on a given protein sequence dataset. Finally, in step 4, the reprogrammed model is deployed for test-time evaluation.
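As a concrete illustration of step 3, the snippet below sketches a one-to-one output-label mapping for a binary sentiment source task and a binary toxicity target task; the class names are assumptions made for illustration only.

```python
# Hypothetical one-to-one correspondence between source (sentiment) classes
# and target (protein toxicity) classes, used to reinterpret the frozen
# source model's predictions in the protein label space.
SOURCE_TO_TARGET = {"positive": "toxic", "negative": "non-toxic"}

def map_prediction(source_class: str) -> str:
    """Translate a source-task prediction into a target-task label."""
    return SOURCE_TO_TARGET[source_class]

assert map_prediction("negative") == "non-toxic"
```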
To the best of our knowledge, our work remains the first to address reprogramming in any biological, and more broadly, scientific domain. We posit reprogramming as an alternative to fine-tuning, a method in which significantly less in-domain training data is required to achieve the same predictive performance across protein tasks. Further details on the model-data paradigm of fine-tuning versus reprogramming are provided in the ESI (Appendix 1†). In Fig. 1, we illustrate the set of protein physicochemical and functional property prediction tasks we consider, the baseline methods against which we compare R2DL's performance, and a brief description of R2DL's advantages over these existing methods. We test the reprogrammed model on a range of biomedically relevant downstream physicochemical property, structure, and function prediction tasks, which include prediction of the secondary structure, homology, mutational stability, and solubility, as well as the antimicrobial nature, toxicity, antibody affinity, and protein–protein interaction of proteins. Each of these tasks involves learning on datasets that are limited to a few thousand labeled samples, at least an order of magnitude smaller than what is needed to train an in-domain foundation model or a large in-domain protein language model.33 R2DL uses dictionary learning, a machine learning framework that finds the optimal sparse linear mapping between the English vocabulary embeddings and the amino acid embeddings. To do so, a protein property prediction task-specific loss is used to learn the optimal parameters of the reprogrammed model. We train R2DL in a supervised setting with the downstream protein prediction task datasets, which are labeled and small in size (illustrated in Fig. 1). R2DL demonstrates consistent performance improvement over existing baselines across seven different physicochemical (e.g., up to 11% in stability), structural, and functional property prediction (e.g., up to 3% in toxicity) tasks of proteins. We estimate R2DL to be over 10^4 times more data-efficient than existing in-domain pretraining methods. We further demonstrate the performance robustness of R2DL when trained on reduced-size versions of the supervised protein datasets. In addition, we show that R2DL learns to encode physicochemical and biomedical properties in the learned representations, even in limited-data scenarios. This work thus blazes a path toward efficient and large-scale adaptation of existing foundation models to different real-world learning tasks and accelerates scientific discovery, which naturally involves learning from limited real-world data.
For a protein sequence classification task, we use the source model trained on an alternative-domain language task for which there are n_s sentence output classes (e.g., “positive” and “negative” for sentiment classification) and n_t protein sequence classes (e.g., “toxic” and “non-toxic”), where n_s ≥ n_t. The output-label mapping h is then a simple one-to-one surjective correspondence between the source task labels and the target task labels, which ensures that every target class is assigned a unique source class (e.g., positive → toxic and negative → non-toxic). For a regression task, R2DL uses a mapping between the regression values in the protein sequence feature space and the classification probability values in the source model embedding space; it does so by learning optimal thresholds of regression values that map to the source model class labels. The input data of the source English language model are tokenized at the word level. These tokens form the atoms of our dictionary representation VS, a matrix whose rows correspond to embedding vectors of source tokens. The input data of the target task, protein sequences, are tokenized at the character level with only 20 distinct tokens (corresponding to the set of 20 discrete natural amino acid characters). R2DL obtains VS from the learned embeddings of the source model and learns to represent VT, the matrix of target token embeddings, as a weighted combination of the English token embeddings. We propose token reprogramming by approximating a linear mapping between VS and VT. That is, we aim to find a transformation of the latent representation of the protein sequences such that it can be embedded in the pretrained language model's latent space, enabling R2DL to leverage these re-embedded tokens for learning. Specifically, we learn the linear map Θ by approximating a dictionary using a k-SVD solver.35 That is, we want to approximate VT = ΘVS. The k-SVD solver guarantees a task-specific level of sparsity in the coefficients when linearly combining English token embeddings to represent a protein sequence token embedding. In other words, it helps select k English tokens and uses their linearly combined embeddings as the embedding of a target token. Additionally, with a one-to-one label mapping function from the English text labels to the protein sequence labels, we are able to use the pretrained language model for inference on the embedded protein dataset, VT. We thus design an end-to-end reprogramming framework for any arbitrary protein sequence classification or regression task.
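The sketch below illustrates the dictionary step under stated assumptions: the matrix shapes and the sparsity level k are arbitrary, and scikit-learn's OMP-based sparse coder stands in for a full k-SVD solver (the dictionary VS is held fixed, as in R2DL).

```python
import numpy as np
from sklearn.decomposition import sparse_encode

# Illustrative shapes: b source (English) tokens and a = 20 target
# (amino-acid) tokens, all embedded in an m-dimensional space.
b, a, m, k = 1000, 20, 64, 10
rng = np.random.default_rng(0)
V_S = rng.standard_normal((b, m))                  # dictionary atoms (rows)
V_S /= np.linalg.norm(V_S, axis=1, keepdims=True)  # normalize atoms for OMP
V_T = rng.standard_normal((a, m))                  # target token embeddings

# Sparse coding: find Theta (a x b) with at most k non-zero coefficients
# per row such that V_T is approximately Theta @ V_S.
Theta = sparse_encode(V_T, V_S, algorithm="omp", n_nonzero_coefs=k)

assert Theta.shape == (a, b)
print("reconstruction error:", np.linalg.norm(V_T - Theta @ V_S))
```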
To approximate the dictionary mapping, we use a k-SVD solver, and Θ is then optimized with updates evaluated on the cross-entropy loss. The critical settings that determine the approximate convergence of the k-SVD algorithm are the initialization of the dictionary and the choice of k, the number of non-zero coefficients. We initialize the dictionary with the results of k-means clustering on VS, a common approach. To adequately capture the underlying data distribution, we perform a grid search to identify the best choice of k. For each task-specific instance of R2DL, we run the k-SVD iterations until convergence. The number of iterations used for each model can be found in Appendix 4 (ESI†).
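A possible grid search over k is sketched below; it scores each candidate by reconstruction error, which is a placeholder criterion (task-specific validation performance, as used for each R2DL instance, is an equally valid choice), and it reuses the V_S and V_T matrices from the previous sketch.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

def reconstruction_error(V_T, V_S, k):
    """Score a candidate sparsity level k by how well k atoms per target
    token reconstruct the target embeddings."""
    Theta = sparse_encode(V_T, V_S, algorithm="omp", n_nonzero_coefs=k)
    return np.linalg.norm(V_T - Theta @ V_S)

def grid_search_k(V_T, V_S, candidates=(2, 5, 10, 20, 50)):
    """Return the candidate k with the lowest score."""
    scores = {k: reconstruction_error(V_T, V_S, k) for k in candidates}
    return min(scores, key=scores.get)

# best_k = grid_search_k(V_T, V_S)   # V_T, V_S as in the previous sketch
```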
We then apply the assigned label mapping h for protein classification tasks, or thresholding for regression tasks, and train the mapping function Θ using gradient-based optimization evaluated on the task-specific cross-entropy loss. Details for the R2DL training procedure are given in the Method section.
Fig. 3 (a) Prediction accuracy and (b) data efficiency of R2DL, pretraining methods, and standard supervised training methods (trained from scratch). The prediction accuracy is the top-1 accuracy for each task-specific instance of R2DL. Each downstream task has a different number of classes, n, for the n-way classification tasks. Data efficiency is defined as the ratio of prediction accuracy to the number of in-domain samples used in training (including pretraining and fine-tuning methods). R2DL attains accuracy comparable to in-domain pretraining methods while achieving high data efficiency. Error bar values and variance are reported in Table S19.† Model and training details on the reported performance of each downstream task are available in the ESI.†
Fig. 4 Biologically relevant correlations learned by R2DL. (a–c) Clustering of R2DL learned embeddings for the secondary structure prediction, toxicity prediction, and antibody affinity prediction tasks. When tagged by protein property classification, we see high correspondence between each cluster and the protein sequences with the same physicochemical or biomedical property classification. Rives et al.37 show that “biochemical properties of amino acids are represented in the Transformer model's output embeddings”. We use their result as ground-truth annotations to draw parallels with our finding of token representations in R2DL embeddings: leucine (L) and proline (P) are both polar; cysteine (C) and methionine (M) are both polar; tryptophan (W) and isoleucine (I) are both hydrophobic. We find these biochemical similarities between amino acids consistent with our results. (d) For the antibody affinity prediction task, we observe a high correlation coefficient along the diagonal. This shows that the representation learned by R2DL is highly similar to empirical observations of pairwise residue correlations. See Appendix 7 (ESI†) for comparisons to in-domain pretraining.
We further compute the Euclidean distances between the last-layer latent representations of the amino acid embeddings and compare them to pairwise evolutionary distances computed with the BioPython module. In Fig. 4(d), we show the Euclidean distances between the latent embeddings learned by the R2DL model and the pairwise evolutionary distances between protein sequences, as estimated using the BLOSUM62 matrix through the pairwise alignment functionality of the BioPython module. The matrix shows a correlation close to 1.0 along the diagonal, indicating a near-perfect correspondence between the learned representation and empirical observations of amino acid relatedness. R2DL thus captures the underlying structure of the linear sequence of amino acid residues in protein sequences in the context of the reprogrammed protein task.
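The snippet below sketches one way to set up such a comparison at the amino-acid level with Biopython's substitution_matrices module; the embedding vectors are random stand-ins for the representations extracted from the reprogrammed model, and BLOSUM62 scores (higher meaning more related) are correlated against the negative of the latent Euclidean distances.

```python
import numpy as np
from Bio.Align import substitution_matrices

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

# Stand-in for the last-layer amino-acid embeddings learned by R2DL.
rng = np.random.default_rng(0)
embeddings = {aa: rng.standard_normal(64) for aa in AMINO_ACIDS}

# Pairwise Euclidean distances between learned representations.
latent_dist = np.array([[np.linalg.norm(embeddings[a] - embeddings[b])
                         for b in AMINO_ACIDS] for a in AMINO_ACIDS])

# BLOSUM62 substitution scores as a proxy for evolutionary relatedness.
blosum62 = substitution_matrices.load("BLOSUM62")
evo_score = np.array([[blosum62[a, b] for b in AMINO_ACIDS]
                      for a in AMINO_ACIDS])

# Correlate the two views of amino-acid similarity.
corr = np.corrcoef(-latent_dist.ravel(), evo_score.ravel())[0, 1]
print(f"correlation between latent and evolutionary similarity: {corr:.2f}")
```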
While R2DL does make gradient updates in the framework, the data and resource requirements of the R2DL method are much lower than those of any unsupervised or self-supervised pretraining approach for protein sequence modeling. Though R2DL has the same data and resource requirements as any standard supervised training approach, it demonstrates much higher task accuracy across a broad and diverse range of property prediction tasks. We claim that R2DL is able to do this because of efficient cross-domain model adaptation via reprogramming, which standard supervised models cannot achieve without an unjustifiably large number of parameters. R2DL is thus more efficient than the existing baseline models in the following aspects: (i) R2DL only requires a pretrained transformer (trained on English language data) and a small, labeled protein sequence dataset of interest. We do not make any updates to the pretrained model itself, unlike traditional transfer learning methods. Rather, we make updates to the R2DL model during a supervised training process that optimizes over class-mapped labels. (ii) R2DL does not require large-scale un/self-supervised pretraining on millions of unlabeled protein sequences, as in ref. 9–11. (iii) Furthermore, R2DL does not require any large-scale supervised pretraining, which has been found to be beneficial in protein-specific tasks11 as well as in computer vision.43 Labeling protein sequences at scale, particularly for biomedical functions, is almost infeasible at the dataset sizes required for supervised pretraining. With these three considerations in mind, we pose R2DL as a data-efficient alternative to pretraining methods for protein prediction tasks of biological and biomedical relevance. To the best of our knowledge, R2DL is the first framework without explicit pretraining that facilitates accurate predictions across a general suite of protein prediction tasks and provides interpretable correspondences between amino acid features that are closely aligned with domain knowledge (evolutionary distances). The success of R2DL can be attributed to its ability to encode protein tokens as sparse combinations of English token embeddings, leveraging the language modeling capacity of large language models for efficient learning on protein structure and function prediction tasks, as both English and protein sequences follow Zipf's law.19 We first demonstrate the effectiveness of R2DL on a set of physicochemical structure and property prediction tasks and then on a set of biomedically relevant function prediction tasks for protein sequences. We show predictive performance improvements against pretrained methods (up to 11% in stability) and standard supervised methods (up to 3.2% in antibody affinity). Similarly, on the remaining tasks, we show performance improvements over the best reported baseline in structure prediction (4.1%), homology (2.3%), solubility (7.1%), antibody affinity (3.2%), toxicity (2.4%), and PPI (1.6%). R2DL thus shows the capability to learn a general representation of protein sequences that can be efficiently adapted to different downstream protein tasks. These powerful representation capabilities are evidenced by its ability to achieve high performance across protein datasets with highly varied numbers of task-specific training samples. The performance of R2DL across protein tasks shows the potential to repurpose and develop powerful models that can learn from small, curated, and function-specific datasets.
This reduces the need to train large pretrained models for peptide learning tasks. We thus provide an alternative to pretraining that is cheaper to run and more accurate, and that is therefore accessible to broader research communities who may not have access to large-scale computing. This potential is critical for many applications, such as the discovery of new materials, catalysts, and drugs.
The success of the R2DL reprogramming framework depends on the available pretrained English language transformers and protein sequence datasets. In this work, we consider several standard BERT-style transformers due to the open-sourced codebases available at the time. However, the field has since seen many advancements in large language models that perform strongly across several benchmarks but are not open sourced. It is possible that adversarially reprogramming newer generative models, such as the GPT family or Gemini, could improve cross-domain transfer learning to protein sequence tasks. Additionally, k-SVD only approximates the dictionary weights, Θ. While we choose hyperparameters based on the apparent convergence of the k-SVD algorithm, it is a computationally expensive method that is feasible for models on the order of 109 million parameters. Reprogramming larger models with R2DL could require a more computationally efficient approximation algorithm.
We report the performance of the R2DL approach trained with different source natural language models in Table S20.† These results show that the vocabulary, distribution of training data, model size, and training regime of the source model can all affect the performance of R2DL. While we include a preliminary analysis of this question, we leave a more exhaustive study of the choice of source model to future work.
• Procedure hyperparameters: the maximum number of iterations T1 for updates to Θ, the number of iterations T2 for k-SVD, and the gradient step size.
• Procedure initialization: randomly initialize Θ and obtain the source token embedding matrix VS.
• Define the objective function: the objective for k-SVD is ‖VT − ΘVS‖ ≤ ε.
• k-SVD approximation of Θ: if t1 ≤ T1, then while t2 ≤ T2, use approximate k-SVD to solve VT ≈ ΘVS and set t2 ← t2 + 1.
• Calculate the loss and perform gradient descent: update Θ, set t1 ← t1 + 1, and return to the k-SVD step.
• Output: the predicted labels for a protein sequence x of the target task (a minimal sketch of this procedure is given below).
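The following is a minimal PyTorch-style sketch of the procedure above, under several assumptions: the pretrained classifier C is treated as a frozen black box that accepts externally supplied token embeddings and returns source-class logits, the inner k-SVD iterations (T2) are collapsed into a single OMP sparse-coding pass, plain gradient descent is used for the update step, and all names and default values are illustrative.

```python
import torch
from sklearn.decomposition import sparse_encode

def r2dl_train(C, V_S, num_target_tokens, batches, label_map,
               T1=50, k=10, step_size=1e-2):
    """Sketch of the R2DL procedure: alternate a sparse-coding
    approximation of Theta (standing in for k-SVD) with gradient steps
    on the task cross-entropy loss. V_S is the (b x m) source embedding
    matrix (NumPy); label_map sends target class ids to source class ids."""
    V_S_t = torch.as_tensor(V_S, dtype=torch.float32)
    b, _ = V_S_t.shape
    Theta = torch.randn(num_target_tokens, b, requires_grad=True)  # random init
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(T1):
        # k-SVD-style approximation step: re-encode the current target
        # embeddings V_T = Theta V_S over the fixed dictionary V_S with
        # at most k non-zero coefficients per target token.
        with torch.no_grad():
            V_T = (Theta @ V_S_t).numpy()
            code = sparse_encode(V_T, V_S, algorithm="omp", n_nonzero_coefs=k)
            Theta.copy_(torch.as_tensor(code, dtype=torch.float32))

        # Gradient step on the task-specific cross-entropy loss; only Theta
        # is updated, the source model C stays frozen.
        for token_ids, y_target in batches:        # one sequence per batch
            x_emb = (Theta @ V_S_t)[token_ids]      # reprogrammed embeddings
            logits = C(x_emb)                       # assumed shape (1, n_s)
            y_source = torch.tensor([label_map[y_target]])
            loss = loss_fn(logits, y_source)
            loss.backward()
            with torch.no_grad():
                Theta -= step_size * Theta.grad
                Theta.grad.zero_()
    return Theta
```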
We are given a pretrained English classifier, C, and a labeled protein sequence target-task dataset. We also encode an output label mapping function h specifying the one-to-one correspondence between source and target labels. As shown in Fig. 2, the source vocabulary embedding, VS, is extracted from the pretrained model, C. The next objective is to learn Θ, which approximates the embedding of the target-task tokens (denoted as VT) in the representation space of the source model. We aim to learn Θ ∈ R^(a×b) that finds the optimal coefficients {θt} for each of the target tokens t ∈ {1,…, a} in VT ∈ R^(a×m) to be represented as a sparse encoding of the dictionary VS ∈ R^(b×m), such that VT = ΘVS. For a given target protein sequence x, Θ is used to perform the target task by embedding x with the reprogrammed token embeddings ΘVS, passing the result through C, and applying the label mapping h. While we do not make any modification to the parameters or architecture of C, we assume access to the gradient ∇Θ loss(·) for loss evaluation and parameter updates during training.
A target token embedding vt ∈ R^m can be represented as a sparse linear combination of the source token embeddings (rows) in VS, vt = θtVS. Here, vt is the representation of the protein token in the dictionary space and satisfies ‖vt − θtVS‖_p ≤ ε, where ‖·‖_p is an Lp norm and θt is made sparse by requiring ‖θt‖_0 ≤ k for all t. An exact solution vt = θtVS is computationally expensive to find and is subject to various convergence traps, so for the purpose of our efficient fine-tuning approach, we approximate vt ≈ θtVS using k-SVD. We first fix the dictionary VS, as extracted from C, and then find the optimal Θ by minimizing the alternative objective subject to ‖VT − ΘVS‖_F^2 ≤ ε, as explored in ref. 35. While algorithms exist to choose an optimal dictionary (an exact solution to k-SVD) that can be continually updated,35 we prioritize computational efficiency (forgoing only statistically insignificant improvements in accuracy) by using a predetermined number of k-SVD iterations, after which we evaluate the cross-entropy loss on the target-task dataset and update the mapping function Θ. The ESI† contains citations to ref. 44–51, including background materials for the relevant datasets and methods.
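For reference, the two coupled objectives described above can be written compactly as follows, writing 𝒟_T for the labeled target-task dataset and CE for the cross-entropy loss (notation introduced here for compactness); h maps the frozen source model's class output into the target label space.

```latex
% Sparse dictionary approximation of the target token embeddings:
\min_{\Theta} \; \lVert V_T - \Theta V_S \rVert_F^2
\quad \text{s.t.} \quad \lVert \theta_t \rVert_0 \le k
\;\; \forall\, t \in \{1, \dots, a\},
%
% followed by task-driven refinement of \Theta with the source model C frozen:
\min_{\Theta} \; \sum_{(x,\, y) \in \mathcal{D}_T}
\mathrm{CE}\!\left( h\!\left( C\!\left( x;\, \Theta V_S \right) \right),\, y \right).
```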
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00195h
‡ The majority of this work was done during Ria Vinod's internship at IBM Research.
This journal is © The Royal Society of Chemistry 2025