Rosario Vanella†ab, Gordana Kovacevic†ab, Vanni Doffini ab, Jaime Fernández de Santaella ab and Michael A. Nash*ab
aDepartment of Chemistry, University of Basel, 4058 Basel, Switzerland
bDepartment of Biosystems Science and Engineering, ETH Zurich, 4058 Basel, Switzerland. E-mail: michael.nash@unibas.ch
First published on 24th January 2022
Enzyme engineering is an important biotechnological process capable of generating tailored biocatalysts for applications in industrial chemical conversion and biopharma. Typical enhancements sought in enzyme engineering and in vitro evolution campaigns include improved folding stability, catalytic activity, and/or substrate specificity. Despite significant progress in recent years in the areas of high-throughput screening and DNA sequencing, our ability to explore the vast space of functional enzyme sequences remains severely limited. Here, we review the currently available suite of modern methods for enzyme engineering, with a focus on novel readout systems based on enzyme cascades, and new approaches to reaction compartmentalization including single-cell hydrogel encapsulation techniques to achieve a genotype–phenotype link. We further summarize systematic scanning mutagenesis approaches and their merger with deep mutational scanning and massively parallel next-generation DNA sequencing technologies to generate mutability landscapes. Finally, we discuss the implementation of machine learning models for computational prediction of enzyme phenotypic fitness from sequence. This broad overview of current state-of-the-art approaches for enzyme engineering and evolution will aid newcomers and experienced researchers alike in identifying the important challenges that should be addressed to move the field forward.
There is therefore an urgent need for improved methods for molecular analysis and enhancement of therapeutic enzymes. Naturally occurring enzyme sequences are typically not suitable as biopharmaceuticals due to a general lack of stability, developability, and/or activity. In this context, molecular enhancement by improvement of colloidal stability, catalytic turnover rate, substrate binding affinity, and/or sensitivity to environmental conditions are essential steps in enabling therapeutic enzymes to reach their full potential. The establishment of rapid design, build, test, and learn cycles and the analysis of large-scale sequence-function relationships will be crucial for the advancement of therapeutic enzymes towards clinical translation.
Laboratory directed evolution is by now a well-established paradigm for improving enzyme properties, having been recently awarded a Nobel Prize.3 This process mimics natural evolution by applying selection pressure on a library of genetic variants of a parent enzyme sequence, and propagating proteins with the desired function into subsequent generations, which are then further subjected to diversification and phenotypic screening/selection. Despite the general success of enzyme directed evolution, current technologies only scratch the surface of the vast space of protein sequences. New methods for efficiently exploring productive sequence space and rapidly screening phenotypes are therefore as important as ever. Recent commoditization of massively parallel DNA sequencing technologies (i.e., next-generation sequencing) is further providing new capabilities for generating large datasets of sequence-function pairings. The purpose of this review article is to explore recent developments within several thematic areas of enzyme engineering with an emphasis on screening method design and new workflows supported by next-generation sequencing and machine learning.
As the need for improved enzymes in various industries has increased over the years, methods for activity detection in enzyme engineering experiments have been fine-tuned. It has become important that enzyme assays have high sensitivity and are adapted to medium- and high-throughput screening campaigns. The numerous examples of coupled assays used in directed evolution and enzyme engineering outlined below have demonstrated the versatility of enzyme cascades in this context.
More recently, Begander and colleagues11 developed a similar two-step reaction scheme to assess the enzymatic activity of a D-glycerate dehydratase. In this work, the second reaction uses the same set of enzymes to convert pyruvate into Bindschedler's green dye. It is noteworthy that the key elements of a previously built pathway were successfully transferred to a screening system for a new target enzyme. This demonstrates the capacity of enzyme cascades to widen the applicability of pre-existing screening systems.
Other directed evolution experiments have not only used multi-enzyme cascades to produce a readable output, but also evolved multiple enzymes within a cascade simultaneously. These directed co-evolution experiments targeted two cellulases (an endoglucanase and a β-glucosidase) expressed from a single operon in E. coli. The operon was targeted by error-prone PCR to generate mutagenic libraries. Screening took place as a result of co-expression, where the conversion of the insoluble substrate to oligosaccharides was catalysed by the endoglucanase. Subsequent activity of β-glucosidase produced glucose, which was in turn used as a substrate to the glucose oxidase/HRP cascade that produced a colorimetric dye.12,13 Most interestingly, this experimental setup enabled screening for synergistic effects between individual components in the cascade. The enzymes were evolved individually and simultaneously, with the latter approach proving more effective.
The development of microfluidic high-throughput screening (HTS) methods to detect changes in absorbance is also of special interest for enzyme evolution. The technique allowed in vitro evolution of an L-phenylalanine dehydrogenase by coupling its activity to a reaction that forms a formazan dye through the oxidation of NADH. This work opens a window of opportunity to evolve a wide range of enzymes with HTS methods, which were previously unavailable.14 This system does not use enzyme cascades directly coupled to the reaction of interest, but instead relies on a tetrazolium dye for coupling, an approach that allowed for a 25-fold improvement in detection when compared to direct NADH detection. This shows the power of signal amplification and the applicability of coupled assays within the setting of directed evolution in HTS methods.14
Fluorescence-based detection is generally more sensitive than absorbance-based detection, and much of the recent research has focused on creating fluorescent outputs from enzyme-coupled reporter systems. For example, directed evolution of geraniol synthetase was enabled by the implementation of an enzyme-coupled assay in vivo. Activity of this enzyme resulted in the accumulation of the reduced cofactor NADH, which served as a co-substrate for a secondary reaction catalyzed by diaphorase, resulting in production of the red fluorescent compound resorufin.15 Such a strategy was used for developing a HTS cellulase assay in which expressed variants of cellulases were isolated in droplets together with their encoding genes, the reaction substrate (i.e. carboxymethyl cellulose), and the readout enzymes hexose oxidase and vanadium bromoperoxidase. The former enzyme is a promiscuous oligosaccharide oxidiser which produces H2O2 whilst the latter is the output enzyme producing a positively charged fluorophore in proportion to H2O2 abundance.16,17 Enzyme coupling has also aided the efficient engineering of highly stereoselective cyclohexylamine oxidases using droplet-based HTS methods, where horseradish peroxidase couples the activity of the oxidase with the fluorogenic dye Amplex UltraRed.18
Another approach to enzyme HTS focused on selectively labelling cell surfaces with a fluorophore in order to screen active variants within a library. This strategy relies on cell-surface display of active enzymes and a subsequent enzyme-coupled assay that triggers labelling of the cell surfaces. Some of the earliest examples of applying this HTS method to directed evolution generated an enantioselective esterase by displaying esterases and peroxidases on the cell surface of E. coli. To do so, the different enantiomers were fluorescently labelled and when the esterase was active, the fluorophore from the substrate was released. This enabled the peroxidase to covalently bind it to cell-surface proteins. Finally, the positive variants were sorted using fluorescence activated cell sorting (FACS).19 Further research adapted this technology to the yeast S. cerevisiae and combined it with microfluidics to evolve glucose oxidase (GOx). Cells expressing a library of randomised variants of the enzyme were emulsified in single water-in-oil microdroplets together with the substrate (glucose), a reporter enzyme (HRP) and a fluorescent substrate for the reporter enzyme (fluorescein tyramide), which was covalently linked to the cell surface of yeast when hydrogen peroxide was produced by GOx. After incubation with the enzyme cascade, the oil phase was removed and the labelled cells were analysed using FACS.20 The use of microfluidics inhibited crosstalk and allowed the use of a longer enzyme cascade without requiring the display of both components. However, since the fluorophore had to be covalently linked to the cell-surfaces, lower signal amplification was observed.
Implementing enzyme cascades in evolutionary workflows not only allows detection of a large variety of products but can also amplify signals and provide easily detectable products with a high signal to noise ratio.21 Moreover, the introduction of cascades avoids the accumulation of products in the reaction vessel and can mitigate issues such as product inhibition and product toxicity.22 Furthermore, enzyme cascade readout systems can avoid the requirement of using chemically modified substrate analogues, avoiding bulky fluorescent groups and allowing enzyme variants to be screened on natural substrates.23 Finally, multi-step enzyme cascades offer opportunities to increase screening throughput by in some cases providing an optically readable output that can be evaluated at higher speed and throughput than conventional chemical analysis such as mass spectrometry or liquid chromatography. Enzyme cascades when combined with novel reaction compartmentalization strategies have the potential to enable more efficient evolution workflows. Based on these advantages, the establishment of novel enzyme cascades can broaden the scope of possible enzyme targets that can be screened and studied by directed evolution.
Besides these well-established compartments used in HTS, new methods are regularly being developed to increase throughput and sensitivity of the screening process. Some of the more recent examples include screening in microcapillary arrays,27 microbeads,28 or liposomes.29 Microcapillary array screening offers the advantage of cell spatial separation comparable to MTP platforms but with significantly higher throughput. This method uses a sorting method based on a pulsed ultraviolet laser, and extracts cells from a microcapillary with high selectivity and viability. The versatility of this screening method has been shown in the engineering of binders, fluorescent proteins and enzymes. The authors point out several distinctions between high-throughput FACS-based screening and microcapillary array screening. These include the possibility of distinguishing enzyme variants based on reaction kinetics instead of a single-point fluorescence intensity, as well as the possibility of direct cell imaging and of decoupling cell analysis from sorting.27
Recently a FACS-based HTS system using E. coli cells as natural compartments was reported for monoamine oxidase, where all the above-mentioned considerations were made.31 Sadler et al. used an acetylated fluorescein derivative as an indirect reporter probe that could diffuse into the cell, where the intracellular esterases cleaved acetyl groups, leaving the probe susceptible to oxidation in the presence of H2O2 and endogenous peroxidases. This approach generated a fluorescent compound (Fig. 1c) that served as the signal for single-cell sorting. This screening method was shown to be versatile, and many different substrates could be screened using the same assay. A similar approach was used for directed evolution of P450 BM3 monooxygenase, in which 7-benzoxy-3-carboxycoumarin ethyl ester underwent intracellular de-esterification and subsequent dealkylation by P450, forming a fluorescent coumarin derivative.32
In developing approaches that leverage appropriate substrate/product pairs, linking fluorescent protein expression to enzyme activity can simplify and increase the throughput of screening assays. One such example was demonstrated in the work of Chen et al., in which the authors developed a screening platform based on ligand-mediated eGFP expression.33 They constructed a two-vector E. coli expression system, where one vector carried the enzyme of interest (arginine deiminase) and the other vector carried the genes for eGFP (expressed under the argG promoter) and argR, which acts as a repressor of the argG promoter in the presence of arginine. This system relied on the competition for arginine between conversion by arginine deiminase and binding by the arginine repressor. When inactive or low-activity enzyme variants were expressed, arginine bound to argR and together they repressed eGFP biosynthesis, while expression of high-activity enzyme variants depleted arginine, resulting in an increase in eGFP biosynthesis (Fig. 1d).
One more example of fluorescent protein expression linked to enzyme activity was reported by Sanchez and Ting34 for directed evolution of TEV protease toward increased kcat. They expressed TEV protease in the yeast cell and used a TEV cleavage sequence linked to the transcription factor, which was subsequently released and translocated to the nucleus to start transcription of a reporter protein, citrine. This screening method for TEV protease differed from a previous method reported by Yi et al.,35 who screened a TEV-P library for substrate specificity using yeast surface display. Yi applied endoplasmic reticulum (ER) sequestration, which allowed for simultaneous expression and localization of both enzyme and substrate library in the ER, followed by substrate surface display upon enzyme cleavage.
Another screening strategy relying on yeast surface display was developed for the directed evolution of bond-forming enzymes36,37 such as microbial transglutaminase, an enzyme with potential for antibody–drug conjugate synthesis. Deweid et al.37 displayed microbial transglutaminase on the yeast surface and used intrinsic lysine residues to form an isopeptide bond with a biotinylated oligopeptide. This scheme led to enzyme auto-labeling which enabled the screening of mutant libraries using increased selective pressure by reducing substrate availability.
Fluorescent proteins are not necessarily only used as reporters for enzyme activity, but their applicability as carriers for non-canonical amino acids in library screening of p-cyano-L-phenylalanyl aminoacyl-tRNA synthetase (pCNFRS) has been shown in the work of Kwok et al.38 The authors used a strain-promoted azide–alkyne click (SPAAC) reaction to distinguish superfolder GFP with incorporated p-azido-L-phenylalanine (pAzF), and based on the reaction selectivity they successfully evolved pCNFRS to preferentially incorporate pAzF from the mixture of pAzF and p-cyano-L-phenylalanine (pCNF).
The challenge of emulsion polydispersity has been addressed using well-controlled microfluidic-based emulsion production methods. Microfluidic systems allow for highly controlled water-in-oil droplet emulsification, and allow reagent addition by droplet fusion or micro-injection followed by droplet sorting.42 Fallah-Araghi et al. used a microfluidic system to compartmentalize single genes of β-galactosidase and amplify them by PCR before fusing the droplets with an IVTT mix and a fluorogenic substrate. Although microfluidic sorting rates can in some cases be 10-fold lower than typical FACS sorting rates, microfluidics offer a high level of control over the reaction volumes and conditions. Combining microfluidic-based water-in-oil and later water-in-oil-in-water emulsions with FACS sorting can significantly improve the speed of sorting as well as the enrichment factor.43
Microfluidic droplet-based screening relies typically on fluorogenic substrates, or alternatively on fluorescent reporter protein expression, an approach called affinity-fed translation (AFD).44 By using an enzyme that produces an amino acid, it was possible to control the expression of a reporter protein in an aqueous droplet with IVTT. The sensitivity of screening was improved by expressing the enzyme of interest and reporter protein simultaneously. Very recently a novel detection method was also introduced that coupled microfluidic screening and sorting with mass spectrometry,45 which is very powerful as it is chemically generalizable.
Analogous to substrate diffusion into the cell, researchers have investigated substrate delivery through the oil phase into aqueous droplets to precisely control the start of the enzymatic reaction and minimize background fluorescence for highly active enzymes.46,47 One of these works developed fluorescence droplet entrapment (FDE) substrates for three classes of enzymes (phosphotriesterases, esterase and glucosidases).47 The authors investigated the hydrophobicity of fluorogenic substrates in terms of permeability through water-in-oil-in-water emulsions, cell membranes, and diffusion outside the inner aqueous droplets using logD values as an evaluation parameter. In their work, Ostafe et al.46 used a substrate delivery system for glucose oxidase expressed on the yeast surface. Yeast cells were entrapped in water droplets and β-octylglucoside was added to the primary emulsion, where it underwent enzymatic cleavage by externally added β-glucosidase generating glucose. After the glucose became available for glucose-oxidase, cells harboring active variants were covalently labeled with fluorescein-tyramine and extracted from the emulsion droplets. Extraction of the covalently labeled cells simplified the FACS analysis compared to sorting water-in-oil-in-water double emulsions. Further improvements in screening enzyme variants with higher catalytic efficiency instead of the overall activity could be obtained by normalizing fluorescent signals to the expression levels of the enzyme. Normalization was done with either antibody labelling of the enzymes expressed on the cell surface,46 or using co-expression with a reporter gene.48,49
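The normalization step described above can be illustrated with a minimal sketch. It assumes that each sorted cell yields two background-corrected fluorescence readings, one reporting product formation and one reporting enzyme expression (for example from antibody labelling). The function name, parameters, and numerical values are hypothetical and are not taken from the cited works.

```python
def normalized_activity(activity_signal, expression_signal,
                        activity_background=0.0, expression_background=0.0):
    """Return activity per unit of expressed enzyme, or None if not expressed.

    Both signals are single-cell fluorescence readings in arbitrary units.
    The backgrounds are the readings of a negative control (assumed here).
    """
    expression = expression_signal - expression_background
    if expression <= 0:
        # No detectable enzyme: the ratio would be meaningless.
        return None
    return (activity_signal - activity_background) / expression


# Two hypothetical cells: B is brighter overall, but A is more active per enzyme.
cell_a = normalized_activity(activity_signal=1000.0, expression_signal=500.0)
cell_b = normalized_activity(activity_signal=1500.0, expression_signal=1500.0)
print(cell_a, cell_b)
```

Gating on this ratio rather than on the raw activity signal favours variants with higher activity per displayed enzyme over variants that are simply expressed at higher levels.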
Microfluidic approaches are also compatible with high-throughput screening and enzyme evolution using whole cells in droplets and emulsions.18,50,51 Cell recovery and post-screening analysis are more straightforward than for IVTT systems, with all the benefits that microfluidic emulsification brings. One of the first successful fluorescence-activated droplet sorting (FADS) experiments with whole cells was conducted on a model library of β-glucosidase expressed in E. coli,52 which was later used for sorting of the horseradish peroxidase library expressed on the yeast surface, enabling discovery of an enzyme variant with nearly diffusion-limited catalytic efficiency.53 Besides fluorescent detection, absorbance-activated droplet sorting (AADS) was developed and validated for whole cells in microfluidic droplets by sorting a phenylalanine dehydrogenase library.21 Absorbance as a detection method can significantly extend the scope of enzymatic assays that can be used in directed evolution, despite the lower sensitivity (compared to fluorescence) afforded by short microscale path lengths.
In work from our group, we developed hydrogel-based enzyme activity assays using Fenton chemistry to generate polymerization initiators.57–59 To adapt these approaches to library screening, we developed a hydrogel encapsulation system for screening GOx libraries expressed on the yeast surface for increased enzyme activity and stability.60 Cells expressing active enzyme variants were encapsulated in fluorescent alginate carrying phenol moieties that polymerized in the presence of H2O2 and HRP (Fig. 1e). By screening for variants that could encapsulate the cells following exposure to a denaturing agent, variants with higher stability and activity could be sorted and isolated by FACS. The main advantage of this system is that it allows screening of enzyme libraries in a pooled fashion. Since the radicals generated to initiate the polymerization reaction have limited stability in biological media, the polymerization remains localized to the cell surface. This represents a reaction-based compartmentalization approach and enables one-pot library screening, greatly increasing throughput. Other bottlenecks, including transformation efficiency and FACS, then determine the ultimate throughput of such systems. This hydrogel high-throughput screening represents the first system used for the direct screening of enzyme stability by flow cytometry, obtaining GOx variants with 13 to 15% increased thermal stability compared to the wild type enzyme. In addition, several advantages were shown when the alginate hydrogel was used for cell encapsulation compared to the previously described method based on Fenton chemistry.54 Alginate hydrogels are thick and robust, protecting the cells from osmotic lysis and allowing size-based filtration of encapsulated cells.55 In addition, the reaction mixture does not require multiple monomer and polymer components, but only a single fluorescent polymer.
Future work in our group on HRP-mediated alginate polymerization is focusing on screening for alternative reaction chemistries using enzymatic cascades that generate H2O2 as the final reaction product.
One of the first comprehensive mutagenesis scans on an enzyme was presented by Gray and colleagues, who screened variants of a dehalogenase enzyme using a multi-well plate assay to identify single mutants with higher thermostability.61 Since then, analogous screening workflows were applied to the detection of stability enhancing mutations of other enzymes such as xylanase62,63 and phytase64 and similarly to the study of enantioselectivity of a nitrilase catalyst by combining an MTP assay with mass spectrometry.65 Recently Fulton and colleagues reported a systematic study on a lipase A to determine the effect of single mutations on its detergent tolerance.66 Another leading example of linear scanning of sequence space supported by MTP screening was presented by Van der Meer and colleagues.67,68 In this work, single mutants of the enzyme 4-oxalocrotonate tautomerase (4-OT), a promiscuous catalyst of carbon–carbon bonding reactions, were screened for enhanced Michael-type addition activity and improved enantioselectivity. In addition, selected mutations impacting the enantioselectivity of the catalyst were combined, favoring the expression of multiple mutant variants producing products with further improved enantiopurity.
Typically, the first comprehensive scanning mutagenesis methods applied to enzymes focused mainly on the investigation of catalysts that could provide direct survival advantages or detectable phenotypic changes, making the screening of the variant library compatible with high-throughput technologies (i.e. plate survival assays and single cell sorting) without any further adaptation required. A pioneering work was the publication of a nearly complete functional map of the T4 lysozyme generated by amber suppression and tested through a plaque-forming plating assay.69 Several other works focused over the years on the study of the TEM-1 beta lactamase enzyme, which provides resistance against β-lactam antibiotics, with the resulting survival advantage serving as the selection mechanism.70–72 Other recent examples include mutagenic scanning studies on VIM-2 lactamase that involved scanning-based library construction methods combined with NGS,73,74 as outlined below.
DMS requires the analysis of several thousands of variants in a single campaign. For the evolution of binders, this approach has become quite common, relying on methods such as cell surface, ribosome and phage display as expression platforms to enable DMS. However, for enzymatic reactions the comparatively lower throughput of enzyme screening assays has resulted in fewer reports of the application of DMS. For some enzyme screening methods, the requisite library sizes for DMS (typically on the order of 10⁴ variants) are still too large.
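A short calculation shows where library sizes of roughly ten thousand variants come from. The sketch below counts all single and double amino acid substitutions for a protein of a given length, assuming the 20 canonical amino acids. The 500-residue length is a hypothetical example and does not correspond to any enzyme discussed here.

```python
from math import comb


def single_mutant_library_size(n_positions, alphabet_size=20):
    """Number of variants carrying exactly one amino acid substitution."""
    return n_positions * (alphabet_size - 1)


def double_mutant_library_size(n_positions, alphabet_size=20):
    """Number of variants carrying substitutions at exactly two positions."""
    return comb(n_positions, 2) * (alphabet_size - 1) ** 2


length = 500  # hypothetical enzyme length in residues
print(single_mutant_library_size(length))  # 9500
print(double_mutant_library_size(length))  # 45034750
```

The jump from thousands of single mutants to tens of millions of double mutants illustrates why comprehensive scans are usually restricted to single substitutions.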
Nevertheless, almost a decade ago the first investigation of an enzyme fitness landscape supported by deep sequencing appeared in the literature.76 The authors targeted TEM-1 beta lactamase and screened mutant libraries of the enzyme through selection on agar plates containing a fixed concentration of antibiotic. Sequencing of selected variants revealed positions with different inclinations to accept amino acid changes without impacting the enzyme activity. Along the same lines, Firnberg and colleagues reported mutational scanning of TEM-1 by screening a nearly comprehensive single-mutant library at 13 different ampicillin concentrations, thereby generating a detailed overview of the effects of each amino acid substitution on the overall protein fitness at different levels of selective pressure.77 Moreover, the authors analyzed and reported effects of silent DNA mutations on the stability and functionality at both RNA and protein levels, a novel aspect addressed later on in other works.78,79 Starita and colleagues adopted an auto-ubiquitination assay of phage displayed variants to explore the fitness landscape of E3 ubiquitin ligase.80 By deep-sequencing the library before and after selection, the authors found 25 single amino acid mutations that enhanced activity. Many mutations were located far from the catalytic site, and would have been difficult to predict using classical focused or random approaches for enzyme engineering at active site and first shell residues.
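Comparing sequencing reads before and after selection is commonly summarized as a per-variant enrichment score. The sketch below shows one widely used form, the log2 ratio of variant frequencies normalized to the wild type. It is an illustrative assumption and not necessarily the scoring scheme used in the studies cited above. The variant names, read counts, and pseudocount value are hypothetical.

```python
import math


def enrichment_scores(pre_counts, post_counts, wild_type="WT", pseudocount=0.5):
    """Log2 enrichment of each variant relative to the wild type.

    pre_counts and post_counts map variant names to read counts before and
    after selection. A pseudocount avoids taking the logarithm of zero for
    variants that disappear during selection.
    """
    pre_total = sum(pre_counts.values())
    post_total = sum(post_counts.values())

    def log_ratio(variant):
        pre = (pre_counts.get(variant, 0) + pseudocount) / pre_total
        post = (post_counts.get(variant, 0) + pseudocount) / post_total
        return math.log2(post / pre)

    wild_type_ratio = log_ratio(wild_type)
    return {v: log_ratio(v) - wild_type_ratio for v in pre_counts}


# Hypothetical read counts for the wild type and two single mutants.
pre = {"WT": 1000, "A42G": 100, "L76P": 100}
post = {"WT": 1000, "A42G": 400, "L76P": 10}
for variant, score in enrichment_scores(pre, post).items():
    print(variant, round(score, 2))
```

Positive scores indicate variants that outperform the wild type under the chosen selection, and negative scores indicate variants that are depleted.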
In work combining microfluidic approaches and DMS, Romero and colleagues developed a microdroplet encapsulation method for single cells expressing glucosidase variants that they coupled with fluorescence sorting. This DMS work revealed the effect of amino acid mutations on enzyme fitness and, by screening the mutant library following thermal stress, the authors discovered mutations enhancing enzyme thermostability.81
Coupling enzyme activity to the survival or growth rate of the host (i.e., selection) is also readily combined with DMS-based readout methods for characterizing phenotypic fitness. Klesmith and colleagues developed a microbial strain able to grow on levoglucosan as the sole organic source by linking the activity of the investigated enzyme, levoglucosan kinase, to the fitness of the host organism.82 Similarly, the activity of an RNA-guided endonuclease was linked to the survival of microbial cells through inducible expression of a toxic DNA gyrase inhibitor, which served as a selection system that was then used for DMS of a Cas9 enzyme.83
DMS was also specifically applied to the study of sequence determinants that impact protein solubility. TEM-1 and levoglucosan kinase activities were abolished through site directed mutagenesis and the effects of mutations on translation and folding of the proteins addressed. By coupling these new data to previous findings on activity for the same catalysts the authors built a fitness landscape including both properties and confirmed that shared mutations impacting positively both activity and solubility in an enzyme are rare.84
With the large datasets provided by DMS, researchers have begun to address more fundamental questions about enzyme function and evolution, such as how substrate choice can impact evolutionary trajectories of enzymes or how the evolutionary trajectory is influenced by the strength of the selective pressure. Along this line of research, Melnikov and colleagues demonstrated how the fitness landscape of an enzyme varies significantly depending on the nature of the selection system. They constructed 6 parallel fitness landscapes of a Tn5 transposon-derived kinase that confers resistance to antibiotics and used 6 structurally distinct substrates at increasing concentrations for the selection.85 The authors identified protein residues responsible for orthogonal activity on different substrates without any additional support from structural or in silico analyses. Furthermore, TEM-1 beta lactamase was again used as a model enzyme by Stiffler and colleagues to study the connection between evolvability and robustness of a fitness landscape. By exposing a comprehensive single mutant library of TEM-1 to selective pressure at increasing concentrations of its natural substrate ampicillin as well as on a new substrate cefotaxime, the authors concluded that the robustness of a sequence (i.e. its capacity to accept and tolerate mutations without impacting the function) strongly depends on the strength of the selection used and that its divergent evolvability towards the use of new substrates is facilitated at lighter selective conditions rather than under strong selective pressure.86 This intuitively makes sense under the consideration that most mutations are neutral or deleterious.
A key interest among protein biochemists is to gain a deeper understanding of how enzymes encode substrate specificity. Several directed evolution campaigns have been successful in tuning or changing substrate specificity of catalysts. Nevertheless, these works typically explore mutations in the vicinity of the substrate binding pocket and ignore mutations at distal residues. In recent work, Wrenbeck and colleagues applied DMS to study the fitness landscape of an amide hydrolase, linking its activity to the growth rate of the host cells and screening a nearly comprehensive single mutant enzyme library using three different amides as substrates.78 This work showed that mutations beneficial for a specific substrate are often not proximal to the catalytic site, once more demonstrating the advantages of systematic scanning methods such as DMS in comparison to random or rational approaches. Furthermore the authors concluded that screening of an enzyme mutant library using different substrates produces unique fitness landscapes with profound differences, emphasized even more when molecules with significant structural divergence are used.
Currently, DMS supported by NGS and high-throughput screening platforms represent the most advanced pipelines to engineer enzymes and explore enzyme sequence and function. Nevertheless, the application of these methods to the study of enzymes suffers from a mismatch in throughput, with state-of-the-art NGS throughput surpassing the best and fastest screening methods by several orders of magnitude. In fact, while we have witnessed over the past years steady advancements in quality and efficiency of sequencing technologies, no striking breakthroughs have been registered for the development of equally important high-throughput screening for many categories of enzymes. This throughput bottleneck has motivated the development of computational and machine learning approaches that can be trained on limited experimental data and interpolate accurate phenotypes from input sequences, as described below.
The generic pipeline of a protein engineering campaign supported by machine learning consists of the generation of experimental data representing sequence-phenotype pairs, and training a statistical or machine learning model to predict phenotypes from input sequences never assayed before. The phenotypic property of interest can be chosen from several features, including catalytic properties,98,99 substrate affinity, stability,100–102 and expression level in the host organism. Prominent examples of machine learning assisted protein engineering include membrane channel engineering,103,104 protein structure prediction105,106 and protein–protein interactions.107–109 Although we cannot hope to comprehensively cover this exciting and rapidly developing field in this review, we would like to outline below some basic considerations in applying machine learning with a focus on appropriate methodology for enzyme engineering.
In order to apply machine learning to enzyme engineering, the amino acid sequences need to be converted into numerical arrays. Different methods are available to accomplish this task. They range from tabulations of single amino acid physical parameters (solubility, charges, pKas, etc.) to combinations of such parameters, as in AAindex,110 a collection of indices, mutation matrices and statistical protein contact potentials, or T-scale,111 which uses principal component analysis to reduce the dimensionality of topological and structural data of 135 amino acids. Another well-established method, popular because of its simplicity, is to convert each amino acid into a bit-based vector that can be more sparse (one-hot encoding) or less sparse (binary numbers). Representations of amino acids can even be actively learned in so-called embeddings.112
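The two bit-based encodings mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the alphabetical residue ordering and the 5-bit width of the dense encoding are choices made here for the example.

```python
# Illustrative bit-based encodings of protein sequences.
# All names and the residue ordering are assumptions of this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues, one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Sparse encoding: 20 bits per residue, exactly one of them set to 1."""
    encoded = []
    for residue in sequence:
        if residue not in AA_INDEX:
            raise ValueError(f"Unsupported residue: {residue!r}")
        bits = [0] * len(AMINO_ACIDS)
        bits[AA_INDEX[residue]] = 1
        encoded.extend(bits)
    return encoded


def binary_encode(sequence, n_bits=5):
    """Denser encoding: each residue index written as an n_bits binary number."""
    encoded = []
    for residue in sequence:
        if residue not in AA_INDEX:
            raise ValueError(f"Unsupported residue: {residue!r}")
        index = AA_INDEX[residue]
        # Most significant bit first; 5 bits are enough for 20 residues.
        encoded.extend((index >> shift) & 1 for shift in reversed(range(n_bits)))
    return encoded


peptide = "MKV"  # toy sequence, not taken from any study
sparse = one_hot_encode(peptide)   # 3 residues x 20 bits
dense = binary_encode(peptide)     # 3 residues x 5 bits
```

The one-hot vector grows by 20 entries per residue while the binary form grows by 5, which is the sparsity trade-off described in the text.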
Next, a model type needs to be chosen according to the nature of the problem. If the fitness measure under study varies continuously and cannot be naturally grouped into categories, a regression model may be appropriate. If, instead, the studied population falls into well-defined fitness categories, the model of choice would be a classifier. One practical approach is to start with simple linear models and move to more complex ones only if non-linearity is needed to describe the system under study.
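As a hedged illustration of starting with a simple linear model, the sketch below fits a closed-form ridge regression to synthetic data and then thresholds the continuous predictions to obtain two fitness categories. The data, the additive effects, the regularization strength `alpha` and the threshold are all invented for the example, and NumPy is assumed to be available.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha * I)^-1 X^T y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ y)


def predict(X, weights):
    """Linear prediction of a continuous fitness value (regression)."""
    return np.asarray(X, dtype=float) @ weights


def to_classes(scores, threshold):
    """Map continuous predictions to two categories (1 = above threshold)."""
    return (np.asarray(scores) > threshold).astype(int)


# Synthetic stand-in for encoded variants: 40 variants, 6 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(40, 6)).astype(float)
true_weights = np.array([1.0, -0.5, 0.0, 2.0, 0.0, -1.0])  # invented effects
y = X @ true_weights + rng.normal(scale=0.05, size=40)     # small assay noise

weights = fit_ridge(X, y, alpha=0.1)
predicted = predict(X, weights)
labels = to_classes(predicted, threshold=1.0)
```

Because the synthetic phenotype is purely additive, the linear model recovers the effects well; systematic residuals on real data would be one signal that a non-linear model is worth trying.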
Another aspect to consider when establishing the ML workflow is the quantity of data that needs to be processed. Kernel-based methods such as support vector machines are well suited to handling hundreds of data points, while neural networks are more suitable if the amount of data is on the order of hundreds of thousands or even millions of experimental sequences.
A key step consists of separating the data into different sets. The training set includes the majority of the data (usually 70–80%) and is used to train the model, that is, to learn the parameters that best predict the response variable of interest. The validation set is the second largest (10–20%) and is used to tune the settings that control model complexity, known as hyperparameters. This is crucial to avoid underfitting, where the true behavior of a system is not described sufficiently, as well as overfitting, where the model fits the training set extremely well but fails to generalise to other points. Last but not least, it is important to set aside a small portion of the data (∼10%), which must never be used during training, to test how well the model interpolates and, eventually, extrapolates to unexplored sequence space. Once a working trained model is available, in silico screening of a large number of candidate sequences not present in the initial dataset can be used to evaluate and rank candidates. Subsequently, the best candidates discovered in silico can be synthesised in the lab, characterized, and eventually included in future iterations of model training. This approach can ultimately lead to massive savings in resources, money and time. For further explanations on model heuristics and exemplary case studies, we refer the reader to the work of Yang and colleagues.113
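The data separation and the subsequent in silico ranking step can be sketched as follows. This is a minimal example assuming a 70/20/10 split; the function names, the placeholder dataset and the toy scoring function are hypothetical and stand in for real assay data and a trained model.

```python
import random


def split_dataset(pairs, train_frac=0.7, val_frac=0.2, seed=0):
    """Shuffle (sequence, phenotype) pairs and split them into three sets."""
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]                      # fit model parameters
    validation = shuffled[n_train:n_train + n_val]  # tune hyperparameters
    held_out = shuffled[n_train + n_val:]           # never used in training
    return train, validation, held_out


def rank_candidates(candidates, predict_fn, top_k=3):
    """Score unseen sequences with a trained model and return the top_k."""
    scored = sorted(((predict_fn(seq), seq) for seq in candidates), reverse=True)
    return [seq for _, seq in scored[:top_k]]


# Placeholder data: 100 fake variants with made-up phenotype values.
pairs = [(f"variant_{i}", float(i)) for i in range(100)]
train, validation, held_out = split_dataset(pairs)

# Toy stand-in for a trained model: counts lysines in the sequence.
best = rank_candidates(["AKK", "KKK", "AAA", "AAK"], lambda s: s.count("K"), top_k=2)
```

In an iterative campaign, the sequences returned by `rank_candidates` would be synthesised and assayed, and the new measurements appended to `pairs` before the next round of training.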
An early theoretical work from Fox114 reported the application of partial least squares (PLS) regression coupled with genetic algorithms to improve directed evolution outcomes. Similarly, another PLS-based algorithm was implemented in more recent work by Cadet and colleagues,94 who processed the data with a fast Fourier transform (FFT) and used PLS to develop a predictive model for the improvement of epoxide hydrolase.
Romero and coworkers115 used a Gaussian process to optimize P450 enzymes for thermostability and activity. Specifically, they fit a model using a library containing 261 sequences. Using this model, they were able to identify and synthesize a variant with a ∼9 °C improvement in thermal denaturation temperature when compared to a previously engineered variant obtained by classical directed evolution. The Gaussian process has the advantage of quantifying uncertainty in regions that were not explored experimentally; however, it is computationally expensive and its training time scales poorly for large datasets.
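To make the uncertainty estimate and the scaling issue concrete, the following sketch implements basic Gaussian process regression with NumPy. It uses a generic squared-exponential kernel on one-dimensional toy inputs, which is an assumption of this example and not necessarily the kernel or representation used in the cited work; the `length_scale` and `noise` values are likewise arbitrary.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gp_predict(X_train, y_train, X_new, length_scale=1.0, noise=1e-2):
    """Posterior mean and standard deviation of a zero-mean GP at X_new."""
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_new, X_train, length_scale)
    L = np.linalg.cholesky(K)  # cubic cost in the number of training points
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    var = 1.0 - np.sum(v**2, axis=0)  # prior variance of this kernel is 1
    return mean, np.sqrt(np.maximum(var, 0.0))


# Toy data: three "measured" points and two query points.
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.sin(X_train[:, 0])
X_new = np.array([[1.0], [10.0]])  # one near the data, one far from it
mean, std = gp_predict(X_train, y_train, X_new)
```

The predicted standard deviation is small at the query point that coincides with a training point and returns to the prior value far from the data, which is the behaviour that makes Gaussian processes useful for deciding where to sample next. The Cholesky factorization is the step whose cost grows cubically with the number of training sequences.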
We have identified several trends representing state-of-the-art modern methods for enzyme engineering and evolution, including novel reaction cascades for readout, approaches to reaction compartmentalization to achieve a genotype–phenotype link, and systematic scanning mutagenesis and deep mutational scanning approaches for generating mutability maps. Improvements in ultrahigh-throughput enzyme screening technology and DNA sequencing have significantly increased the amount of experimental data that can be obtained from directed evolution experiments. However, big data remains extremely costly and time consuming to obtain. Computational tools such as machine learning, which extract the most value from these datasets, will become more important over time as screening capacity, DNA sequencing throughput and computational power increase. In such a scenario, machine learning approaches will considerably improve directed enzyme evolution by substantially lowering the time and resources needed to achieve a desired activity level, or by significantly increasing the performance of enzymes that are engineered at a fixed cost level. By manipulating genes encoding catalytic enzymes, protein engineers can push these molecules to new levels of fitness and stability and help them reach their full potential.
Footnote
† Authors contributed equally.
This journal is © The Royal Society of Chemistry 2022