Yutao
Kuang
,
Junshan
Lai
and
Jolene P.
Reid
*
Department of Chemistry, University of British Columbia, 2036 Main Mall, Vancouver, British Columbia V6T 1Z1, Canada. E-mail: jreid@chem.ubc.ca
First published on 17th January 2023
Organometallic intermediates participate in many multi-catalytic enantioselective transformations directed by a chiral catalyst, but the requirement of optimizing two catalyst components is a significant barrier to widely adopting this approach for chiral molecule synthesis. Algorithms can potentially accelerate the screening process by developing quantitative structure–function relationships from large experimental datasets. However, the chemical data available in this catalyst space is limited. Herein, we report a data-driven strategy that effectively translates selectivity relationships trained on enantioselectivity outcomes derived from one catalyst reaction systems where an abundance of data exists, to synergistic catalyst space. We describe three case studies involving different modes of catalysis (Brønsted acid, chiral anion, and secondary amine) that substantiate the prospect of this approach to predict and elucidate selectivity in reactions where more than one catalyst is involved. Ultimately, the success in applying our approach to diverse areas of asymmetric catalysis implies that this general workflow should find broad use in the study and development of new enantioselective, multi-catalytic processes.
On the basis of our recent efforts to deploy comprehensive multidimensional analysis to develop and leverage general mechanistic models,28,29 we became interested in investigating if our multireaction workflows can be embedded in the optimization and quantitative prediction of reaction systems involving two catalysts (Fig. 1A). In this approach the features of all the reaction components are correlated to the experimentally obtained enantioselectivity outcomes conveyed as ΔΔG‡ using density functional theory (DFT) calculated descriptors and multivariate linear regression (MLR). Essentially, these techniques permit the development of mechanistically informative correlations providing the basis to transfer enantioselectivity outcomes to predict the impact of reaction components not included in the initial training correlation, like a new catalyst or substrate.
In the context of this study, the success of this approach to predict synergistic catalysis would be contingent on the model's ability to transfer the stereochemical information from one or more reactions facilitated by a single catalyst to another mechanistically similar process that involves two catalyst systems. Although multi-catalyst reaction designs share some common mechanistic features with single catalyst systems (e.g., optimal chiral catalyst structure), comparative studies that would reveal reaction specific contacts and general connections have not been performed. Such investigations would be valuable for formalizing mechanistic principles and considering the limits of model generality. As a result, despite the practical appeal of an approach that would preclude the requirement for explicit chemical data on synergistic reaction systems, the applicability domain of such statistical models would likely be challenging to estimate. To that extent, we envisioned that the general mechanisms of stereoinduction in asymmetric catalysis should extend to multi-catalyst reaction strategies that focus on combining a chiral catalyst with a reactive intermediate that is generated from a second achiral catalyst species. Because these types of transformations are not significantly affected by the presence of a second catalyst, they should be particularly amenable to our modelling approach that uses data from one or more reactions to predict the result of a similar system (Fig. 1B).
To effectively assess how broadly applicable this approach could be we decided to select three case studies that encompass different modes of organocatalysis. Since statistical models describing the nucleophilic additions to iminiums catalysed by chiral phosphoric acids and phosphates are easily accessible through previous reports these were both identified as suitable case studies for an initial evaluation. The second criteria in selecting a predictive platform is determining a chiral catalyst system that has been widely used in synergistic catalyst space such that significant validation data exists. Consequently, we identified reactions involving secondary amines as a third study. To this end, we develop and deploy MLR as a transferability method to achieve quantitative predictions and mechanistic analysis in diverse synergistic catalyst space.
In seeking an ambitious and relevant first test, we selected the hydrogenation of imines using molecular hydrogen.31 Since this process cannot be simply facilitated solely by chiral organic molecules,32 approaches have focused on two catalyst systems. Beller and co-workers demonstrated that Knölker's complex, a simple achiral iron hydrogenation catalyst, can be used in combination with a chiral Brønsted acid to provide enantioenriched secondary amines.31 Considering the overlap in structural features of the reaction components we anticipated that a previously generated statistical model constructed of chiral phosphoric acid catalysed additions of nucleophiles to imines could be deployed to predict the reaction outcomes (Fig. 2).30 In the previous study, reaction performance was first evaluated using a comprehensive model built from the entire data set constructed of reactions that proceeded through two different pathways, an E (+ee) or Z (−ee) transition state. While prediction errors were typically larger with this all-inclusive model, its use is imperative to determine the stereochemistry of the final product and the pathway under operation. Use of this inclusive model is particularly important for predicting reactions involving ketimines as these can progress through either the E or Z iminium geometries making it difficult to determine the favoured reaction pathway. This is in contrast to imines derived from aldehydes which have been shown to proceed solely through structures possessing the E configuration.11 Following the E or Z mechanistic assignment, the prediction accuracy can be improved by applying the mechanism specific models (E or Z). Because two different data sets are used in the construction of these statistical models the included terms and final predictions are slightly different. For example, the comprehensive model emphasizes solvent (black), imine (blue), nucleophile (green) and catalyst (red) terms distributed over six parameters, as contributors to the enantioselectivity across seventeen reaction types. These parameters capture the general steric and electronic influences of the individual reaction components on the experimental outcome allowing for accurate out-of-sample prediction and further mechanistic interpretation. Focused correlations can then be produced by modeling only a subset of these reactions to reveal more intricate mechanistic details through better feature selection. In other words, truncating the data set will facilitate the identification of the structural features that affect particular mechanistic pathways (E or Z) allowing better predictions to be achieved. Although having this option to model only portions of the data has been proven to be beneficial, it is not common or necessary. The comprehensive model does not naturally allow for the prediction of stereochemistry but the product configuration can be assigned by applying the simple models displayed in Fig. 2. These are based on the amine product generated from an E or Z TS and catalysed by the (S)-CPA. The standard steps for ee prediction and the assignment of product stereochemistry include: (1) locating the ground state of the targeted reaction variable by DFT, (2) obtaining the key molecular features necessary for prediction, and (3) submitting these to both mathematical equations.
To put it generally, the application of these models to synergistic reactions requires the second achiral catalyst to be featurized as an electrophile or nucleophile. This can be determined by considering the reaction mechanism and the structures involved in the key transition states. On this basis, to predict the enantioselectivity of this reaction type the achiral iron complex is categorized as the nucleophile and the necessary parameters, the H–X–CNu (the nucleophile angle measurement) and B5Nu (the nucleophile steric descriptor) parameters are to be collected from this structure. For consistent results, the same level of theory that was used to optimize the nucleophiles incorporated in the original model should be applied to optimize the iron complex (organometallic nucleophile); however, the M06-2X density functional that was implemented is not applicable for molecules containing metals.33 Instead, we employed M06/def2TZVP calculations which are suitable for organometallic systems, and when tested against a subset of nine nucleophiles, geometry minimizations provide the same value for the key bond angle and Sterimol B5 nucleophile terms when compared to M06-2X/def2TZVP (average deviation was calculated to be 0.8° and 0.02 Å, respectively). Confident that this adjustment would not significantly impact the results, we optimized the iron complex with this set of computational conditions and collected the necessary Sterimol B5 and angle nucleophile parameters for prediction from this structure. In other words, this computational method comparison suggested that a predictive model built from M06-2X descriptors could be used to predict the impact of hydrogenation reactions given the key nucleophile parameters calculated at the M06 level. Taking these steps, we evaluated the twenty reported hydrogenations involving aromatic imines catalysed by TRIP. Both the catalyst and most of the imines were part of the published training set making the nucleophile (achiral iron complex) the only component not to be explicitly included. Each result was predicted using the comprehensive model, with an average absolute ΔΔG‡ error of 0.64 kcal mol−1 (14 examples within 5% ee) and the absolute stereochemistry was correctly assigned as S, demonstrating the ability of the model to extrapolate effectively to an organometallic nucleophilic intermediate (Fig. 3). A slightly improved outcome is observed using the Z-imine mechanistic model with a 0.48 kcal mol−1 average error (15 examples within 5% ee). Considering this, the average error is slightly inflated and inspection of the maximum errors shows that this is due to two reactions that had prediction errors >1 kcal mol−1. In these cases the reaction performed less well than expected considering the conditions employed are associated with very high enantioselectivities. Accordingly, the model cannot capture the few results where the general enantioselectivity trends do not translate. Mechanistically, the ability to extrapolate to complex multi-catalytic reaction space suggests that transition state features like the arrangement of the reactants and hydrogen bonding contacts to the catalyst are similar to those found in one catalyst systems. This observation is consistent with previous computational studies which show an iron-phosphoric acid mediated hydrogenation with a similar substrate.34
Fig. 3 Extrapolation of a previously reported chiral Brønsted acid imine reaction model to iron catalysed hydrogenation. |
Inspired by these successful results, we selected to further evaluate the generality in our observations by investigating chiral phosphate catalysed reaction systems.29 In considering this we noted that the addition of 2-naphthols to gold(I) activated allenamides exhibited overlapping transition state features with our previously built statistical model, i.e., combines an iminium with a nucleophile in the presence of a chiral phosphate (Fig. 4).35 As with the chiral phosphoric acid study, to deploy the published chiral phosphate model to predict the impact of utilizing an organometallic intermediate as a reaction component, the sensitivity of the previously identified parameters to the computational method must be taken into account. Guided in part by the proposed transition state, the second achiral catalyst species was to be combined with the allenamide to form a cationic intermediate which undergoes nucleophilic addition in the enantiodetermining step. As such, the necessary parameters NBOS (the NBO charge of the first atom on the smallest iminium group), Sterimol B1 of the large iminium group (B1L) and polarizability that represent the electrophile components are to be collected from these organometallic structures. Because we expected NBO charges to be sensitive to the computational methods employed, we re-optimized each iminium intermediate at the M06/def2TZVP level to ensure that all electrophilic components (organic and organometallic) were treated uniformly. After replacing the iminium parameters (NBO, Sterimol B1 and polarizability) calculated at M06-2X/def2TZVP with those acquired from M06/def2TZVP optimized structures, the statistical model was re-created in MATLAB using the same enantioselectivity data and identified parameters from the previous publication (Fig. 4).29 This model could then be deployed to predict the organometallic data set (Fig. 5A). However, this is a more challenging scenario, as the structural overlap between the training and the prediction set is slightly reduced. More specifically, the reaction components to be predicted are not explicitly included in the training data but belong to general subclasses of iminiums, naphthols, and chiral phosphates. Following calculations, the key iminium parameters (NBO, Sterimol B1 and polarizability) were collected from the electrophilic gold complex and inputted into the model for prediction. Again, accurate predictions were construed with the statistical model with an average absolute ΔΔG‡ error of 0.33 kcal mol−1 (4 examples within 2% ee and 6 examples within 5% ee). Like the previous test, one poorly predicted reaction (with an error of around 1 kcal mol−1) inflated the average mean error. Again, this situation arises from a reaction that should be high-performing given the general enantioselectivity trend. The presence of a methyl group had the most detrimental effect on selectivity (4% ee experimental and −11% ee predicted) and excitingly, this could be accurately captured by the model (Fig. 5B). While it is superficially surprising that the model can successfully anticipate significant enantioselectivity changes due to minor substrate modifications (i.e., switching a phenyl for a methyl), close examination of key parameters in the model reveals that the lower enantioselectivity for this substrate can be attributed to the more positive NBO (Fig. 5B, gold intermediate A). This is intriguing as in both cases the atom remains a hydrogen suggesting the model is describing a subtler effect on the enantioselectivity outcome. Perhaps the most powerful analysis of the model is illustrated by comparing the substrate profiles of the one (chiral phosphate only)36 and two (chiral phosphate combined with gold) catalyst systems. Remarkably, the optimal iminium intermediate was reversed between the two methods. In other words, the lead substrate with chiral phosphate catalysis failed to provide high enantioselectivities under the gold conditions and vice versa. Once again, the model clearly explains why certain substrates should be particularly amenable to different protocols. Under chiral phosphate catalysis the NBOS values are comparable for the two substrates (0.246, Ph and 0.247, 4-CF3Ph) and the difference in polarizability, the second important iminium term, explains the contrast in enantioselectivity. Under gold catalysis, the more negative NBO values associated with gold intermediate B, largely compensate for a slightly lower polarizability term, and the ee is increased (Fig. 5B). This demonstrates that gold binding to the substrate and changes to the N-substituent (i.e. switching a Ph to a Me) alters the electron density of the iminium hydrogen, a key site for establishing non-covalent interactions with the catalyst,29 ultimately suggesting that the iminiums ability to engage in weak hydrogen bonding interactions is attenuated allowing for these substrate effects to emerge. The ability of the model to accurately reflect the outcomes with different substrates suggests that it could be applied to guide successful reaction scope extension. Based on the previous evaluation and the included model parameters we considered that the model could be reliably broadened to include 1-naphthols as the nucleophile component. By deploying the descriptor set and the training model, the resultant extrapolation of the nucleophile space predicted only moderately good enantioselectivities, an observation validated by new experiments (Table 1). This result is compelling in that we could reach an informed decision about pursuing 1-naphthols as a substrate class while also providing a useful starting point for further reaction optimization. Ultimately, this demonstrates that the model’s capabilities are not limited to classifying and predicting literature data sets but can be applied to analyze and predict new reactions even in complex multi-catalytic reaction space. Because the models are only capable of predicting enantioselectivities, one limitation to acknowledge is that the tools can only guide users in substrate scope expansion where the desired reactivity is more certain. This is exemplified by the fact that no reactivity was observed with 2-phenylcyclohexanone under our conditions although these substrates work effectively with chiral phosphate only protocols (see the ESI†).37
Entry | CPA | L | Yieldb (%) | eec (%) | Predicted ee (%) |
---|---|---|---|---|---|
a Reactions were run with the following conditions: allene substrate (0.15 mmol), naphthol (0.1 mmol), gold phosphine (5 mol%), chiral silver phosphate (5 mol%), toluene (1 mL), rt, 16 h. b Isolated yields are given. c Enantioselectivities (ee) were measured by HPLC. | |||||
1 | 9-Anthryl | PPh3 | 68 | 76 | 62 |
2 | 9-Anthryl | JohnPhos | 66 | 74 | 67 |
Despite the potential for extensive catalyst structure modulation, only a limited set of secondary amines have witnessed broad application. This is in contrast to the many other catalyst chemotypes employed in asymmetric synthesis where necessary and extensive optimization efforts have generated considerably sized catalyst libraries. Thus, the most significant challenge in the early stages of implementing our workflow was defining a useful data set containing both high and low enantioselectivities for model construction. Consequently, to supplement our data mining efforts on published data from scientific journals, we explored the use of publicly available PhD theses. Because the reported data meet the degree requirements for characterisation PhD theses typically contain experimental data of high quality. But they remain unpublished presumably because the research objective of delivering the product in high enantioselectivity was not met.38 Accordingly, we postulated that these data could be a targetable source of negative results required for robust model building. Throughout this literature evaluation, we strategically avoided two types of reaction examples. First we ignored reactions that showed product racemization to be strongly contributing to the overall enantioselectivity outcome (i.e. time and temperature sensitive).39,40 In the absence of strongly supporting experimental data, it is only possible to minimize rather than eliminate the influence of such effects in our analysis through the removal of unusual experimental results.41 Consequently, some proportion of variation between measured and predicted enantioselectivity values will likely be attributable to these factors in addition to experimental and analytical error. Secondly, we did not include examples that combined proline type catalysts with reactants that did not contain strongly electronegative atoms. In these cases, the structure of the reactant would make it difficult to determine if hydrogen bonding was directing its approach and therefore, hard to assign the mode of enantioinduction (in more detail below).
On this basis, to construct a predictive model, an expanded inventory of 452 reactions with varied components was curated from multiple sources (for a list of references see the ESI†). From this survey, we categorized the dataset by general catalyst structure which is a significant factor in determining the mode of enantioinduction (steric blocking or hydrogen bonding) wherein imidazolidinone42 and diphenylprolinol ethers43 are grouped by a +ee value and proline type catalysts, a −ee value (Fig. 6). Therefore, the sign of the ee represents one of two transition state (TS) categories, depending on the catalyst involved. This is important in understanding the product enantioselectivities, because reactant addition to the top or bottom face will produce opposite enantiomers. Accordingly, the statistical model will be able to determine whether the reaction proceeds through steric blocking (predicts + ee value) or hydrogen bonding modes of enantioinduction (predicts −ee value) and this information can be used to make predictions about the absolute configurations by using the qualitative pictures shown in Fig. 6. Furthermore, the TS-guided categorization strategy is useful in producing a well-distributed data set which would be hard to achieve by not taking into account the absolute product stereochemistry and demonstrated to be an effective technique in other studies.29,30 Next, a diverse array of molecular descriptor values were collected from DFT optimized geometries to describe the overlapping structural features of each electrophile, nucleophile, catalyst, solvent, and co-catalyst.44,45 Because this model is being built with the sole aim of predicting reactions involving an organometallic intermediate, we naturally choose the appropriate computational methods from the beginning. This involved optimizing the reactant at the M06/def2TZVP level and all other components with M06-2X/def2TZVP. The commonality in the substrate and catalyst substructure allowed collection of natural bond orbital (NBO) charges and Sterimol values from the conserved portions. However, the nucleophile had a minimal structure overlap; thus, polarizability, highest occupied molecular orbital (HOMO), and lowest unoccupied molecular orbital (LUMO) energies, which do not rely on common substructures, were collected to describe this component. Unfortunately, the lack of consistency in the reaction conditions renders the identification of readily comprehensible and extensive parameter sets for the remaining components a challenge. For example, several reactions required Brønsted acid co-catalysts and employed solvent mixtures while many others did not. Guided by the proposed mechanism of catalysis, we postulated that in cases where the acid additive was absent, the proton could originate from another source, a reagent or catalyst, and relevant descriptors could be collected from these components. Because solute–solvent interactions with polar substances will likely dominate over those with non-polar molecules, we collected topological, two-dimensional descriptors from the solvent with the largest dielectric constant (see the ESI†).46,47
Prior to model building, the data set was partitioned into 80:20 training:validation sets using MATLAB's deterministic equidistant splitting function. Linear regression algorithms were then applied to the training set (80% of the entire data set that incorporates both + ee and −ee reactions) to identify prospective correlations between the molecular structure of every reaction variable defined by the parameters collected in the previous step of the workflow and the measured enantioselectivity, ΔΔG‡ (where ΔΔG‡ = −RTln(e.r.) and T is the temperature at which the reaction was performed). Since the training set includes significant diversity in the reaction component structure and mechanism, we anticipated that several descriptors would be required to achieve predictive correlations. Using forward stepwise linear regression48 we determined a model that includes solvent (black), substrate (blue), reactant (green), and catalyst (red) terms distributed over thirteen parameters to be appropriate. Despite the high R2 value and validation scores, a relatively small number of outliers appeared at around 0 kcal mol−1 on the x-axis. Essentially, these correspond to a few reactions that provided almost racemic mixtures in the experiment. Such unique reaction features will not conform to trends revealed by comprehensive MLR models as these operate by linking reactions via general connections i.e., structural effects that apply to the majority of reactions included in the data set.
Previous computational studies show that enantioselectivity arises from the geometry of the enamine/iminium double bond (s-trans or s-cis) and the approach of the reactant (top or bottom).49 Therefore, it is possible that the mathematical model also reveals some of these mechanistic features despite the complex equation. Notably, the catalyst descriptors have the largest coefficients in the normalized equation, demonstrating that stereocontrol is dominated by catalyst architecture for this class of reactions. The presence of B5C1(up) lends to a straightforward analysis by implying that larger substituents at this position makes reactant approach to the double bond from the top less possible. Additionally, large groups at C1 would direct the double bond to occupy the opposite side of the catalyst thereby favoring the formation of the s-trans isomer. We interpreted the inclusion of the NBOC1 term as a categorical descriptor that essentially highlights that proline type catalysts (typically negative NBOC1 and −ee) direct the reactant to the top via a hydrogen bonding interaction whereas steric blocking catalysts (usually positive NBOC1 and +ee) that incorporate large alkyl or aromatic groups at this position promote reaction on the opposite face. Importantly, the presence of B5C1(down) with a negative coefficient likely indicates that TSmajor is also sensitive to the catalyst features. In other words, larger substituents at this position may enhance repulsive interactions between the catalyst and reactant in the TS that forms the major product, ultimately favouring the formation of the opposite enantiomer. Indeed, imidazolidinone catalysts that have two large groups at the C1 position generally provide lower levels of enantioselectivity supporting this assertion. The s-trans/s-cis isomer ratio also depends on the substrate, and having inversely sized groups on either side of the carbonyl will strongly reinforce the preference for the s-trans. This is expressed by the LS model term and reflected in the lower enantioselectivities obtained for ketones compared to aldehydes. Because ketones are predominantly used in combination with proline (i.e., resulting in −ee reactions) the coefficient associated with LS is positive. The role of the reactant is described through polarizability which likely acts as a proxy for chemical size (see the ESI†) and the HOMO energy. The relationship also includes three solvent parameters with relatively small coefficients, suggesting that most solvents are compatible and the assortment of optimal solvents is a reflection of reaction component solubility.50,51 Although relating the correlation to previous findings demonstrates that the model also provides insightful mechanistic information, another important test is to remove correlated parameters and replace these with dummy values. This test is more important in this example as many parameters are being used to fit the data. Model building with these meaningless reaction barcodes shows the statistical scores to be much worse (R2 = 0.69, Q2 = 0.65, LOO = 0.63, and 10-fold CV = 0.62), suggesting that the descriptors are correlating something meaningful (see the ESI† for more details).
Considering that the goal of the prediction analysis is to transfer enantioselectivity trends of one catalyst systems to multi-catalysis, the next step is validating the secondary amine model’s ability of extrapolating to new reaction types involving a single catalyst (Fig. 7). Thus, for each out-of-sample prediction platform, both the catalyst and substrate are contained in our training set (see the ESI† for additional out-of-sample predictions). It should be noted that the model can only make predictions about the reaction enantioselectivity and not the diastereoselectivity if two stereocenters are created. This second aspect of selectivity arises from the orientation of the reactant substituents relative to those on the enamine/iminium, and this may be governed by a different set of molecular features. However, diastereoselectivity is typically not the experimental output that requires significant optimization and high levels are usually observed regardless of the reaction conditions. Thus, the prediction of diastereoselectivity is not crucial for reaction development, providing an incentive to exclude this output from our regression analysis. As a first assessment, we evaluated the ability to predict nine hydroxylation reactions, involving an oxime and diphenylprolinol ether.52 This set was predicted accurately, with an average absolute ΔΔG‡ error of 0.12 kcal mol−1 (eight examples predicted within 5% ee). By using the simple reaction model presented in Fig. 6, the absolute configuration is correctly assigned as R. As a second case study, the model was assessed in the same manner with seven proline catalyzed amination reactions involving an azodicarboxylate.53 Again accurate predictions were obtained with this statistical model (ΔΔG‡ error of 0.19 kcal mol−1, six examples predicted within 5% ee) with the qualitative diagram shown in Fig. 6 confidently determining the stereochemical outcome to be R. The small maximum observed error in both cases shows that all of the reactions were well predicted by the model.
Fig. 7 Demonstration of mechanistic transferability by predicting enantioselectivity outcomes involving the hydroxylation and amination of aldehydes. |
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2sc05974f |
This journal is © The Royal Society of Chemistry 2023 |