V. Diana Rakotonirinaa,
Marco Bragatob,
Stefan Heinenc and
O. Anatole von Lilienfeld*acdefgh
aDepartment of Materials Science and Engineering, University of Toronto, St. George Campus, Toronto, ON, Canada. E-mail: anatole.vonlilienfeld@utoronto.ca
bFaculty of Physics, University of Vienna, Kolingasse 1416, AT1090 Wien, Austria
cVector Institute for Artificial Intelligence, Toronto, ON M5S 1M1, Canada
dChemical Physics Theory Group, Department of Chemistry, University of Toronto, St. George Campus, Toronto, ON, Canada
eML Group, Technische Universität Berlin and Institute for the Foundations of Learning and Data, 10587 Berlin, Germany
fBerlin Institute for the Foundations of Learning and Data, 10587 Berlin, Germany
gDepartment of Physics, University of Toronto, St. George Campus, Toronto, ON, Canada
hAcceleration Consortium, University of Toronto, Toronto, ON, Canada
First published on 23rd October 2024
We study the applicability of the Hammett-inspired product (HIP) Ansatz to model relative substrate binding within homogenous organometallic catalysis, assigning σ and ρ to ligands and metals, respectively. Implementing an additive combination (c) rule for obtaining σ constants for any ligand pair combination results in a cHIP model that enhances data efficiency in computational ligand tuning. We show its usefulness (i) as a baseline for Δ-machine learning (ML), and (ii) to identify novel catalyst candidates via volcano plots. After testing the combination rule on Hammett constants previously published in the literature, we have generated numerical evidence for the Suzuki–Miyaura (SM) C–C cross-coupling reaction using two synthetic datasets of metallic catalysts (including (10) and (11)-metals Ni, Pd, Pt, and Cu, Ag, Au as well as 96 ligands such as N-heterocyclic carbenes, phosphines, or pyridines). When used as a baseline, Δ-ML prediction errors of relative binding decrease systematically with training set size and reach chemical accuracy (∼1 kcal mol−1) for 20k training instances. Employing the individual ligand constants obtained from cHIP, we report relative substrate binding for a novel dataset consisting of 720 catalysts (not part of training data), of which 145 fall into the most promising range on the volcano plot accounting for oxidative addition, transmetalation, and reductive elimination steps. Multiple Ni-based catalysts, e.g. Aphos-Ni-P(t-Bu)3, are included among these promising candidates, potentially offering dramatic cost savings in experimental applications.
On the computational front, traditional methods require separate calculations for each complete compound, which is inefficient for large-scale exploration. Efforts to reduce the cost of discovering new compounds led to the rapid growth of machine learning (ML),10,11 and the combination of experimental and computational techniques via self-driving labs12,13 are emerging. The rise of ML has revolutionized this field by significantly reducing the computational cost of predicting compound properties compared to ab initio methods such as density functional theory (DFT), and as pointed out in ref. 14 and 15. In homogeneous catalysis, ML techniques such as random forest and linear regression have been used to predict catalyst reactivity,16–19 while kernel ridge regression (KRR) and neural networks have successfully modeled binding energies for the Suzuki–Miyaura (SM) cross-coupling reaction,20,21 relevant in drug synthesis.22–24 Additionally, descriptors, or representations, capturing parameters essential in inferring a system's properties have been extensively investigated for different models.15,25–29 Despite the advances, these models often require extensive computations for each catalyst, highlighting the need for a combinatorial strategy that can efficiently explore the catalyst space by integrating the contributions of various building blocks, such as ligands and metals, to optimize performance.30,31
Linear free energy relationships, such as the Hammett equation,32,33 can be harnessed to build models able to partition systems into fragments, thus eliminating the combinatorial complexity. Introduced in 1935,32,33 it has been recognized as a simple but accurate tool for separating substituent and reaction effects on free energy changes. It was first proposed for benzoic acid derivatives and was subsequently shown to be generalizable. Applications to heterocyclic compounds,34 metal–ligand complexes,35 and structure–reactivity relationships in cross-coupling reactions36 have been reported. Improvements to better account for steric effects also exist, such as the Taft or the Charton equations.37–42 The use of these established parameters is however hindered by the unavailability of measurements under consistent conditions for a larger library of substituents. To overcome this, Sterimol parameters have been developed, which instead rely on geometric coordinates43 and have proved useful in asymmetric catalysis.44,45 In contrast, Bragato et al.46 introduced a Hammett-inspired product model (HIP) able to fit parameters to diverse chemistries and properties without the need for external references or geometries. By establishing an internal reference, it also allows for the inclusion of diverse environments for each substituent in the fitting process, resulting in more balanced constants. Some of its successful applications in catalysis include the prediction of adsorption energies of small carbon molecules.47
For further partitioning of substituent effects, we turn to combination rules. The Hammett equation, detailed in the following section, was developed for singly-substituted compounds. However, we deal with complexes containing multiple ligands in organometallic catalysis, so access to the effect of each ligand and ligand combination is essential. Early studies on disubstituted and trisubstituted benzene derivatives using the Hammett equation indicated additive substituent effects under minimal steric inhibition of resonance.48 Diverse combination rules have also been employed for estimating thermodynamic properties of mixtures.49–51 These include using the arithmetic mean for pure component properties to estimate collision diameters52 and the geometric mean for potential well depths53 in the Lennard-Jones potential. The harmonic mean is used for second virial coefficients,54 and the sixth-power mean for rare gas systems.55
In this work, we introduce an approach that partitions the contributions of the metal and each ligand in organometallic catalysts using the SM reaction as a test case. This method facilitates computational ligand tuning through binding energy predictions and their implementation into volcano plots. We assess and propose methods for retrieving and combining individual ligand effects that ensure statistically stable calculations. The combining rule is integrated into a HIP model46,47 (Fig. 1b). Utilizing a dataset of 25k oxidative addition relative binding energies, we also investigate the performance of this combination rule-enhanced HIP model (cHIP) as a baseline for Δ-ML,56 which learns residuals and further mitigates excessive data needs. Subsequently, we show how the design flexibility afforded by cHIP can be used to expand a second, smaller catalyst dataset, DB2, into DB3 and conduct screening (Fig. 1c).
(1) |
Although empirical σ values for common substituents in the ionization of benzoic acid derivatives can be found in the literature, such parameters are absent for many ligands pertinent to homogeneous catalysis. Furthermore, caution is warranted when assuming the transferability of these values for chemistries of different natures.58 Note that in eqn (1), the necessity of establishing a ligand as a reference is due to the ligand space being larger compared to metals. The HIP model, as introduced by Bragato et al.,46,47 enables the investigation of similar reactions by extracting linear scaling factors between them and eliminating reliance on external references. It comprises the following 3 steps:
ΔElm ≃ ρmσl + ΔE0m | (2) |
Furthermore, the binding energy changes can be stored in a matrix M, where X is the number of metals and Y is the number of ligand groups,
(3) |
For each column, the internal reference ΔE0m is defined as its median.
cmnρn − ρm = 0 | (4) |
cmn is computed as the slope of the line of best fit between all ΔElm and ΔEln for 1 ≤ l ≤ Y (see Fig. 2). The procedure adopts Theil–Sen regression,59,60 evaluating the median of pairwise slopes (ΔElm – ΔEkm)/(ΔEln − ΔEkn) for 1 ≤ l < k ≤ Y. Ordering the slopes in a list S, we get
(5) |
Fig. 2 Binding energies for complexes with metals m and n. Each point represents a distinct ligand combination. The regression line, obtained through Theil–Sen regression,46,59,60 has a slope representing the median of all pairwise slopes. |
This method offers the advantage of being more robust towards outliers, but it scales quadratically with the number of data points.Accounting for all permutations of m and n, m ≠ n, eqn (4) yields an overdetermined system of linear equations Cρ = 0 which can be used to solve for ρs. A comprehensive description of this matrix is provided in the supplementary information† section (SI).
(6) |
In order to reconfirm the validity of this approach we revisited the previously published data. In particular, the effect of a group of ligands l is approximated as
(7) |
D = σl | (8) |
Results on display in Fig. 3 confirm our expectation that σs of combinations are additive in σs of single ligands. These experimental constants were obtained from studies on the hydrolysis of phosphonium salts,61 alcoholysis of isocyanates,62 and various reactions involving benzoic acids.48 The datasets encompassed di- and trisubstituted compounds featuring small substituents such as Cl, CH3, OCH3, NO2, etc., each with established Hammett parameters. Furthermore, note that the sums of parameters obtained through linear regression (Fig. 3b) are consistently closer to the experiments than the sums of Hammett parameters. This enhanced accuracy of the least squares method, presumably due to improved regularization and balancing, suggests its capability to robustly account for environment-specific synergistic effects, typically absent within the arbitrarily selected experiments resulting in the initial Hammett parameters. We also report the performance of other combination rules in the ESI,† which revealed less accuracy than the additive rule.
Fig. 3 Test of combination rule for experimentally obtained substituent parameters describing reactions of various chemistries (see text) published decades ago by Siegel,61 Kaplan,62 Jaffé.48 Substituent constants () were obtained by (a) summing published σ, and (b) after linear regression for each of the three published sets (this work). |
For estimating relative binding energies to catalyst complexes in the SM coupling reaction we have first applied eqn (8) to ligand pairs using the σs obtained from HIP as σl. Then, cHIP predictions, ΔEc, can be obtained for any ligand pair ij and catalyst metal m via
ΔEijm ≃ ΔEcijm = ρmij + ΔE0m = ρm(σi + σj) + ΔE0m | (9) |
The cHIP results served as a baseline for Δ-ML with KRR.56 KRR, a supervised ML technique initially introduced in chemistry for learning molecular atomization energies,63 maps data from the input space to a feature space and calculates the dot product of the transformed vectors. The mapped data in this case are the representations, which contain information derived from the molecular structure. For a given set of N training instances, the corrected predictions after Δ-ML are obtained using
(10) |
(11) |
(12) |
α = (K + λI)−1y | (13) |
Purpose | Catalyst structure | Metals | # of ligands | Reaction steps | # of data entries | Data origin | |
---|---|---|---|---|---|---|---|
DB1 | Test combination rule | Li − Mm − Lj | Ni, Pd, Pt, Cu, Ag, Au | 91, see ESI | Oxidative addition | 25116 | 7054 from DFT20 + 18062 from KRR using MBDF and Laplacian kernel (this work) |
DB2 | Catalyst discovery | Li − Mm − Li | Ni, Pd, Pt, Cu, Ag, Au | 16, see ESI | Oxidative addition, transmetalation, reductive elimination | 276 | 276 from DFT (92 per reaction step)64 |
DB3 | Novel catalysts | Li − Mm − Lj | Ni, Pd, Pt, Cu, Ag, Au | 16, see ESI | Oxidative addition, transmetalation, reductive elimination | 720 | cHIP (this work) |
The first one, referred to as Database 1 (DB1) was obtained from ref. 20. It contains a total of 91 ligands of types phosphines (P), N-heterocyclic carbenes (NHC), pyridines (Py), and other common ligands (Other). Their chemical structures are provided in the SI. Those ligands were combined with six transition metals (Ni, Pd, Pt, Cu, Ag, and Au) to form catalysts having the structure Li − Mm − Lj, where Li and Lj are ligands and Mm is a metal, spanning a total set of 25116 compounds. It contained relative binding energies relative to the oxidative addition step in the SM C–C cross-coupling reaction depicted in Fig. 1a for 4186 Li – Lj combinations, each coupled with all 6 metals. The distributions of the energies by metal and by ligand type combinations, shown in Fig. S1 and S2† respectively, primarily show that the energies are more strongly correlated with metals than with ligands.A subset of 7054 geometries in DB1 was optimized using the AiiDA automated platform65 at the B3LYP-D3/3-21G66,67 level of theory for the Ni, Pd, Cu, and Ag complexes, and B3LYP-D3/def2-SVP68 for the Pt and Au complexes in Gaussian09.69 Subsequently, B3LYP-D3/def2-TZVP70 single-point calculations were performed. The remaining 18062 energies were predicted by us using a KRR model trained on the DFT energies using the Many-Body Distribution Functionals (MBDF)27 representation and a Laplacian kernel, providing full coverage of all ligand–metal combinations at a reduced computational cost. Three other representations and kernel combinations were built for comparison, namely Coulomb Matrix15 and Bag of Bonds (BoB)26 with a Laplacian kernel, and Spectrum of London and Axilrod–Teller–Muto potential (SLATM)25 with a Gaussian kernel, hence reproducing Meyer et al.'s work.20 As shown in Fig. 4, the comparison leads to the conclusion that MBDF, with the Laplacian kernel, yields the best result.
The second and smaller dataset (Database 2 or DB2) was obtained from ref. 64 and provides relative binding free energies for all 3 intermediate steps depicted in Fig. 1a. The values include unscaled enthalpies and vibrational entropy contributions. It contains symmetrical complexes, composed of the same 6 metals as in the first dataset and 16 ligands. 10 out of those 16 ligands were also present in the first one, and the energies of 4 Cu-based complexes were missing, yielding only 92 complexes per reaction step. The geometries of these complexes were optimized using M06/def2-SVP68,71,72 in Gaussian09 (ref. 69) while accounting for solvation in tetrahydrofuran using the implicit SMD model.73
Fig. 4 Test of ML model used to augment the DFT data in DB1 (Table 1). Learning curves (test error of relative binding energies vs. training set size) for oxidative addition for different ML methods. L: Laplacian kernel, G: Gaussian kernel. The augmentation was done using the MBDF-based model trained on all 7054 DFT training points published in ref. 20. |
Note how in Fig. 5, no skewing is observed (see also the error distribution curve per metal in Fig. S5†). As displayed as insets, outliers correspond to varying metals. However, they all have in common that there is one P ligand. Furthermore, all underestimated outliers correspond to catalysts that share the same ligands, proazaphosphatrane, and 2-fluoropyridine.
Most outliers of the cHIP model (Fig. 5) were also outliers already for the HIP model (Fig. S3†). This suggests that, beyond the decrease in dimensionality caused by the combination rule, these shortcomings are likely to be caused by the Hammett equation's inadequacy in describing those specific ligands. This inadequacy can only be partly explained by steric hindrance since the dataset also comprised several bulky NHC ligands which were not outliers. We also note that since many systems used for fitting are sterically hindered, some steric effects could be included in the cHIP parameters.
Performing the regression on subsets according to the categories of each of its ligands further improved the prediction accuracy. In Fig. 8, all the cHIP models fitted on the subsets had lower errors than the full dataset model. These observations agree with the statement that a Hammett correlation occurs between closely related species.35 They are also in line with the fact that regression is more difficult the higher the dimensionality.74 As such, cHIP promises to be a useful model for combinatorially scaling spaces (the combinations of ligands in this case) with low-dimensional chemistry-specific properties, and where limited training instances are available.
The individual σs were retrieved by applying the combination rule on the HIP predictions to produce the Hammett plot in Fig. 6. Moreover, the distinct clustering of complexes by their central metal and the linear trends in Fig. 6 validate the adequacy of the partitioning of ρ and σ.
Fig. 6 Demonstration of cHIP. Oxidative addition relative binding energies (DB1, see Table 1) are shown as a function of averaged σ obtained from cHIP. For the sake of simplicity, only 5 examples from each ligand type combination (given as legends) are plotted for each of the three metals. Lines correspond to cHIP predictions. |
A closer look at the calculation of the individual σs for DB1 in Fig. 7 revealed a MAE of 2.5 kcal mol−1 between the σs from HIP and cHIP. While an overestimation is usually observed with the additive rule when summing existing Hammett parameters,48,75 we have circumvented this by fitting those parameters with linear regression. The corresponding error distribution of cHIP σs (inset of Fig. 7) illustrates the welcome absence of any bias.
Fig. 7 Test of combination rule for individually obtained substituent parameters using DB1 (Table 1). Averaged (cHIP on DB1) vs. reference σ (HIP on DB1) for combinations of ligands i and j. The inset shows the corresponding error distribution. |
Fig. 8 Mean absolute error of predicting relative binding energies in the oxidative addition step as a function of training set size. Errors of cHIP models on subsets (colored) and full DB1 (Table 1) plateau rapidly. ML corresponds to the KRR/MBDF line shown in Fig. 4. Δ-ML model trained on DB1 results with cHIP baseline results in lower offset and enables convergence to chemical accuracy. Error bars indicate standard deviations. |
The fitting of HIP to the entire dataset, illustrated in Fig. 9a, resulted in MAEs consistently below 3.5 kcal mol−1 for all steps. Leveraging the effects of 16 single ligands, cHIP successfully predicted the ligand effects of 120 new ligand combinations. Pairing these with each of the 6 metals, we predicted relative binding free energy changes for an additional 720 catalysts, reported in DB3.
Fig. 9 Volcano plot: negative relative binding free energies of each step are plotted as a function of that of the oxidative step (oxi). Ideal catalysts lie at the top of the volcano and vertical lines represent the ideal range as identified by Busch et al.64 (a) catalysts sourced from DB2 (Table 1). (b) novel catalysts predicted using cHIP, the 198 most interesting catalysts lie in between the two intersections of the transmetalation with either oxidative addition or reductive elimination. Two inexpensive (Ni-based) catalysts in the optimal regime are indicated by arrows. |
As depicted in Fig. 9b, 198 of the new catalysts lie at the top of the predicted volcano plot. Among these, 145 displayed oxidative addition relative binding free energies ranging from −34.0 to 17.0 kcal mol−1, previously identified as an optimal range by Busch et al.64 This combinatorial approach revealed several Ni-based catalysts approaching the top of the volcano after ligand tuning, despite the initially strong-binding nature of Ni. The discovery of these catalysts, derived from a metal that is more cost-effective and earth-abundant than the prominent Pd, is particularly attractive and has been actively explored in the field over the past decade.30,78 When considering the cost of the ligands, the most cost-effective catalyst identified by cHIP is Aphos-Ni-P(t-Bu)3, representing about 67% of the cost of the least expensive catalyst found in DB2, Pd(Ace)2, based on ligand and metal prices provided by Sigma-Aldrich.79 Similar phosphine ligands for Ni catalysts have been reported in literature such as ProPhos80 or P(Cy)3.81
Notably, the model was able to capture a reasonable trend across both datasets, despite DB1 lacking solvent effects and DB2 incorporating implicit solvation. This suggests that the cHIP model could potentially make equally reliable predictions when trained on datasets that include explicit solvation effects.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00228h |
This journal is © The Royal Society of Chemistry 2024 |