Hannes
Kneiding
a,
Ruslan
Lukin
a,
Lucas
Lang
a,
Simen
Reine
a,
Thomas Bondo
Pedersen
a,
Riccardo
De Bin
b and
David
Balcells
*a
aDepartment of Chemistry, Hylleraas Centre for Quantum Molecular Sciences, University of Oslo, P.O. Box 1033, Blindern, 0315 Oslo, Norway. E-mail: david.balcells@kjemi.uio.no
bDepartment of Mathematics, University of Oslo, P. O. Box 1053, Blindern, Oslo, Norway
First published on 23rd March 2023
Machine learning can make a strong contribution to accelerating the discovery of transition metal complexes (TMC). These compounds will play a key role in the development of new technologies for which there is an urgent need, including the production of green hydrogen from renewable sources. Despite the recent developments in machine learning for drug discovery and organic chemistry in general, the application of these methods to TMCs remains challenged by their higher complexity and the limited availability of large datasets. In this work, we report a representation for deep graph learning on TMCs – the natural quantum graph (NatQG), which leverages the electronic structure data available from natural bond orbital (NBO) analysis. This data was used to define both the topology and the information expressed by the NatQG graphs. At the topology level, two different NatQG flavors were developed: u-NatQG, with undirected edges, and d-NatQG, with edges directed along donor → acceptor orbital interactions. At the information level, the node and edge attribute vectors of both graphs contain NBO data, including natural charges and bond orders. The NatQG graphs were used to develop graph neural networks (GNNs) for the prediction of the quantum properties underlying the structure and reactivity of TMCs (e.g. HOMO–LUMO gap and polarizability). These models surpassed baselines based on traditional descriptors and performed at a level similar to, or higher than, state-of-the-art GNNs based on radial cutoffs. The results showed that the electronic structure information encoded by the models has a stronger impact on its accuracy than the geometric information. With the aim of benchmarking the GNNs, we also developed the transition metal quantum mechanics graph dataset (tmQMg), which provides the geometries, properties, and NatQG graphs of 60k TMCs.
A key advantage of molecular graphs is their direct connection to skeletal formulae (Fig. 1), which can be regarded as the most universal language used by chemists. When combined with graph neural networks (GNNs),54 the resulting models have achieved state-of-the-art accuracy in the prediction of various properties.55 Further, in the context of explainable AI,56–59 the interpretation of the GNNs60–62 can refer to a skeletal formula, providing interpretations that are immediately intuitive. GNNs and related graph-based methods have also succeeded in other challenging tasks, including the generation and inverse design of molecular systems.63–65
Transition metal complexes (TMCs) are a diverse family of compounds, including bioinorganic, Werner, and organometallic complexes, with key applications in multiple fields including catalysis66 (e.g. synthesis of fine chemicals), nanomaterials67 (e.g. electronic devices), medicinal chemistry68 (e.g. anticancer drugs), and renewable energies69 (e.g. photosensitizers). The development of accurate GNN models for the discovery and design of new TMCs with optimal properties is motivated by the strong societal impact of these applications. In line with this, there is a growing interest in the development of data-driven approaches to the study of TMCs and their applications.70–79
For organic compounds, the derivation of molecular graphs is straightforward (Fig. 1) and can be done from different inputs, including geometries and line notations (e.g. SMILES80 and SELFIES81). In line with this, most GNN models have been developed for, and tested on, organic molecules, often in the field of drug discovery.3 In contrast, TMCs are more difficult to express as graphs due to the metal d orbitals, which yield larger valences and multi-center bonds. In this context, the representation of a TMC can become ambiguous, with multiple possible graphs of different topology. This may include disconnected graphs limiting the applicability of GNNs. Fig. 1 illustrates this problem for the Zeise's salt, the first historical example of a metal–olefin complex.82 Graph generation from either line notations or geometries does not fully solve this problem – the former either don't support or are not robust for TMCs, and, from the latter, it is difficult to define the atomic connectivity. Nonetheless, geometric information is highly valuable and it has been used successfully to inform several graph representations with the aim of increasing the accuracy of GNN models.83 In contrast, the use of electronic structure information for the same purpose remains largely unexplored,84,85 despite its availability from geometry optimization calculations and its low computational cost.
In this article, we introduce the natural quantum graph representation (NatQG) for TMCs and its implementation into GNN models based on message-passing algorithms.86 These models leverage the inductive bias provided by natural bond orbital (NBO) theory,87 which transforms the quantum wave function into a set of localized molecular orbitals (i.e. the NBOs) corresponding to the electron pairs of a Lewis structure. In the context of this theory, second-order perturbation analysis87 (SOPA) yields the nature and strength of the interactions between pairs of NBO orbitals based on their energy difference and overlap. The NBO and SOPA data were used to define the topology and inform the nodes and edges of undirected (u-NatQG) and directed (d-NatQG) molecular graphs for TMCs (Fig. 2 and 3, respectively), which were used in the prediction of their quantum properties with GNNs.
![]() | ||
Fig. 2 Derivation of the Zeise's salt u-NatQG graph. Abbreviations used for the NBO orbitals: LP = lone pair, LV = lone vacancy, BD = bonding, BD* = antibonding. ΛBO = natural bond order threshold. |
![]() | ||
Fig. 3 Derivation of the Zeise's salt d-NatQG graph. Abbreviations used for the NBO orbitals: LP = lone pair, LV = lone vacancy, BD = bonding, BD* = antibonding. |
With the aim of benchmarking the GNNs, we developed the transition metal quantum mechanics graph (tmQMg) dataset, which provides the NatQG graphs of 60k TMCs together with their DFT geometries and properties. For most properties, the accuracy of the NatQG GNNs surpassed that of other models, including graphs that were either informed with classical descriptors or built from cutoff radius. This includes the HOMO–LUMO gap of the TMCs, which underlies several TMC properties of high interest, including conductivity, photochemistry, and thermal stability. The present work also showed how the electronic structure data from a single-point calculation of the energy can be leveraged in machine learning models to predict expensive quantum properties requiring the calculation of energy derivatives, including the polarizability and the thermodynamic corrections.
In this work, we used NBOs and their donor–acceptor interactions to derive two types of natural quantum graphs (NatQG) differing in the nature of their edges, which are either undirected (u-NatQG) or directed (d-NatQG). There is no node redundancy in either graph (i.e. each node represents a single atom of a TMC), and both contain geometric information (i.e. bond distances). For generating the graphs, we developed the Hylleraas deep graph learning (HyDGL) program, with code openly available at https://github.com/hkneiding/HyDGL.
In this study, both the geometries and the NBOs were computed with DFT methods. However, these properties can also be obtained at lower levels of theory; e.g. NBOs can be calculated with DFTB methods,88 reducing the computational cost by two orders of magnitude. It should be noted that changing the level of theory at which these properties are obtained may affect the accuracy of the ML models in which the u-NatQG and d-NatQG representations are leveraged.
The graph topology resulting from the NBOs has a major drawback – it can be disconnected (Fig. 2e) and, therefore, in a GNN, message passing cannot span the whole graph regardless of the model depth. The disconnectedness arises from the Lewis structure generated in the NBO calculation, which can exclude some of the metal–ligand bonds yielding isolated fragments (e.g. ethylene in the Zeise's salt). This problem was solved by defining a natural bond order threshold (ΛBO). After applying the ΛBO ≥ 0.05 bonding condition to all possible metal–atom pairs, the Zeise's salt u-NatQG graph became fully connected (Fig. 2f). This ΛBO value was set after inspecting graph connectedness over the 60k TMCs included in the tmQMg dataset (vide infra).
After defining the topology, the u-NatQG graphs are informed with attribute vectors expressing the NBO electronic structure (Fig. 2g). At the node level, these attributes include the natural atomic charge, valence index, and electron configuration, whereas the edges are attributed with the natural bond order. In addition, both the nodes and the edges encode features of the LP/LV and BD/BD* NBO orbitals, respectively, including orbital type, number, energy, electron occupancy, and symmetry (i.e. spd hybridization). Table 1 provides a systematic list and further details of the u-NatQG attributes.
Attribute | Description |
---|---|
a Atomic charges and valences from NBO analysis. b Wiberg-based. c In the natural electron configuration. d This and all other LP and LV attributes are set to zero when the node is not associated to these NBO types, and the same approach is applied to the energy gap when there is a single LP or LV. e Percentage of orbital character in NAO basis (hybridization). f Either BD or three-center (3C) NBOs. g This and all other BN and BN* attributes are set to the graph-average values for the edges build with the ΛBO ≥ 0.05 condition. h Restricted to NBOs of the same type. i Counting BD*, 3Cn, and 3C* orbitals. | |
Nodes | |
Z | Atomic number |
N H | Number of H atoms attached to the node |
q Nat | Natural atomic chargea |
V Nat | Natural valence indexa,b |
N VEl | Number of s, p, and d valence electrons; NVEl = (Ns, Np, Nd)c |
N LP | Number of lone pair (LP) NBOsd |
E LP | Energy of the highest-lying LP |
O LP | Electron occupancy of the highest-lying LP |
S LP | s, p, and d orbital symmetries of the highest-lying LP; SLP = (sLP, pLP, dLP)e |
ΔELP | Energy gap between highest- and lowest-lying LP |
N LV | Number of lone vacancy (LV) NBOs |
E LV | Energy of the lowest-lying LV |
O LV | Electron occupancy of the lowest-lying LV |
S LV | s, p, and d orbital symmetries of the lowest-lying LV; SLV = (sLV, pLV, dLV)e |
ΔELV | Energy gap between lowest- and highest-lying LV |
![]() |
|
Edges | |
BO | Natural bond orderb |
d | Bond distance |
T BN | Bonding NBO (BN) type; i.e. 2-center or 3-center (one-hot encoded) |
N BN | Number of bonding NBOsf |
BNE | Energy of the highest-lying BNg |
O BN | Electron occupancy of the highest-lying BN |
S BN | s, p, and d orbital symmetries of the highest-lying BN; SBN = (sBN, pBN, dBN)e,g |
ΔEBN | Energy gap between lowest- and highest-lying BNh |
N BN* | Number of non- and anti-bonding NBOs (BN*)i |
E BN* | Energy of the lowest-lying BN*g |
O BN* | Electron occupancy of the lowest-lying BN* |
S BN* | s, p, and d orbital symmetries of the lowest-lying BN*; SBN* = (sBN*, pBN*, dBN*)e,g |
ΔEBN* | Energy gap between lowest- and highest-lying BN*h |
In principle, the combination of NBO- and ΛBO-based edges (eNBO and eΛ, respectively) for enforcing the connectedness of u-NatQG (Fig. 2f) would yield heterogeneous graphs with attribute vectors of different dimensionality, challenging their exploitation in GNN models. This issue is caused by the different amount of data available in each case – whereas all orbital parameters are available to inform the eNBO edges, yielding a total of eighteen dimensions (Table 1), for the eΛ edges only two dimensions can be defined (the bond order and distance). This problem was solved by informing eΛ with the same eighteen dimensions of eNBO, using the graph-averaged values to assign the unknown orbital parameters. It should be noted that, in practice, the amount of eNBO edges is ca. ten times larger than that of eΛ edges (vide infra).
The SOPA data was used to build the directed d-NatQG graphs, in which the interacting node pairs (ni, nj) are connected with directed ni → nj edges accounting for ni-to-nj donor–acceptor interactions.
Fig. 3 shows the derivation of the d-NatQG graph of the Zeise's salt. For the bonding between platinum and ethylene, the SOPA yields a BDCC → LVPt interaction for the π → d donation from the ligand to the metal center, and an LPPt → BD*C
C interaction for the d → π* backdonation from the metal center to the ligand. In d-NatQG, these interactions are expressed with a directed graph topology including Pt ⇄ C edges, in which the relationship expressed in one direction, Pt ← C (BD-to-LV donation), is different from that expressed in the opposite direction, Pt → C (LP-to-BD* backdonation). When an atom pair is involved in multiple donor–acceptor interactions in either one direction or in both, d-NatQG accounts only for the strongest (i.e. the one yielding the largest E(2) value). In order to avoid a redundant excess of edges, the latter are only added to the graph if they represent an interaction with E(2) > 1 kcal mol−1.
Once the d-NatQG graph is built, it is informed with electronic structure information. The node attribute vectors contain the same NBO data used in the u-NatQG graphs. In contrast, the edge attributes are mostly extracted from the SOPA, including the orbital type, energy, occupancy, and symmetry of the donor and acceptor NBOs. Further, the bond order and the maximum and average values of E(2) are included. Table 2 provides a systematic list and further details of the d-NatQG attributes.
Attribute | Description |
---|---|
a E(2)MAX when there is a single interaction. b This and all other properties are for the NBOs yielding the strongest donor–acceptor interaction for the node pair connected by the edge (i.e. largest E(2) value in the SOPA). c Restricted to NBOs of the same type. | |
Nodes | |
Z, NH, qNat, VNat, NVEl | As described for u-NatQG in Table 1 |
![]() |
|
Edges | |
BO | Natural bond order |
d | Bond distance |
E(2)MAX | SOPA stabilization energy for the strongest donor–acceptor interaction |
E(2)Avg | Average of the SOPA stabilization energiesa |
T D | Donor NBO type;bi.e. LP, BD, or 3C (one-hot encoded) |
E D | Energy of the donor NBO |
O D | Electron occupation of the donor NBO |
S D | s, p, and d orbital symmetry of the donor NBO; DSym = (Ds, Dp, Dd) |
ΔED | Energy gap between lowest- and highest-lying donor NBOc |
T A | Acceptor NBO type;bi.e. LV, BD*, 3Cn, or 3C* (one-hot encoded) |
E A | Energy of the acceptor NBO |
O A | Electron occupation of the acceptor NBO |
S A | s, p, and d orbital symmetry of the acceptor NBO; ASym = (As, Ap, Ad) |
ΔEA | Energy gap between lowest- and highest-lying acceptor NBOc |
In contrast with u-NatQG, the connectedness of the d-NatQG is guaranteed by the SOPA analysis, without requiring the definition of a threshold. However, from a skeletal formula perspective, d-NatQG is more exotic, with missing edges in positions where there are covalent bonds (e.g., in the Zeise's salt, between the two carbon atoms of the ethylene ligand). In terms of explainability, this may make the d-NatQG graphs less intuitive though it should be also noted that they express, with directionality, the fundamental interactions commonly used by chemists to conceptualize the structure and bonding of TMCs, including π-backdonation.
In addition to the electronic structure attributes of Tables 1 and 2, both the u-NatQG and d-NatQG graphs include information on chemical composition and geometry, as well as whole-graph attributes. Chemical composition is encoded by including the atomic number in the node attribute vectors. The graphs also include hydrogen atoms explicitly, as nodes, which allows for including features that are relevant in the chemistry of TMCs; e.g. hydride complexes and agostic interactions. For implicit representations, the number of hydrogen atoms attached to each node is also available. At the geometric level, the edges were informed with the interatomic bond distance. Further, a whole-graph attribute vector provides the charge of the TMC, its molecular mass, and the total number of atoms and electrons. In TMCs containing three-center bonding (3C), non-bonding (3Cn), and antibonding (3C*) NBOs, the data of these orbitals was also used to define the topology and attributes of the graphs. When BD and 3C orbitals overlapped at a given edge, the data of the latter was used to build the graph. Neither of the two graph representations contain information about the core and Rydberg NBOs.
In both u-NatQG and d-NatQG, the definition of the NBOs from a localized Lewis structure can partially break the symmetry of the system (e.g. trans-Cl bonds become non-equivalent in Fig. 2b), which may have an impact on the predictions made by the GNN models (vide infra). Further, the NatQG graphs encode the NBO orbitals implicitly, embedding their defining parameters into the node and edge attribute vectors of a molecular graph that, especially in the case of u-NatQG, can be directly related to the skeletal formula of the represented TMC. An alternative approach, recently explored by Gomes et al.,89 consists in representing LP and BD orbitals with additional explicit nodes. The NatQG graphs may also be used to develop a string representation with rich electronic structure information, similar to the representation developed by Dietz.90 Further, these graphs could also be useful in the context of the zero-order bond approach developed by Clark.91
The TMCs of the tmQMg dataset were extracted from the Cambridge Structural Database (CSD; 2020.0 release) by applying a series of filters on structure and composition, which yielded 3D-resolved non-polymeric and non-disordered structures with a single metal center, containing C and H, and also allowing for B, Si, N, P, As, O, S, Se, F, Cl, Br, and I. Co-crystallizing molecules (e.g. solvent) were excluded, and filters on charge (q) and the number of electrons (Ne) and atoms (Natoms) were also applied. The TMCs included in tmQMg have q ∈ {+1, 0, −1}, even Ne, and Natoms ≤ 85.
Fig. 5 shows a random selection of ten different TMCs, one for each transition metal group. From a composition perspective, and in addition to the metals, these complexes contain nine different elements (C, H, O, S, N, P, F, Cl, and Br), whereas, from a structural perspective, they contain nineteen different ligands, including both monodentate and chelating ligands, binding to the metal center in eight coordination modes (monodentate, (κ, η2), η5, κ5, (κ2, η2), κ2, η6, and κ3) and three coordination numbers (4, 6, and 8). The diversity of this small selection, which represents only 0.016% of the overall dataset, reflects the complexity of the chemical space within tmQMg.
For all TMCs in tmQMg, the quantum data was obtained from three different DFT calculations carried out for the closed-shell singlet state in this order:
(1) Full geometry optimization at the PBE-D3BJ/def2-SVP level.92–94
(2) Calculation of frequencies and thermochemistry at the PBE-D3BJ/def2-SVP level.92–94
(3) Single-point energy and NBO calculation at the PBE0-D3BJ/def2-TZVP level.93–95
When any of these three calculations failed, the system was excluded from the dataset. The overall success rate of the calculations was 88.7%. Calculation 1 yielded fully optimized energy minima. Complexes with the same stoichiometry and energy (e.g. duplicates and enantiomers) were excluded. In calculation 2, only geometries giving all-real frequencies were included in the dataset. In addition to the geometries, the following quantum properties were extracted from the output of these two calculations: the double-ζ potential, zero-point, internal, entropy, enthalpy, and free energies, heat capacity at constant volume, isotropic polarizability, and lowest and highest harmonic vibrational frequencies. Calculation 3 yielded the NBO parameters used to build and attribute the u- and d-NatQG representations (Fig. 2 and 3, and Tables 1 and 2), as well as the dipole moment, the triple-ζ potential and dispersion energies, the HOMO and LUMO energies, the HOMO–LUMO gap, and the natural charge of the metal center. All these quantum properties are included in the tmQMg dataset. The ESI† provides statistics on tmQMg, including molecular charge, size, and composition, as well as pair plots showing the degree of correlation between the different quantum properties (Fig. S1–S3†).
Besides the optimization of the GNN models reported in this study (vide infra), the NBO data available from tmQMg was also used to develop the NatQG representations. Whereas the connectedness of the d-NatQG representation (Fig. 3) was guaranteed by the SOPA-based definition of its topology, u-NatQG (Fig. 2) required a natural bond order threshold (ΛBO) to define a connected topology around the metal center. Fig. 6 shows how the number of disconnected graphs decreases with the ΛBO threshold for the whole tmQMg dataset. The Wiberg ΛBO reduced disconnectedness more rapidly than the NLMO; e.g. at ΛBO = 0.20, there were either 11034 (Wiberg) or 19
493 (NLMO) disconnected graphs. For this work, we used a Wiberg ΛBO ≥ 0.05 threshold to define the u-NatQG topology, with which only 3.9% of the graphs (2370 TMCs) remained disconnected. Many of these disconnected graphs represent group 11 and 12 TMCs with weakly bound molecular fragments that, from a covalent bond perspective, may not be considered metal ligands. ΛBO can thus be used as a parameter modulating the connectedness of the u-NatQG graphs depending on the strength of the metal–ligand bonds. In the disconnected graphs, the metal-free fragments can be easily identified as isolated subgraphs and be further processed as needed (by e.g. connecting or excluding them). With ΛBO ≥ 0.05, the average ratio over the entire dataset between NBO-based (eNBO) and ΛBO-based (eΛ) edges in the u-NatQG graphs was eNBO/eΛ = 10.4.
![]() | ||
Fig. 6 Number of disconnected u-NatQG graphs vs. the Wiberg (orange) and NLMO (blue) natural bond order thresholds (ΛBO). |
The metal complexes included in tmQMg exist in the CSD and, therefore, they are accessible through synthetic routes described in the literature. This feature may enhance the reliability of generative models trained with tmQMg, though it may also introduce biases (e.g. TMC tendency to form crystals of the quality required for structure determination by diffraction techniques). Further, the tmQMg dataset can be used to benchmark deep graph learning models for TMCs, including convolutional embedding.96 Another potential application of tmQMg is the transformation of the NatQG graphs into vector and string representations; e.g. autocorrelations97 and SELFIES,81 respectively.
The previous version of the dataset, tmQMg,71 did not provide the graphs and most of its quantum properties, including the geometry, were calculated with the semiempirical GFN2-xTB method. The update provided by tmQMg adds the quantum geometries and properties computed at the DFT PBE-D3BJ/def2-SVP//PBE0-D3BJ/def2-TZVP level. The two datasets can thus be combined to train Δ-ML98 models predicting xTB-to-DFT corrections. The tmQMg data is openly available at https://github.com/hkneiding/tmqmg.
For the MPNN models, we used the gated graph flavor, which includes a gated recurrent unit (GRU) to mitigate over-smoothing in message passing.86Fig. 7 shows the MPNN architecture used in this study, which, after embedding the node and edge attributes of the NatQG graphs, applies the GRU, and, in the readout layer, uses the set2set attention mechanism for pooling. We also experimented with the addition of a concatenation operation augmenting the set2set output with the whole-graph attribute vector before passing the final embedding to the prediction layer (MPNN⊕G model).
![]() | ||
Fig. 7 The MPNN architecture operating over the node (n), edge (e), and graph (G) attribute vectors of the u- and d-NatQG graphs (Fig. 2 and 3). ⊕ = concatenation. |
The MXMNet architecture encodes molecules as a multiplex graph including local and global representations in two separated layers. The local layer accounts mainly for covalent interactions and includes geometric information in the edges. In contrast, the global layer represents non-covalent interactions by connecting the atomic nodes within a cutoff distance of 10 Å. Besides standard message passing within each layer, a cross-layer mapping is used to exchange information between the two layers. Adding to the base implementation of Xie,99 in which the graphs were built and informed with a molecular mechanics force field, we developed an MXMNet model in which the NatQG graphs were used as the local layer.
The performance of the NatQG-based MPNN and MXMNet models on the test dataset was assessed using the metrics collected in Table 3 and the correlation plots shown in Fig. 8. In the prediction of the HOMO–LUMO gap, the u-NatQG MPNN achieved the highest accuracy with a MAE of 6.02 mHa and r2 = 0.910. This accuracy, in the milli-Hartree scale, appears to be remarkable given the complexity and diversity of the tmQMg dataset. The HOMO–LUMO gap is a key quantum property of TMCs related to stability and conductivity, which both have a strong impact on applications like catalysis and photovoltaic materials. The performance of the MXMNet models was poorer though they gave an interesting result; i.e. the u-NatQG and d-NatQG implementations achieved higher accuracies than the original base model based on molecular mechanics, showing the value of using electronic structure analysis data (here NBO) to define the topology and attributes of the graph.
Architecture | Graph | HOMO–LUMO gap | Polarizability | Dipole moment | |||
---|---|---|---|---|---|---|---|
MAE | r 2 | MAE | r 2 | MAE | r 2 | ||
MPNN | u-NatQG | 6.02 | 0.910 | 5.00 | 0.995 | 0.819 | 0.879 |
d-NatQG | 7.22 | 0.873 | 5.17 | 0.993 | 1.019 | 0.835 | |
MPNN⊕G | u-NatQG | 6.04 | 0.910 | 4.94 | 0.995 | 0.895 | 0.858 |
d-NatQG | 7.19 | 0.877 | 4.96 | 0.994 | 0.981 | 0.845 | |
MXMNet | Base | 9.36 | 0.778 | 4.83 | 0.994 | 0.943 | 0.805 |
u-NatQG | 8.22 | 0.800 | 3.76 | 0.997 | 0.849 | 0.850 | |
d-NatQG | 9.07 | 0.795 | 3.98 | 0.996 | 0.838 | 0.863 | |
SchNet | CRG | 12.6 | 0.693 | 6.81 | 0.991 | 1.45 | 0.729 |
EdgeUpdate | CRG | 10.2 | 0.785 | 5.67 | 0.993 | 1.13 | 0.696 |
DimeNet++ | CRG | 10.3 | 0.789 | 5.37 | 0.994 | 1.28 | 0.759 |
ALIGNN | CRG | 7.72 | 0.859 | 5.43 | 0.993 | 0.705 | 0.900 |
![]() | ||
Fig. 8 Correlation plots between the true values (i.e. DFT-computed) and the values predicted by the NatQG-based MPNN models. |
In contrast with the HOMO–LUMO gap, MXMNet made more accurate predictions for the polarizability and, based on the u-NatQG graph, yielded the lowest MAE of all models tested, with a value of 3.76 bohr3 (r2 = 0.997). With the best MPNN model, this MAE was larger (4.94 bohr3), though the r2 score remained high (0.995) due to the wide range and spread of the polarizability in the tmQMg dataset, compared to other popular datasets containing smaller organic molecules (e.g. QM9101). Regarding the dipole moment, both models yielded MAEs within the range of [0.819–1.019] D. An interesting result with MXMNet is that the base and the NatQG models gave very similar MAEs, with the latter being slightly smaller. This suggests that the partial loss of symmetry that may occur in some systems upon localizing the NBOs does not affect to a large extent the prediction of the dipole moment. Symmetry loss does not seem to have a strong impact on the MPNN models either, which yielded the second lowest MAE for the dipole moment (0.819 D).
GNN models based on cutoff radius graphs (CRG) derived from the atomic coordinates of the tmQMg dataset were also considered. In particular, the performance of the SchNet,102 SchNet with edge updates (EdgeUpdate),103 DimeNet++,104 and ALIGNN105 GNNs was assessed and compared to that of the MPNN and MXMNet. The advanced features of these models include continuous-filter convolutions (SchNet and EdgeUpdate), directional message passing with spherical harmonics (DimeNet++), and line graphs (ALIGNN). For all these four models, the CRG graphs were built with a topology based on a cutoff radius, informing the nodes with the atomic number and the edges with the interatomic distances. ALIGNN uses additional node attributes (e.g. group number and atomic volume) and geometric information (i.e. bond angles), which is also leveraged in DimeNet++. From the perspective of explainability, and in contrast with NatQG, these models are more difficult to relate to the chemical intuition around TMCs because the topology of the CRG graphs differs significantly from that of the skeletal formulae, and their attributes do not refer directly to the electronic structure descriptors used to rationalize the properties of TMCs. The metrics of Table 3 showed that, in general, the NatQG-based GNNs outperformed the CRG models with the exception of the dipole moment, for which ALIGNN gave the lowest MAE and largest r2 (0.705 D and 0.900, respectively).
The performance of the GNN models was also benchmarked against a baseline. The results obtained with the NatQG MPNN models, which were among the most accurate (Table 3), were compared to those obtained upon replacing all NBO data in the nodes and edges by generic properties. The (Z, T, S, χ) vector of properties, where Z = atomic number, T = valence (node degree), S = covalent radius, and χ = Pauling electronegativity, was used to attribute the nodes. These properties have been previously used to compute autocorrelation functions for TMCs.97 The edges were attributed with the (BO, d) vector, where BO = bond order and d = bond distance. Table 4 and Fig. 9 show the results obtained with this baseline representation, together with those of the u-NatQG and d-NatQG graphs. In addition to the HOMO–LUMO gap, polarizability, and dipole moment, the following quantum properties were also predicted: heat capacity, largest vibrational frequency, energies (HOMO, LUMO, electronic, dispersion, zero-point, enthalpy, entropy, and Gibbs), and thermodynamic correction (i.e. the difference between the Gibbs and potential energies). The latter correction, which is predicted with high accuracy (MAE = 1.06 mHa with u-NatQG), is relevant to the field of computational catalysis with TMCs, where it is often used to refine the energies.
Property | Baseline | u-NatQG | d-NatQG | |||
---|---|---|---|---|---|---|
MAE | r 2 | MAE | r 2 | MAE | r 2 | |
a Using linearly fitted atomic energy offsets. b At constant volume (i.e. Cv). c Difference between the Gibbs and potential energies. | ||||||
HOMO–LUMO gap | 8.33 | 0.835 | 6.04 | 0.910 | 7.19 | 0.877 |
Polarizability | 5.87 | 0.993 | 4.94 | 0.995 | 4.96 | 0.994 |
Dipole moment | 1.71 | 0.537 | 0.895 | 0.858 | 0.981 | 0.845 |
HOMO energy | 13.1 | 0.734 | 3.21 | 0.991 | 3.79 | 0.987 |
LUMO energy | 13.0 | 0.722 | 3.51 | 0.988 | 4.05 | 0.984 |
Electronic energya | 18.8 | 1.000 | 6.61 | 1.000 | 8.01 | 1.000 |
Dispersion energya | 1.72 | 0.993 | 1.45 | 0.995 | 1.44 | 0.995 |
Zero-point energya | 0.50 | 1.000 | 0.33 | 1.000 | 0.40 | 1.000 |
Enthalpy energya | 16.8 | 1.000 | 6.39 | 1.000 | 7.64 | 1.000 |
Heat capacityb | 0.25 | 1.000 | 0.18 | 1.000 | 0.22 | 1.000 |
Entropy energy | 2.34 | 0.994 | 1.95 | 0.996 | 2.07 | 0.995 |
Gibbs energya | 19.7 | 1.000 | 6.38 | 1.000 | 7.37 | 1.000 |
Thermodynamic correctionsc | 1.36 | 1.000 | 1.06 | 1.000 | 1.23 | 1.000 |
Largest vibrational freq. | 4.53 | 0.997 | 3.98 | 0.990 | 7.52 | 0.990 |
For all properties collected in Table 4, the NatQG MPNN models surpassed the accuracy of the baseline, showing the value of using the NBO data for attributing the graph nodes and edges. The only exception was the prediction of the largest vibrational frequency, for which the baseline was more accurate than d-NatQG but less accurate than u-NatQG. For some properties, including the zero-point and entropy energies, the baseline performed at a level similar to NatQG.
Interestingly, for the HOMO–LUMO gap, we observed the following changes in the performance of the model:
Another factor contributing to these observations can be the smaller difference between the input and the embedding dimensions, which is 90 for the u-NatQG representation and 123 for baseline – d.
In general, regardless of the property predicted, the models based on the undirected graphs outperformed the directed, which are also more computationally demanding because they contain more edges. The concatenation of the whole-graph attribute vector in the last layer of the MPNN⊕G model improved the results obtained with d-NatQG (Table 3). Further, for the training set, the best performance in the prediction of several properties was obtained with the directed graphs, which thus seem to have a lower generalization capacity (Tables S3 and S4†). However, in most cases, the MAE and r2 values obtained for both graph types were rather similar. The unusual topology of d-NatQG can exclude edges where chemical bonds are present (e.g. Pt–C bonds in Fig. 3), though it retains the fundamental interactions within TMCs (e.g. d → π* backdonation). The remarkable performance of d-NatQG in the GNN models shows the promise of directed graph representations expressing donor–acceptor interactions.
With the HyDGL program, the NatQG graphs can be easily built from NBO data, which is used to define both the topology and the attribute vectors. The graphs can be made either undirected (u-NatQG), like a conventional molecular graph, or directed (d-NatQG), for expressing donor–acceptor interactions. Both flavors are infused with electronic structure information that can be directly related to the textbook concepts used to rationalize the structure and reactivity of TMCs.
The NatQG graphs were used to optimize GNN models based on the MPNN and MXMNet architectures. These models predicted several quantum properties of TMCs with remarkable accuracy, including the HOMO–LUMO gap and the polarizability, outperforming other models based on different topologies (CRG graphs) and attributes (periodic table properties). Interestingly, numerical experiments showed that the electronic structure information boosted the models performance by an extent larger than the geometric information. Despite its unusual connectivity, the d-NatQG representation performed at a level similar to u-NatQG, showing the promise of directed donor–acceptor graphs in deep learning.
The results obtained with the NatQG GNNs will be a useful baseline for the development of machine learning models for complex molecular systems. These models can be also applied to the prediction of thermodynamic and kinetic parameters of chemical reactions catalyzed by TMCs https://doi.org/10.48550/arXiv.2011.14115. Further, the tmQMg dataset will be a valuable benchmark for future studies exploring deep graph learning for TMCs.
Footnote |
† Electronic supplementary information (ESI) available: Further information on the statistics of the tmQMg dataset and its outliers. Technical details of the GNN models, the baseline representation, and the linear fitting of the atomic energies used to predict energy targets. The error metrics obtained with the training dataset, the Python libraries used to develop the HyDGL code, and the computational details of the tmQMg dataset are also provided. See DOI: https://doi.org/10.1039/d2dd00129b |
This journal is © The Royal Society of Chemistry 2023 |