Alberto Fabrizio,‡ab Ksenia R. Briling‡a and Clemence Corminboeuf*ab
aLaboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland. E-mail: clemence.corminboeuf@epfl.ch
bNational Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
First published on 4th April 2022
Physics-inspired molecular representations are the cornerstone of similarity-based learning applied to chemical problems. Despite their conceptual and mathematical diversity, this class of descriptors shares a common underlying philosophy: they all rely on the molecular information that determines the form of the electronic Schrödinger equation. Existing representations take the most varied forms, from non-linear functions of atom types and positions to atom densities and potentials, up to complex quantum chemical objects directly injected into the ML architecture. In this work, we present the spectrum of approximated Hamiltonian matrices (SPAHM) as an alternative pathway to construct quantum machine learning representations by leveraging the foundation of the electronic Schrödinger equation itself: the electronic Hamiltonian. As the Hamiltonian encodes all quantum chemical information at once, SPAHM representations not only distinguish different molecules and conformations, but also different spin, charge, and electronic states. As a proof of concept, we focus here on efficient SPAHM representations built from the eigenvalues of a hierarchy of well-established and readily-evaluated “guess” Hamiltonians. These SPAHM representations are particularly compact and efficient for kernel evaluation, and their complexity is independent of the number of different atom types in the database.
The crucial role of representations is mirrored by the intensive work that has been dedicated to finding ever more reliable and widely applicable fingerprints.7,8 Although there are effectively infinite ways to input the information about a molecule into a machine learning algorithm, molecular representations can conceptually be subdivided into well-defined macro categories. Chemoinformatics descriptors are a comprehensive set of fingerprints that rely either on string-based fingerprints, such as SMILES9,10 and SELFIES,11 or on readily available and descriptive properties, such as the number of aromatic carbon atoms, the shape index of a molecule, and its size,12–16 which are usually chosen using a priori knowledge about their correlation with the specific target.17 A second class of chemical representations has been introduced very recently, relying on artificial neural networks (and no human input) to infer suitable descriptors for the learning exercise.18 Finally, physics-based or quantum machine learning (QML) representations include all those fingerprints inspired by the fundamental laws of physics that govern molecular systems, in particular, the laws of quantum mechanics and the basic laws of symmetry.
As physics-based representations are rooted in fundamental laws, they are directly applicable to any learning task, ranging from the regression of molecular properties to revealing the relationship between molecules in large chemical databases. Although existing quantum machine learning representations have drastically different mathematical forms and physical motivations, they all share the same starting point: the position (and often the type) of the atoms in real space. This choice is not arbitrary and it is intimately related to the connection between (static) molecular properties and the electronic Hamiltonian Ĥ.
For a fixed nuclear configuration, the information about all the electronic properties of a molecule is contained in the many-body electronic wavefunction Ψ(x1, …, xn), as defined by the Schrödinger equation. Since the electronic Hamiltonian defines Ψ(x1, …, xn), the molecular information necessary to fix Ĥ is in principle sufficient for a non-linear model to establish a one-to-one relationship with any electronic property. The expression for all the universal (i.e. non-molecule specific) terms of the Hamiltonian (e.g. kinetic energy) only requires the knowledge of the total number of electrons (N). In contrast, the external potential (the electron-nuclear attraction potential) also depends on the position of the nuclei {RI} and their charges {ZI}.19 Under the assumption of charge neutrality (i.e. N = ∑I ZI), RI and ZI uniquely fix the form of the Hamiltonian and thus represent the only required information to characterize the electronic wavefunction and electronic properties.
Since no two different molecules have the same Hamiltonian, any representation that relies upon RI and ZI is guaranteed to satisfy the injectivity requirement of machine learning, i.e. there must be a one-to-one map between the representation of a molecule and its properties. Nonetheless, injectivity is not the only condition necessary for efficient and transferable learning. A representation must encode the same symmetries as the target property upon any transformation of real-space coordinates (equivariance): rotation, reflection, translation, and permutation of atoms of the same species.6,20,21
To organize all the existing physics-based representations into separate groups, it is essential to define a classification metric. Among all the possibilities, it is useful for the purpose of this work to classify representations according to the way they use and transform their molecular inputs.
One well-established methodology is to build representations using atom-centered continuous basis functions from an input containing the type and the position of the nuclei. This choice is the common denominator of a series of representations such as the Behler–Parrinello symmetry functions,22–24 the smooth overlap of atomic positions (SOAP),6,21 the overlap fingerprint of Goedecker and coworkers,25 the N-body iterative contraction of equivariant features (NICE),26 and the atomic cluster expansion (ACE).27–29
Other descriptors such as the many-body tensor representation (MBTR),30 permutation invariant polynomials (PIPs),31–35 and graph-based representations36 rely on the transformation of the structural input into a system of internal coordinates and use this information directly to establish similarity measures.
A third possibility is to build representations as fingerprints of potentials. This family includes the Coulomb matrix (CM),37,38 the bag of bonds (BoB),39 (atomic) spectrum of London and Axilrod–Teller–Muto potential [(a)SLATM],40 the long-distance equivariant (LODE) representation,41 FCHL18,42 and FCHL19.43
More recently, sophisticated neural network architectures, such as OrbNet,44,45 have shown that it is possible to use even more complex quantum chemical objects as input features, such as the tensor representation of quantum mechanical operators and their expectation values obtained from a converged semi-empirical computation.
In this work, we propose a different approach to designing physically motivated and efficient QML representations. The spectrum of approximated Hamiltonian matrices (SPAHM) looks back at the common origin of physics-based representations and uses the electronic Hamiltonian as the central ingredient to generate an input for machine learning algorithms. In contrast to standard geometry-based descriptors, the Hamiltonian encodes all the relevant quantum chemical information at once and is able to distinguish not only different molecules and conformations, but also different spin, charge, and electronic states. Importantly, SPAHM representations do not require any self-consistent field (SCF) computation, as they all leverage the simplest, yet powerful, quantum chemical trick: the use of well-established, low-cost “guess” Hamiltonians, which are traditionally used to jump-start the SCF procedure. These matrices are cheaper to compute than a single SCF iteration (see Section 3.5 for more details on efficiency) and form a controlled hierarchy of increasing complexity and accuracy that is readily computed for any given molecular system. As a proof of concept, we focus in this work on SPAHM representations built from the eigenvalues of the “guess” Hamiltonians. This choice is physically and chemically motivated, naturally invariant under the basic symmetries of physics, and, in contrast to existing QML representations, seamlessly includes the information about the number of electrons and the spin state of a molecule. In addition, the choice of eigenvalues results in a small-size representation, which is particularly efficient for kernel construction and rather independent of the degree of chemical complexity in the databases. Eigenvalue-based SPAHMs are global representations, which have the benefit of being rather accurate for molecular properties, but are not as transferable as local representations and are not applicable to regress local (atomic) targets (e.g. atomic partial charges).6 Nonetheless, the SPAHM representations are not restricted to eigenvalues and could be constructed from other (atom-centered) properties, such as the “guess” Hamiltonian matrix elements, the eigenvectors, and their corresponding density matrices.
The initial guesses were obtained in a minimal basis (MINAO46). All quantum chemical computations were performed with a locally modified version of PySCF.54,55 The codes used in this paper are provided in a GitHub repository at https://github.com/lcmd-epfl/SPAHM and are included in a more comprehensive package called Q-stack (https://github.com/lcmd-epfl/Q-stack). Q-stack is a library and a collection of stand-alone codes, mainly programmed in Python, that provides custom quantum chemistry operations to promote quantum machine learning. The data and the model that support the findings of this study are freely available in Materials Cloud at https://archive.materialscloud.org/record/2021.221 (https://doi.org/10.24435/materialscloud:js-pz).
The CPU timings were recorded on 24-core CPU servers (2x Intel Xeon CPU E5-2650 v4@2.20 GHz), using one thread. The code was run with the packages from anaconda-5.2.0 (python-3.6) together with numpy-1.16.4, pyscf-2.0.0a, and qml-0.4.0. The user time was measured with the getrusage() system call and averaged over eight runs.
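The user-time measurement described above can be sketched with the Python standard library, which exposes the getrusage() system call through the resource module (Unix only). This is a minimal illustration, not the timing harness actually used in the paper; the workload and function names are hypothetical.

```python
import resource

def user_time() -> float:
    """User-mode CPU time (seconds) of the current process via getrusage()."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_utime

def timed(fn, repeats=8):
    """Average user time of fn over several runs, as done for the paper's timings."""
    t0 = user_time()
    for _ in range(repeats):
        fn()
    return (user_time() - t0) / repeats

# Hypothetical workload standing in for representation/kernel construction.
avg = timed(lambda: sum(i * i for i in range(10_000)))
```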
The QM7 dataset37 was randomly divided into a training set of 5732 molecules and a test set containing the remaining 1433 compounds, corresponding to an 80–20% split. For each molecule, we constructed the SPAHM representations by diagonalizing the different “guess” Hamiltonians in a minimal basis set (MINAO46) and using the sorted occupied eigenvalues as the KRR fingerprint. The occupied orbital energies carry information about both the atom types (core-orbital eigenvalues), the general electronic structure of the molecule (core and valence), and the total number of electrons. In addition, the eigenvalues of a Hamiltonian are naturally invariant under all real-space transformations (permutation, translation, rotation) and the size of the occupied set is independent of the choice of the atomic orbital basis.
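The fingerprint construction above can be sketched with a few lines of numpy. This is a toy stand-in, not the paper's implementation: the real workflow evaluates a guess Hamiltonian with PySCF in the MINAO basis and solves the generalized eigenproblem with the overlap matrix, whereas here a small symmetric matrix with hypothetical elements plays the role of the guess, and an orthonormal basis is assumed.

```python
import numpy as np

def spahm_fingerprint(H, n_elec):
    """Sorted occupied eigenvalues of a (toy) guess Hamiltonian, closed shell.

    Assumes an orthonormal basis; with a real atomic-orbital basis one would
    solve the generalized eigenproblem H C = S C eps instead."""
    eps = np.linalg.eigh(H)[0]      # eigenvalues, ascending order
    return eps[: n_elec // 2]       # keep only the occupied set

# Toy 4-orbital Hamiltonian with hypothetical matrix elements.
H = np.array([[-1.0, -0.2,  0.0,  0.0],
              [-0.2, -0.5, -0.2,  0.0],
              [ 0.0, -0.2, -0.5, -0.2],
              [ 0.0,  0.0, -0.2, -1.0]])
fp = spahm_fingerprint(H, n_elec=4)  # two occupied eigenvalues
```

Relabeling the basis functions (a permutation of atoms of the same species) leaves the spectrum unchanged, which is the invariance property exploited by the representation.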
The learning curves for all the SPAHM representations are reported in Fig. 1. In addition, we report the curves of the original (eigenvalue) version of the Coulomb matrix (CM)37 and of SLATM,40 as the former has a similar size and diagonalization philosophy to SPAHM and the latter is an example of a widely-used global representation.
Fig. 1 (Left) Learning curves in logarithmic scale of (a) atomization energies, (b) dipole moments, (c) HOMO energies, and (d) HOMO–LUMO gaps. The color code reflects the different representations. (Right) Illustrative example of the sizes of the CM, SPAHM, and SLATM representations. All the Hamiltonians were evaluated in the MINAO46 minimal basis.
In Fig. 1, the SPAHM representations are indicated by the type of approximate Hamiltonian used for their construction. Although it is rather complex to establish a definitive hierarchy of self-consistent-field guesses, it is always possible to provide a more qualitative trend based on the amount of physics that each guess includes. The core (Hcore) and the generalized Wolfsberg–Helmholz (GWH)58 Hamiltonian matrices are the simplest approximations, as they do not try to model any two-electron term. Building on the GWH guess, the parameter-free extended Hückel method uses approximate ionization potentials as diagonal terms and is generally more robust.59,60 The superposition of atomic densities (SAD)61–63 is another popular choice, which however only produces a density matrix (DM). Nonetheless, it is rather straightforward to construct a complete Hamiltonian matrix (including the one- and two-electron terms) by contracting the SAD density matrix with a potential of choice (Hartree–Fock or any exchange-correlation density functional). We report the SAD learning curve in Fig. 1 using the PBE0 potential, as all the properties were computed with this functional. Finally, the superposition of atomic potentials (SAP)60,64 and the Laikov–Briling (LB)65 guesses use effective one-electron potentials to construct sophisticated, yet computationally lightweight, guess Hamiltonians.
Besides the internal hierarchy, the accuracy of all the SPAHMs always lies between that of SLATM and that of the Coulomb matrix eigenvalues. While SLATM consistently outperforms the SPAHM representations, the difference from the most robust guesses (LB and SAD) is usually much smaller than the accuracy of the functional itself (∼5 kcal mol−1).66 Importantly, SLATM is also three orders of magnitude larger than SPAHM on QM7. The significant difference in the size of the representation is crucial from an efficiency perspective, as the number of features dictates the computational effort of constructing the kernel matrix for an equal size of the training set. While SPAHM is lightweight, efficient, and naturally accounts for the charge state of molecules, we only tested a few well-known Hamiltonians for building it. As the performance of SPAHM is largely independent of the choice of the basis set or potential (see ESI†), it is necessary to consider alternative strategies to improve its accuracy. The heavy dependence of the learning on the quality of the parent Hamiltonian suggests that the construction of “better guesses” is the correct direction. Nonetheless, as discussed in Section 3.2, “better guesses” do not necessarily mean improved quantum chemical approximate Hamiltonians (i.e. closer to the converged Fock matrix), but rather simpler, systematic Hamiltonians specifically optimized for the learning task.
Besides the rather simple organic molecules of QM7, we tested the accuracy of the best-performing SPAHM representation (LB) on larger molecules, transition metal complexes, and conformers. Even in these more challenging chemical situations, SPAHM shows the same relative performance with respect to existing representations as in QM7 (see ESI, Section II†).
Fig. 2 Learning curves (in logarithmic scale) of atomization energies for SPAHM based on the converged PBE0 Fock matrix and on the LB Hamiltonian, with an increasing (pseudo-)random perturbation added to the representation vector. The maximum magnitude of the perturbation is indicated in the legend. For comparison, a learning curve for a fully-random representation is shown. All the Hamiltonians (including converged PBE0) were evaluated in the MINAO46 minimal basis.
The performance of the converged Fock matrix versus more approximate (and computationally cheaper) Hamiltonians shows that “more physics” is not necessarily the key to better learning. Yet, the relative ordering of the guess Hamiltonians suggests that higher-quality potentials correlate with the best representations. The question of the relevance of the physics can be generalized: one could ask whether any physics is needed at all, or whether random featurization could lead to the same (or better) results.67 In addition, SPAHM representations are so small (33 features for QM7) with respect to the size of the dataset (7165 molecules) that the learning could be the result of random correlations between the features and the target properties. Overall, it is still unclear whether any random (i.e. not physically motivated) perturbation of SPAHM could lead to better learning. To analyze the behavior of our most robust representation upon random perturbation, we modify SPAHM-LB by adding an increasing (pseudo-)random perturbation sampled from a uniform distribution and test its accuracy on the QM7 database.
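The perturbation protocol can be sketched in a few lines of numpy. This is an illustration only: the array X is a random stand-in for the SPAHM-LB fingerprints, and we assume the uniform noise is drawn symmetrically from [-max_mag, max_mag]; the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 33))  # stand-in for 33-feature QM7 fingerprints

def perturb(X, max_mag, rng):
    """Add uniform noise of maximum magnitude max_mag to every feature."""
    return X + rng.uniform(-max_mag, max_mag, size=X.shape)

# Smallest perturbation level tested in Fig. 2.
X_noisy = perturb(X, 0.001, rng)
```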
Fig. 2 shows that, for the smallest perturbation tested (maximum magnitude 0.001), the original and the modified learning curves are almost indistinguishable, except for an insignificant difference due to the shuffling of the training set. As the allowed random noise increases, we observe a systematic and progressive worsening of the learning exercise towards the limit of physically meaningless random numbers. As the magnitude of the perturbation increases, the physics in the representation fades, the error increases, and the learning curves become flatter. This demonstrates that the performance of SPAHM is not just a consequence of a random correlation between the feature vectors and the properties.
Fig. 3 shows the learning curves for the SPAHMs based on the LB Hamiltonian with the core, valence, and full occupied eigenvalue sets taken as representation vectors. While the valence set results in consistently better learning than the core set, both are necessary to achieve the overall accuracy of SPAHM-LB, with the exception of the HOMO energies. For the HOMO eigenvalues, the valence set can be considered an alternative type of Δ-learning,68 where the approximate baseline (approximate HOMO energies) is the input of the kernel itself, rather than being corrected a posteriori. Importantly, core orbital energies alone are not sufficient to accurately regress the atomization energies, but the valence-set error is also twice as large as that of the full SPAHM. Therefore, while the information about the chemical bonding is more relevant for the general performance, the information about the number and the type of nuclei in the molecules is essential to improve learning.
Fig. 3 Learning curves (in logarithmic scale) for the SPAHMs based on the LB Hamiltonian with the core, valence, and full occupied eigenvalue sets used as representations. All the Hamiltonians were evaluated in the MINAO46 minimal basis.
To demonstrate the difference in performance between geometry- and Hamiltonian-based representations on more complex databases, we randomly selected one-half of the QM7 set (3600 molecules) and computed, at fixed geometries, the properties of the double cations (M++) and radical cations (M+˙). In this way, we constructed three additional sets: (a) neutral molecules and double cations (7200 molecules: 5760 in the training and 1440 in the test set); (b) neutral molecules and radical cations (7200 molecules: 5760 in the training and 1440 in the test set); and (c) neutral molecules, double cations, and radical cations (10800 molecules: 8640 in the training and 2160 in the test set). We set the learning task to predict the HOMO (SOMO for radicals) orbital energies and report the learning curves in Fig. 4. As expected, CM and SLATM fail the learning exercise and the curves are flat for all three sets.
SPAHM representations include the charge information on two separate levels. First, as we only include the occupied space, the length of the representation changes if we remove electrons. For instance, the SPAHM representations of neutral molecules M and their double cations M++ differ by one entry in length. Second, some of the approximate potentials at the origin of SPAHM, e.g. LB, SAP, and SAD, are sensitive to electronic information and result in different Hamiltonian matrices when the number of electrons differs. For instance, the LB Hamiltonian relies on a constraint that fixes the charge corresponding to the effective Coulomb potential (LBm in Fig. 4). While both LB (with no modified potential) and LBm learn, it is evident from Fig. 4 that SPAHM-LBm provides more robust predictions, resulting in errors one order of magnitude smaller than SPAHM-LB at small training set sizes.
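The first, structural level of charge sensitivity can be made concrete with a short sketch: removing two electrons shortens the occupied-eigenvalue fingerprint by one entry. The orbital energies below are hypothetical placeholders, not values from any actual guess Hamiltonian.

```python
import numpy as np

# Hypothetical occupied-space orbital energies of a small molecule.
eps = np.array([-10.2, -1.1, -0.7, -0.5, -0.3])

def occupied(eps, n_elec):
    """Occupied-eigenvalue fingerprint for an even number of electrons."""
    return np.sort(eps)[: n_elec // 2]

fp_neutral = occupied(eps, n_elec=10)   # neutral M: 5 entries
fp_dication = occupied(eps, n_elec=8)   # M++: one entry fewer
```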
The spin-state information is also readily included in SPAHMs by the same method used in traditional quantum chemistry: separating the α- and β-spaces and concatenating the α and β orbital energies in a matrix of size Nα × 2 (the β-orbital column is padded with zeros). Using the Laplacian kernel function, this choice ensures that for closed-shell molecules the similarity measures are the same as described above, where the fingerprint is a single vector of length N/2.
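The zero-padded, spin-resolved fingerprint described above can be sketched as follows. This is a minimal illustration with hypothetical orbital energies; the function name is not from the SPAHM codebase.

```python
import numpy as np

def spin_resolved_fp(eps_alpha, eps_beta):
    """Stack occupied alpha/beta eigenvalues into an (N_alpha, 2) array,
    zero-padding the beta column when N_beta < N_alpha (open shell)."""
    n_a, n_b = len(eps_alpha), len(eps_beta)
    fp = np.zeros((n_a, 2))
    fp[:, 0] = np.sort(eps_alpha)
    fp[:n_b, 1] = np.sort(eps_beta)
    return fp

# Radical cation: one fewer beta electron than alpha (hypothetical energies).
fp = spin_resolved_fp([-1.0, -0.5, -0.3], [-1.0, -0.5])
```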
Hamiltonian-based representations such as SPAHM outperform geometry-based representations in every database where electronic information is fundamental to distinguish molecules. The three sets proposed above are an example, but they are also quite peculiar since we do not allow geometries to relax. As a more realistic example of a database where electronic information is essential, we compare the overall performance of SPAHM (LBm Hamiltonian) and SLATM on the L11 set.50 L11 consists of 5579 small molecular systems (single atoms and atomic ions were excluded from the original set) characterized by a substantial diversity in terms of chemistry, charge, and spin states. From L11, 4463 molecules were randomly selected as the training set and the remaining 1116 as the test set. To assess the accuracy of the representation, we set the norm of the dipole moment (for charged systems the origin is chosen to be the geometric center) as the learning target and report the learning curves in Fig. 5. Since there are no identical geometries in the set, the injectivity rule is not violated for SLATM, and the representation learns. However, even with relaxed structures, geometry-based representations struggle with a database containing mixed electronic information, since they are not smooth with respect to the target property (similar geometries can correspond to significantly different values), and SPAHM, which seamlessly incorporates the electronic-state information, performs better.
Fig. 5 Learning curves (in logarithmic scale) of dipole moments for the SPAHM based on the LB Hamiltonian (LBm) compared to SLATM for the set of ref. 50 (L11).
For this reason, we report in Fig. 6 the CPU timing for SLATM and SPAHM-LBm on the QM7 and the L11 (ref. 50) databases (more details in the Computational methods). For both representations, we recorded the time for building the representation itself and the time for constructing the kernels.
Fig. 6 User times (in logarithmic scale) measured for computing the SLATM and SPAHM representations and the molecular kernels on (a) the QM7 dataset37 and (b) the database of ref. 50 (L11).
As SPAHM representations based on occupied orbital eigenvalues are particularly compact, SPAHM-LB is significantly more efficient than SLATM in the kernel construction for both databases. This result follows directly from the fact that, for a fixed dataset size, the complexity of the kernel construction depends only on the number of features (SLATM: 10808 features on QM7 and 249255 on L11; SPAHM: 33 on QM7 and 64 on L11).
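The scaling argument above can be illustrated with a minimal Laplacian-kernel sketch (the kernel function used with SPAHM in this work); the data and kernel width below are hypothetical.

```python
import numpy as np

def laplacian_kernel(X, Y, sigma):
    """K_ij = exp(-||x_i - y_j||_1 / sigma).

    The pairwise L1 distances dominate the cost, O(n_X * n_Y * n_features),
    so a 33-feature fingerprint is far cheaper than a ~10^4-feature one
    for the same training-set size."""
    d = np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=-1)
    return np.exp(-d / sigma)

# Hypothetical fingerprints: 5 molecules, 33 features each (as on QM7).
X = np.random.default_rng(1).normal(size=(5, 33))
K = laplacian_kernel(X, X, sigma=10.0)
```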
The relative efficiency of the construction of the representation itself is more sensitive to the dataset. To better understand the behavior of both SLATM and SPAHM-LBm, it is crucial to clarify the composition of each database. QM7 includes molecules with up to 7 non-hydrogen atoms and a limited chemical complexity (H, C, N, O, S). In contrast, L11 contains all the elements from hydrogen through bromine (noble gases excluded) and as such covers a significantly more diverse chemistry. SLATM is faster to evaluate than SPAHM-LB on the QM7 dataset, as the number of different many-body types is very limited. However, on more complex databases with diverse chemistry and atom types, representations like SLATM become heavy in both computer time and memory (the SLATM binary file occupies 11 GiB vs. 5.5 MiB for SPAHM). As the cost of the guess Hamiltonian does not depend on the number of different atom types, SPAHM representations are extremely efficient for chemically complex molecular databases.
Footnotes
† Electronic supplementary information (ESI) available. See https://doi.org/10.1039/d1dd00050k
‡ These authors contributed equally to this work. |
This journal is © The Royal Society of Chemistry 2022 |