Alberto
Fabrizio
ab,
Andrea
Grisafi
cb,
Benjamin
Meyer
ab,
Michele
Ceriotti
cb and
Clemence
Corminboeuf
*ab
aLaboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland. E-mail: clemence.corminboeuf@epfl.ch
bNational Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
cLaboratory of Computational Science and Modeling, IMX, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
First published on 9th September 2019
Chemists continuously harvest the power of non-covalent interactions to control phenomena in both the micro- and macroscopic worlds. From the quantum chemical perspective, the strategies essentially rely upon an in-depth understanding of the physical origin of these interactions, the quantification of their magnitude and their visualization in real-space. The total electron density ρ(r) represents the simplest yet most comprehensive piece of information available for fully characterizing bonding patterns and non-covalent interactions. The charge density of a molecule can be computed by solving the Schrödinger equation, but this approach becomes rapidly demanding if the electron density has to be evaluated for thousands of different molecules or very large chemical systems, such as peptides and proteins. Here we present a transferable and scalable machine-learning model capable of predicting the total electron density directly from the atomic coordinates. The regression model is used to access qualitative and quantitative insights beyond the underlying ρ(r) in a diverse ensemble of sidechain–sidechain dimers extracted from the BioFragment database (BFDb). The transferability of the model to more complex chemical systems is demonstrated by predicting and analyzing the electron density of a collection of 8 polypeptides.
Properties that can be derived exactly from the electron density distribution include molecular and atomic electrostatic moments (e.g., charges, dipole, quadrupoles), electrostatic potentials and electrostatic interaction energies. Knowledge of these quantities is fundamental in diverse chemical applications, including the computation of the IR intensities,12 the identification of binding sites in host–guest compounds,13–15 and the exact treatment of electrostatics within molecular simulations.16 Moreover, analyzing the deformation of ρ(r) in the presence of an external field provides access to another set of fundamental properties, namely molecular static (hyper)polarizabilities and, thus, to the computation of Raman spectra17 and non-linear optical properties.18–21
The natural representation of the electron density in real space makes it especially suitable for accessing spatial information about structural and electronic molecular properties, including X-ray structure refinement22–27 and representations using scalar fields.6 Routinely used examples include the quantum theory of atoms in molecules (QTAIM),28,29 the density overlap region indicator (DORI),30 and the non-covalent interaction (NCI) index.31,32
ρ(r) is generally obtained by solving the electronic structure problem through ab initio computations. The main advantage of this approach is that it returns the variationally optimized electronic density for a given Hamiltonian. Yet, ab initio computations can become increasingly burdensome if ρ(r) has to be evaluated for thousands of different molecules or very large chemical systems, such as peptides and proteins. These large scale problems are typically tackled using a more scalable approach that consists of either using linear scaling techniques such as Mezey's molecular electron density LEGO assembler (MEDLA)33,34 and adjustable density matrix assembler (ADMA),35–37 as well as approaches based on localized molecular orbitals, such as ELMO.38–41 Another methodology belonging to this second category involves the use of experimental techniques, such as X-ray diffraction, to probe the electron density and subsequently reconstructing ρ(r) through multipolar models42–44 and pseudo-atomic libraries, such as ELMAM,45–48 ELMAM2,49,50 UBDB,51,52 Invarioms53 and SBFA.54 While successful, these two methodologies have intrinsic limits: the first is unable to capture the deformations of the charge density due to intermolecular interactions unless a suitable fragment is generated ad hoc, while the second relies on experimental data and is difficult to extend to thousands of different chemical systems at once. Recently, the development of several machine-learning models targeting the electron density has effectively established a third promising methodology, with the potential to overcome the limitations of the more traditional approaches.
The first machine-learning model of ρ(r) was developed on the basis of the Hohenberg–Kohn mapping between the nuclear potential and the electron density.55,56 Although successful, the choice of the nuclear potential as a representation of the different molecular conformations and the expansion of the electron density in an orthogonal plane-wave basis effectively constrained this landmark model to relatively small and rigid molecules with limited transferability to larger systems. Recently, we proposed an atom-centered, symmetry-adapted Gaussian process regression57 (SA-GPR) framework explicitly targeting the learning of the electron density.58 Using an optimized non-orthogonal basis set, pseudo-valence electron densities could be predicted in a linear-scaling and transferable manner, meaning that the model is able to tackle much larger chemical systems than those used to train the regression model. A third approach, that can also achieve transferability between different systems, uses a direct grid-based representation of the atomic environment to learn and predict the electronic density in each point of the molecular space.59–61 Representing the density field on a large set of grids points rather than on a basis set effectively avoids the introduction of a basis set error, but also dramatically increases the computational effort.
One should also consider that machine learning, being a data-driven approach, requires high-quality, diverse reference data. Fortunately, several specialized benchmark databases that target NCIs have appeared over the past decade. From the original S22 (ref. 62) to NCIE53,63 S66,64 NBC10/NBC10ext,65–67 and S12L,68,69 the evolution of these datasets has, generally, followed a prescription of increasing the number of entries, principally by including subtler interactions and/or larger systems. In this respect, the databases of Friesner,70 Head-Gordon,71 Shaw,72 and the recent BFDb of Sherrill,73 constitute a special category because of their exceptional size (reaching thousands of entries) which are now sufficiently large to be compatible with machine-learning applications. Beyond their conceptual differences, each of these benchmark sets aims at improving the capability of electronic structure methods to describe the energetic aspects of non-covalent interactions.
In this work, we introduce a dramatic improvement of our previous density-learning approach by making the regression machinery of ρ(r) compatible with density-fitting auxiliary basis sets. These specialized basis sets are routinely used in quantum chemistry to approximate two-center one-electron densities. Here, the auxiliary basis sets are used directly to represent the electron densities that enter our machine-learning model, with the additional advantage of avoiding the arbitrary basis set optimization procedures on the machine-learning side. This enhanced framework leverages the transferability of our symmetry-adapted regression method and is capable of learning the all-electron density across a vast spectrum of 2291 chemically diverse dimers formed by sidechain–sidechain interactions extracted from the BioFragment Database (BFDb).73 The performance of the method is demonstrated through the reproduction of ρ(r) between and within each monomer forming the dimers. The accuracy of the predicted densities is assessed by computing density-based scalar fields and electrostatic potentials, while the errors made with respect to the reference densities are computed by direct integration on three-dimensional grids. As a major breakthrough, the model is used to predict the charge density of a set of 8 polypeptides (∼100 atoms) at DFT accuracy in few minutes.
The decomposition of the electron density in continuous atom-centered basis functions is the cornerstone of the scalability and transferability of our SA-GPR model. Besides being generally desirable, these properties are actually crucial to accurately describe the chemical diversity present in the BioFragment database within a reasonable computational cost. On the other hand, the projection of the density field onto a basis set leads to an additional error on top of that which can be ascribed to machine learning. In practice, all the efforts placed into achieving a negligible machine-learning error are futile if the overall accuracy of the model is dictated by a large basis set decomposition error.
Standard quantum chemical basis sets are generally optimized to closely reproduce the behavior of atomic orbitals75 and results in unacceptable errors if used to decompose the electronic density (Fig. 1). In contrast, specialized basis sets used in the density fitting approximation (also known as resolution-of-the-identity (RI) approximation)76–82 are specifically optimized to represent a linear expansion of one-electron charge densities obtained from the product of atomic orbitals. Using the RI-auxiliary basis sets {ϕRIk}, the total electron density field can be expressed as:
(1) |
As shown in Fig. 1, the use of the RI-auxiliary basis sets results in nearly two orders of magnitude increase in the overall accuracy with respect to the corresponding standard basis set. The addition of diffuse functions marginally improves the performance of the decomposition, but leads to instabilities of the overlap matrix (high condition number) and increases dramatically the number of basis functions per atom.
In practice, Weigend's cc-pVQZ/JKFIT81 basis set (henceforth: cc-pVQZ-RI) offers the best trade-off between accuracy and computational demand and therefore represents the best choice for the density decomposition.
As shown in Fig. 2, the complete set of 2291 dimers spans a large variety of dominant interaction types, ranging from purely dispersion dominated complexes (in blue) to mixed-influence (green and yellow) to hydrogen-bonded and charged systems (red). We retain the same classification criteria as in the original database to attribute the nature of the dominant interaction.
Fig. 2 Ternary diagram representation of the attractive components of the dimer interaction energies for the 2291 systems considered in this work. The values of the SAPT analysis are taken from ref. 73. |
For each dimer, the reference full-electron density has been computed at the ωB97X-D/cc-pVQZ level using the resolution of identity approximation for the Coulomb and exchange potential (RI-JK). This implies that RI-auxiliary functions up to l = 5 are included for carbon, nitrogen and oxygen atoms while auxiliary functions up to l = 4 are used for hydrogen atoms.
(2) |
Fig. 3 Learning curves with respect to RI-expanded densities (ML error). (left) weighted mean absolute percentage error (ερ (%)) of the predicted SA-GPR densities as a function of the number of training dimers. The weights correspond to the number of electrons in each dimer and the normalization is defined by the total number of electrons. Color code reflects the number of reference environments. (right) ερ (%) of the predicted SA-GPR densities (M = 1000) divided per dominant contribution to the interaction energy according to ref. 73. |
As shown in the first panel of Fig. 3, 100 training dimers were sufficient to reach saturation of the density error around 0.5% for M = 100. This result already outperforms the level of accuracy reached in our previous work, which is remarkable given the large chemical diversity of the dataset and the consideration of all-electron densities. Learning curves obtained with M = 500 and M = 1000 show steeper slopes, approaching saturation at about 2000 training dimers with errors that were reduced to ∼0.2–0.3%. The predicted full-electron densities are five times more accurate than the previous predictions of valence-only densities (approximately 1%).58 A more detailed analysis of the M = 1000 learning curve reveals a strong dependence on the nature of the dominant interaction (Fig. 3). Specifically, a stronger non-local character in the interaction yields a larger error. This is especially prevalent for dimers dominated by electrostatic interactions (i.e., hydrogen bonds, charged systems), which are characterized by errors that are twice as large as those found in other regimes.
The origin of this slow convergence arises from two factors. First, only about 20% of the dimers are dominantly bound by electrostatics.73 The priority of the regression model is thus to minimize the error on the other classes. Second, there is a fundamental dichotomy between the local nature of our symmetry-adapted learning scheme and the long-range nature of the interactions. In fact, the electron density encodes information about the whole chemical system at once, while the machine-learning model represents molecules as a collection of 4 Å wide atom-centered environments. This difference in the spatial reach of the information encoded in the target and in the representation is a limitation. In this respect, a global molecular representation, which includes the whole chemical system, would be more suitable, but this would imply renouncing to the scalability and transferability of the model. Given a large enough training set, however, our SA-GPR model is able to capture the density deformations due to the field generated by the neighboring molecule. The reason is rooted in the intrinsic locality of density deformations and in the concept of “nearsightedness”83,84 of all local electronic properties, which constitutes a theoretical justification for a local decomposition of such quantities.
The fundamental advantage of setting the electron density as the machine-learning target is the broad spectrum of chemical properties that are directly derivable from ρ(r). For instance, the predicted charge densities are the key ingredient in density-dependent scalar fields aimed at visualizing and characterizing interactions between atoms and molecules in real space. Examples of the density overlap region indicator (DORI)30 are given in Fig. 4 for representative dimers. Compared to the rather featureless ρ(r), DORI reveals fine details of the electronic structure, which constitute a more sensitive probe for the quality of the machine-learning predictions. In particular, it reveals density overlaps (or clashes) associated with bonding and non-covalent regions on equal footing through the behavior of the local wave-vector (∇ρ(r)/ρ(r)).85–87
Fig. 4 DORI maps of representative dimers for each type of dominant interaction (DORI isovalue: 0.9). Isosurfaces are color-coded31 with sgn(λ2)ρ(r) in the range from attractive −0.02 a.u. (red) to repulsive 0.02 a.u. (blue). In particular, sgn(λ2)ρ(r) < 0 characterizes covalent bonds or strongly attractive NCIs (e.g. H-bonds); sgn(λ2)ρ(r) ∼ 0 indicates weak attractive interactions (van der Waals); sgn(λ2)ρ(r) > 0 repulsive NCIs (e.g. steric clashes). |
As shown in Fig. 4, the intra- and intermolecular DORI domains obtained with the SA-GPR densities are indistinguishable from those in the ab initio maps. This performance is especially impressive for the density clashes associated with low-density values, as is typical for the non-covalent domains. All the features are well captured by the predicted densities ranging from large and delocalized basins typical of the van der Waals complexes (in green) to the compact and directional domains typical of electrostatic interactions to intramolecular steric clashes (e.g. phenol, mixed regime). A quantitative measure of the DORI accuracy for the most characteristic basin of each type of interaction is reported in the ESI.† Overall, these results illustrate that the residual 0.2% mean absolute percentage error does not significantly affect the density amplitude in the valence and intermolecular regions that are accurately described by the SA-GPR model. The highest amplitude errors are concentrated near the nuclei in the region dominated by the core-density fluctuations.
The versatility of the machine-learning prediction is further illustrated by using the predicted densities to compute the molecular electrostatic potential (ESP) for the same representative dimers (Fig. 5). ESP maps based on predicted densities agree quantitatively with the ab initio reference and correctly attribute the sign and magnitude of the electrostatic potential in all regions of space. Importantly, the accuracy of the ESP magnitude remains largely independent of the dominant interaction type. This is especially relevant for charged dimers (electrostatics) as it demonstrates that despite slower convergence of the learning curve for this category, the achieved accuracy of the model is sufficient to describe the key features of the electrostatic potential.
Fig. 5 Electrostatic potential (ESP) maps of representative dimers for each type of dominant interaction (density isovalue: 0.05 e− bohr−3). ESP potential is given in Hartree atomic units (a.u.). |
The most widespread applications of ESP maps exploit qualitative information (e.g., identification of the molecular regions most prone to electrophilic/nucleophilic attack) but the electrostatic potentials can be related to quantitative properties such as the degree of acidity of hydrogen bonds and the magnitude of binding energies.88–92 As a concrete example related to structure-based drug design, we used a recent model that estimates the strength of the stacking interactions between heterocycles and aromatic amino acid side-chains directly from the ESP maps.88,91,93 This model derives the stacking energies of drug-like heterocycles from the maximum and mean value of their ESP within a surface delimited by molecular van der Waals volume (at 3.25 Å above the molecular plane).88 Following this procedure, we used the ESP derived from the ML predicted densities to compute the binding energies between a representative heterocycle included in our dataset, the tryptophan side-chain, and the three aromatic amino acid side-chains (Fig. 6).
Fig. 6 (left) Electrostatic potential maps 3.25 Å above the plane of the tryptophan (TRP) side-chain. The van der Waals volume of TRP is represented in transparency. The color code represents the electrostatic potential in kcal mol−1 according the scale chosen in ref. 88. (Right) Stacking interaction energies of TRP with the phenylalanine (PHE), tyrosin (TYR) and tryptophan (TRP) side-chains computed as detailed in ref. 88 on the basis of ab initio (top) and ML-predicted (bottom) ESP. |
Comparison between ab initio and ML predicted stacking interaction energies shows that the deviations in the ESP maps lead to minor errors on the order of 0.05 kcal mol−1. The largest deviations in the ESP would appear further away from the molecule, beyond the region exploited for the computation of the energy descriptors (i.e., the sum of the atomic van der Waals radii). The predicted ESP shows larger relative deviations far from the nuclei owing to the error propagation of the density predictions ρ(r) to the electrostatic potential ϕ(r). This can be best understood in the reciprocal space, where the deviations of the potential at a given wave-vector k are related to the density error by δ(k) = 4πδ(k)/k2. Because of the k−2 scaling, the error on ϕ(k) increases as k → 0, implying that larger relative errors of the electrostatic potential are expected in regions of space where ϕ(r) is slowly varying (i.e., thus determined by the long-wavelength components).
Overall, the predictions lead to a low average error of only 1.5% for the 8 polypeptides, which is in line with the highest density errors obtained on the BFDb test set. Relevantly, the largest discrepancies are obtained for 3WNE, which is the only cyclopeptide of the set. The origin of these differences can be understood by performing a more detailed analysis of a representative polypeptide, the leu-enkephalin (4OLR). The errors in this percentage range do not affect the density-based properties, such as the spatial analysis of the non-covalent interactions with scalar fields (Fig. 7 top right panel). Yet, the density differences indicate that the highest absolute errors occur along the amino acid backbone (Fig. 7 lower panels). In addition, the analysis of the relative error with the Walker–Mezey L(a,a′) index34 shows the highest similarity at the core (99.3%), slowly decreasing while approaching the non-covalent domain (96.3%) (Fig. 7 top left panel). The L(a,a′) index complements the density difference information by showing that the actual density amplitudes and the prediction error do not decrease at the same rate. Nevertheless, the loss of relative accuracy remains modest and the quality of the density is mainly governed by the predictions along the peptide backbone, which are especially sensitive for the more strained 3WNE cyclopeptide. Although similar chemical environments were included in the training set, the error is mainly determined by the lack of an explicit peptide bond motif and cyclopeptides in the training set. While this limitation could be addressed by ad hoc modification of the training set, the overall performance of the machine-learning model is rather exceptional as it provides in only a few minutes, instead of almost a day (about 500 times faster for e.g. enkephalin with the same functional and basis set), electron densities of DFT quality for large and complex molecular systems. For comparison, the superposition of atomic densities (i.e., the promolecular approach), which has been used to qualitatively analyze non-covalent interactions in peptides and proteins (e.g.ref. 32) lead to much larger mean absolute percentage errors (17 times higher, see Fig. S1 in the ESI†).
Fig. 8 Weighted mean absolute percentage error (ερ (%)) with respect to ωB97X-D/cc-pVQZ densities of the predicted densities extrapolated for 8 biologically relevant peptides (protein databank ID). |
The model reaches a 0.3% accuracy on a validation set, that is sufficient to investigate density-based fingerprints of NCIs, and to evaluate the electrostatic potential with sufficient accuracy to quantitatively estimate residue–residue interactions. The transferability of the model is demonstrated by predicting, at a cost that is orders of magnitude smaller than by explicit electronic structure calculations, the electron density for a demonstrative set of oligopeptides, with an accuracy sufficient to reliably visualize bonding patterns and non-covalent domains using the DORI scalar field. Even though the model reaches an impressive accuracy (0.5% mean absolute percentage error) for dimers that are predominantly bound by electrostatic interactions, the comparatively larger error suggests that future work should focus on resolving the dichotomy between the local machine learning framework and the long-range nature of the intermolecular interactions.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c9sc02696g |
This journal is © The Royal Society of Chemistry 2019 |