Michael Tynes‡§*ab, Michael G. Taylora, Jan Janssena, Daniel J. Burrillab, Danny Pereza, Ping Yang*a and Nicholas Lubbers*c
aTheoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA. E-mail: mtynes@uchicago.edu; pyang@lanl.gov
bCenter for Nonlinear Studies, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
cComputer, Computational, and Statistical Sciences Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA. E-mail: nlubbers@lanl.gov
First published on 16th August 2024
Advances in machine learning have given rise to a plurality of data-driven methods for predicting chemical properties from molecular structure. For many decades, the cheminformatics field has relied heavily on structural fingerprinting, while in recent years much focus has shifted toward leveraging highly parameterized deep neural networks which usually maximize accuracy. Beyond accuracy, to be useful and trustworthy in scientific applications, machine learning techniques often need intuitive explanations for model predictions and uncertainty quantification techniques so a practitioner might know when a model is appropriate to apply to new data. Here we revisit graphlet histogram fingerprints and introduce several new elements. We show that linear models built on graphlet fingerprints attain accuracy that is competitive with the state of the art while retaining an explainability advantage over black-box approaches. We show how to produce precise explanations of predictions by exploiting the relationships between molecular graphlets and show that these explanations are consistent with chemical intuition, experimental measurements, and theoretical calculations. Finally, we show how to use the presence of unseen fragments in new molecules to adjust predictions and quantify uncertainty.
Model interpretability is often invaluable when ML is applied in the sciences both because it helps scientists understand when and how models make errors, thereby building trust in model predictions, and because it can help uncover trends in predictions, which can lead to enhanced scientific understanding. Methods for interpreting ML models in the context of chemistry and materials problems are reviewed thoroughly in ref. 18, 20 and we review key points here. In the ML interpretability literature, intrinsically interpretable models18,20 are those wherein the model functional form and mechanisms for prediction are transparent and meaningful. An often-cited example of such a model is linear regression, where each coefficient has a clear interpretation as the contribution of an individual feature to a prediction (even when these features may be difficult to interpret). In contrast, many ML models – e.g., deep neural networks (DNNs) – are too large or complicated to be interpreted directly and are thus referred to as black-box methods. Such methods can still be interpreted through post hoc procedures which are applied to models after training or to their predictions. Examples of such techniques include local linear surrogates which examine the contribution of input features near a given input point21 and similar local explanations called SHapley Additive exPlanations (SHAP) based on game-theory.22–24 DNNs are often interpreted by methods that examine the gradients of their predictions, e.g. through integrated gradients25 or the deep Taylor decomposition.26 Such post hoc methods have been applied to molecular property prediction and have been used to give explanations in terms of atoms,27 bonds,28 and higher-order molecular subgraphs.29
Some argue that because DNNs often provide the most accuracy across ML models and numerous post hoc DNN explanation methods exist, these methods should be favored over intrinsically explainable models, which are commonly thought to be comparatively weak predictors.18 However, this perceived accuracy-interpretability tradeoff is often exaggerated and is sometimes orthogonal to observed trends.19 Furthermore, many post hoc interpretability methods have theoretical weaknesses that are empirically borne out by counter-intuitive and untrustworthy explanations. For example, ref. 30 discusses that numerous post hoc methods are not locally Lipschitz continuous, meaning that extremely small perturbations in model input can yield large changes in explanations, a phenomenon which makes explanations appear inconsistent to a human observer. More recently, ref. 31 showed that post hoc methods based on the deep Taylor decomposition can be manipulated to produce arbitrary explanations. Such findings have led to calls to use simpler, explainable models when possible.19
Intrinsically explainable models often have comparable predictive power to black box models, especially when constructed carefully.19 This general observation has been demonstrated recently in a materials science context, where interpretable (multi-)linear models were applied to material property prediction problems and achieved accuracy close to state-of-the-art nonlinear approaches.32 One of these model families was constructed by counting atomic n-grams present in a crystal lattice unit cell and assigning coefficients to their presence, inspired by the cluster expansion. Here we develop an analogous representation for organic molecules which we call the atom-induced molecular subgraph, or graphlet, representation, wherein a molecule is represented by its constituent n-atom connected subgraphs. A similar representation was recently developed by ref. 9 and used to sample and characterize chemical spaces containing billions of molecules.33–35 Here we show that such a representation can be combined with linear models for competitive and interpretable prediction.
We construct accurate, explainable linear models which, inspired by the many-body expansion,36–38 approximate molecular properties as functions of molecular graphlets organized by their body-order, i.e., the number of atoms in each graphlet. We show that so-constructed linear models perform competitively with nonlinear black box models across a variety of structure–property prediction tasks. We then show that this approach can naturally be used to produce coherent, additive explanations by projecting higher-order model coefficients onto atoms or bonds within their molecular context, and empirically find correlation between these projections and both chemical intuition and chemical theory. Finally, we examine how graphlet train and test statistics can be used to estimate distribution shift39 and thereby quantify prediction uncertainty.
Graphlet fingerprints, like other fingerprint approaches, are built upon pre-defined type labels assigned to the atoms and bonds in a molecular graph. We label nodes by atomic species, formal charge, aromaticity, and valence. This implies that every node type has a precise degree which is constant across all instances in all molecular graphs. Edges are labeled according to bond type as either single, double, triple, or aromatic. As a result, the graphlet statistics form a typed version of dK degree statistics.41 This rich typing scheme helps to capture information more efficiently at lower maximum graphlet size; as an example, with only species-based typing, an N+ atom with 4 bonds is not distinguished from an N atom with 3 bonds until using a graphlet size of at least 5. Valence-based rich typing of atoms can resolve the difference between these atoms at a graphlet size of 1.
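As a concrete illustration of such a typing scheme (a sketch of ours using RDKit, not code from this work's implementation), node and edge labels might be assembled as follows:

```python
# A minimal sketch of rich atom/bond typing with RDKit. The exact label
# contents used by the paper's implementation may differ.
from rdkit import Chem

def atom_label(atom: Chem.Atom) -> tuple:
    """Node label: species, formal charge, aromaticity, and total valence
    (which distinguishes e.g. N+ with 4 bonds from N with 3 at size 1)."""
    return (atom.GetSymbol(),
            atom.GetFormalCharge(),
            atom.GetIsAromatic(),
            atom.GetTotalValence())

def bond_label(bond: Chem.Bond) -> str:
    """Edge label: SINGLE, DOUBLE, TRIPLE, or AROMATIC."""
    return str(bond.GetBondType())

mol = Chem.MolFromSmiles("c1ccccc1[N+](=O)[O-]")  # nitrobenzene
print([atom_label(a) for a in mol.GetAtoms()])
print([bond_label(b) for b in mol.GetBonds()])
```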
Graphlets are enumerated by a recursive algorithm similar to the explicit subgraph enumeration routine described in ref. 42 and 9, during which we identify and count membership in isomorphism classes through our hashing technique described in the next section.
To construct molecular graphlets, we consider the family of sets of atoms in a molecular graph M = (V, E) for which the induced subgraph M[S] is connected, denoted

$\mathcal{S} = \{S \subseteq V : M[S]\ \text{is connected}\}.$ (1)
Loosely speaking, one can think of $\mathcal{S}$ as playing the role of a cumulant expansion over the induced subgraphs formed by the power set of V. Molecular graphlets are the classes of isomorphic subgraphs present in $\mathcal{S}$. It will be useful to consider induced subgraphs restricted to a given number of atoms, N, denoted as

$\mathcal{S}_N = \{S \in \mathcal{S} : |S| = N\}.$ (2)
The graphlet fingerprint is the histogram $C_N$ of isomorphic induced subgraphs up to subgraph size N. With the Iverson brackets $[\cdot]$ representing the indicator function and $\mathcal{I}$ denoting the set of graph isomorphism classes, the components of $C_N$ are

$(C_N)_I = \sum_{n=1}^{N} \sum_{S \in \mathcal{S}_n} [\,M[S] \in I\,],\qquad I \in \mathcal{I}.$ (3)
To efficiently track graphlet isomorphism classes $I \in \mathcal{I}$, we build an integer-valued labeling (in other words, hashing) function H and produce counts of these labels. We construct a concrete histogram $c_N$ with components labeled h as

$(c_N)_h = \sum_{n=1}^{N} \sum_{S \in \mathcal{S}_n} [\,H(M[S]) = h\,].$ (4)
To build this hashing function, we require pre-defined atom labels, $h_{\mathrm{atom}}$, and pair labels, $h_{\mathrm{bond}}$ (atom and bond labels for the pair), as well as a labeling function $h_{\mathrm{rec}}$, which can identify a histogram $c_h$ by sorting the labels in the histogram and pairing them with the accompanying counts. To label graphlets of arbitrary size, we construct a recursive labeling function H with $h_{\mathrm{rec}}$, using $h_{\mathrm{atom}}$ and $h_{\mathrm{bond}}$ as base cases:

$H(g) = \begin{cases} h_{\mathrm{atom}}(g), & |g| = 1 \\ h_{\mathrm{bond}}(g), & |g| = 2 \\ h_{\mathrm{rec}}(c_g), & |g| > 2, \end{cases}$ (5)

where $c_g$ is the histogram of H-labels of the connected proper induced subgraphs of g.
In plain words, we label graphlet isomorphism classes by their own histograms of graphlets: triplets are labeled in terms of bonds and atoms, and four-vertex graphlets are labeled in terms of triplets, bonds, and atoms, etc. To our knowledge, this labeling scheme has not been presented in any prior work. Because it is recursive, it has the advantage of allowing memoization while building the graphlet set. Whether or not the concrete labeling function H is a faithful realization of the ideal isomorphism classes is a complex question related to the long-standing graph reconstruction conjecture,43–46 which is notably true for some particular classes of graphs, false for others, and not settled for a great many cases. For the molecules and subgraph sizes studied here we have found no counterexamples.
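To make the enumeration and labeling concrete, the following self-contained Python sketch (ours, not the library implementation) enumerates connected induced subgraphs on a toy adjacency-dict graph and labels them with the recursion of eqn (5); it uses Python's built-in hash for brevity, where a stable hash and memoization would be used in practice:

```python
from collections import Counter

def connected_subsets(adj, max_size):
    """Exhaustively enumerate connected node subsets of size <= max_size.
    `adj` maps node -> set of neighbor nodes."""
    found = {frozenset([v]) for v in adj}
    frontier = list(found)
    while frontier:
        nxt = []
        for s in frontier:
            if len(s) == max_size:
                continue
            for v in set().union(*(adj[u] for u in s)) - s:
                grown = s | {v}
                if grown not in found:
                    found.add(grown)
                    nxt.append(grown)
        frontier = nxt
    return found

def graphlet_hash(s, adj, atom_lab, bond_lab):
    """Recursive label (eqn (5)): atoms and bonds are base cases; larger
    graphlets are labeled by the sorted histogram of their sub-graphlet
    labels. A production version would memoize repeated subgraphlets."""
    s = frozenset(s)
    if len(s) == 1:
        (v,) = s
        return hash(("atom", atom_lab[v]))
    if len(s) == 2:
        u, v = sorted(s)
        return hash(("bond", bond_lab[frozenset((u, v))],
                     tuple(sorted((atom_lab[u], atom_lab[v])))))
    sub_adj = {v: adj[v] & s for v in s}          # induced subgraph on s
    hist = Counter(graphlet_hash(t, sub_adj, atom_lab, bond_lab)
                   for t in connected_subsets(sub_adj, len(s) - 1))
    return hash(tuple(sorted(hist.items())))

def fingerprint(adj, atom_lab, bond_lab, max_size):
    """Histogram c_N of graphlet labels over the molecule (eqn (4))."""
    return Counter(graphlet_hash(s, adj, atom_lab, bond_lab)
                   for s in connected_subsets(adj, max_size))

# Toy example: propane carbon skeleton C0-C1-C2 (hydrogens omitted)
adj = {0: {1}, 1: {0, 2}, 2: {1}}
atom_lab = {0: "C", 1: "C", 2: "C"}
bond_lab = {frozenset((0, 1)): "single", frozenset((1, 2)): "single"}
print(fingerprint(adj, atom_lab, bond_lab, max_size=3))
```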
We build regression models on graphlet histograms hierarchically by graphlet size, as a sum of constituent models $f_n$, each acting on the components $c_n$ of the fingerprint corresponding to graphlets of exactly n atoms,

$F_N(c_N) = \sum_{n=1}^{N} f_n(c_n),$ (6)

with partial predictions

$\hat{Y}_n = \sum_{m=1}^{n} f_m(c_m).$ (7)
Using $\hat{Y}_N = F_N(c_N)$ for arbitrary values of N, each constituent model $f_n$ is trained to minimize the same loss function $\mathcal{L}$ evaluated against the residual $y - \hat{Y}_{n-1}$, that is, to minimize, with $\hat{Y}_0 = 0$,

$f_n = \operatorname*{arg\,min}_{f}\ \mathcal{L}\big(y - \hat{Y}_{n-1},\ f(c_n)\big).$ (8)
Put less mathematically, using the graphlet approach, we can build a function up by first applying regression to the graphlet counts generated by atoms, and then to the graphlet counts generated by bonds, and then to the graphlet counts generated by connected triples, and so on, up to some graphlet size N, where the model at size n learns a correction to the model at size n – 1. This same hierarchical approach can be analogously applied to the other 2D graph fingerprints we examine. For path-based fingerprints, the hierarchical levels indexed by n correspond to the number of steps in the graph walks or, equivalently, the number of bonds. For circular fingerprints, the hierarchical levels n indicate the set of fragment features with radius equal to n.
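A minimal sketch of this residual-fitting loop, assuming the fingerprint matrix has been pre-split into per-size blocks (the variable X_by_size below is hypothetical) and using ridge regression for each constituent model:

```python
# Sketch of hierarchical fitting: each level's model fits the residual of
# the levels below it, as in eqn (8).
import numpy as np
from sklearn.linear_model import Ridge

def fit_hierarchical(X_by_size, y, alpha=1e-6):
    """X_by_size[n-1]: fingerprint columns for fragments of size n."""
    models, residual = [], np.asarray(y, dtype=float).copy()
    for X_n in X_by_size:                        # n = 1, ..., N in order
        f_n = Ridge(alpha=alpha).fit(X_n, residual)
        residual = residual - f_n.predict(X_n)   # y - Y_hat_n
        models.append(f_n)
    return models

def predict_hierarchical(models, X_by_size):
    """Y_hat_N: the sum of the per-size corrections."""
    return sum(f.predict(X_n) for f, X_n in zip(models, X_by_size))
```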
We consider the directed acyclic graph (DAG) of inclusion relationships between graphlets of varying size, defined as

$\mathcal{D}(M) = \{(S, S') : S, S' \in \mathcal{S},\ S \subset S'\},$ (9)

with adjacency matrix elements

$G_{gg'}(M) = [\,g \subset g'\,],\qquad g, g' \in \mathcal{S}.$ (10)
For brevity, we will omit the molecule M and write these matrix elements as $G_{gg'}$, but the matrix remains associated with a particular molecule.
We principally deal with the inclusions of size n graphlets within size n + 1 graphlets, which form an N-partite DAG with partitions for each graphlet size from n = 1, …, N. We refer to the partitions as levels. Much like in a feedforward neural network, each adjacent pair of levels is connected by a set of edges. These edges are a subset of those in G; the adjacency matrix Gn connecting nodes from level n + 1 to level n corresponds to
$(G_n)_{gg'} = G_{gg'},\qquad g \in \mathcal{S}_n,\ g' \in \mathcal{S}_{n+1}.$ (11)
We want our projections onto atoms or bonds to be additive, that is, to sum to the prediction of the linear model, so we want the transformation corresponding to each Gn to be sum conserving. To accomplish this, when projecting contributions from a size n + 1 graphlet to its size n graphlets, we evenly distribute this contribution across all of the size n graphlets. Mathematically, we can ensure this by normalizing the columns of Gn to sum to 1. We write the column-normalized adjacency matrix as Ĝn with columns defined as
$(\hat{G}_n)_{gg'} = (G_n)_{gg'} \Big/ \sum_{g''} (G_n)_{g''g'}.$ (12)
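In a dense numpy representation this normalization is a one-liner; the guard against empty columns in the sketch below is our addition:

```python
# Column-normalize G_n (eqn (12)) so each size-(n+1) graphlet distributes
# its contribution evenly over the size-n graphlets it contains.
import numpy as np

def column_normalize(G_n):
    col_sums = G_n.sum(axis=0, keepdims=True).astype(float)
    col_sums[col_sums == 0] = 1.0   # guard: leave empty columns untouched
    return G_n / col_sums
```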
To develop our projections, we must think carefully about our linear model form. A linear model with weights β acting on graphlet histogram cN to estimate y is written as,
$\hat{y} = \beta \cdot c_N.$ (13)
Equivalently, because each component of $c_N$ counts graphlet instances, the prediction can be written as a sum of coefficients over all graphlets present in the molecule,

$\hat{y} = \sum_{n=1}^{N} \sum_{S \in \mathcal{S}_n} \beta_{H(M[S])}.$ (14)
When projecting “downwards” from larger to smaller fragments, denote the projection value on a set of atoms S as $\alpha_S$. We write the vectors of these projection, α, and coefficient, β, values associated with all graphlets of size n in a molecule as $\alpha^{(n)}$ and $\beta^{(n)}$. With this notation, we define the projection from level n + 1 to n as

$\alpha^{(n)} = \beta^{(n)} + \hat{G}_n\,\alpha^{(n+1)},$ (15)

with the base case

$\alpha^{(N)} = \beta^{(N)}.$ (16)
Eqn (15) is sufficient to produce atom-level explanations by computing $\alpha^{(1)}$, i.e., by iterating the projection down to the atom level.
For bond-level explanations, we introduce the reverse “upwards” projection from level n – 1 to n. To do so, we reverse the direction of the edges in the DAGs described above. The edges are now weighted by the total “valence” (in graph terms, the total edge weight) of one graphlet within another. Loosely speaking, these valence weights are the counts of bonds subsumed in the larger graphlet. More formally, the matrix $K_n$ connecting level n − 1 to level n has elements

$(K_n)_{g'g} = [\,g \subset g'\,] \sum_{e \in E(g') \setminus E(g)} w_e,\qquad g \in \mathcal{S}_{n-1},\ g' \in \mathcal{S}_n,$ (17)

where $w_e$ is the bond order of edge e. We normalize $K_n$ so that each size n − 1 graphlet distributes its full value upward,

$(\hat{K}_n)_{g'g} = (K_n)_{g'g} \Big/ \sum_{g''} (K_n)_{g''g},$ (18)

and define the upward projection $\tilde{\alpha}^{(n)}$ recursively as

$\tilde{\alpha}^{(n)} = \hat{K}_n\big(\beta^{(n-1)} + \tilde{\alpha}^{(n-1)}\big),\qquad \tilde{\alpha}^{(1)} = 0.$ (19)
A diagram of this scheme for interpretability projections is shown in Fig. 2. As a concrete example of this definition, a 2-graphlet (bond) containing a carbon and nitrogen that are double-bonded to each other will receive 2/4 of the carbon-associated β (2 out of 4 bonds) and 2/3 of the nitrogen-associated β (2 out of 3 bonds) in an upward projection.
We combine the upward and downward projections at level n, denoted by $\chi^{(n)}$, as

$\chi^{(n)} = \alpha^{(n)} + \tilde{\alpha}^{(n)}.$ (20)
For any level n,

$\sum_{g \in \mathcal{S}_n} \chi^{(n)}_g = \hat{Y}.$ (21)
This equation represents the fact that our explanation vectors χ are additive in the sense that the sum of the explanation reproduces the prediction; the explanation breaks the prediction into component pieces. We primarily consider $\chi^{(1)}$, which corresponds to breaking up the prediction Ŷ into atom-centered terms, and $\chi^{(2)}$, corresponding to Ŷ decomposed into bond-centered terms, although one can compute $\chi^{(n)}$ for any n = 1, …, N.
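The following sketch implements the downward and upward recursions as reconstructed in eqn (15)–(20); the list-of-arrays layout and matrix conventions are our assumptions, with betas[k] holding the coefficient assigned to each size-(k+1) graphlet instance in the molecule:

```python
# Sketch of the interpretability projections (eqn (15)-(20)); assumptions,
# not the library's implementation.
import numpy as np

def downward(betas, G_hats):
    """G_hats[k]: column-normalized matrix mapping level k+2 -> k+1."""
    N = len(betas)
    alpha = [None] * N
    alpha[N - 1] = betas[N - 1]                          # eqn (16)
    for k in range(N - 2, -1, -1):
        alpha[k] = betas[k] + G_hats[k] @ alpha[k + 1]   # eqn (15)
    return alpha

def upward(betas, K_hats):
    """K_hats[k]: normalized valence-weighted matrix mapping level k+1 -> k+2
    (eqn (17) and (18))."""
    N = len(betas)
    alpha_t = [np.zeros_like(b) for b in betas]          # tilde-alpha^(1) = 0
    for k in range(1, N):
        alpha_t[k] = K_hats[k - 1] @ (betas[k - 1] + alpha_t[k - 1])  # eqn (19)
    return alpha_t

def explanation(betas, G_hats, K_hats, level):
    """chi^(level) of eqn (20); its entries sum to the model prediction."""
    a, at = downward(betas, G_hats), upward(betas, K_hats)
    return a[level - 1] + at[level - 1]
```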
We examine various methods of constructing uncertainty metrics based on unseen graphlets. While yet more approaches are possible, we explored using the total number of unseen graphlets, the fraction of unseen graphlets, and an auxiliary uncertainty regression model to weight the relative importance of unseen fragments of size s.
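The first two metrics reduce to simple counting over fragment histograms, as in the sketch below (the input layout is assumed); the auxiliary regression model is sketched further below:

```python
# Sketch of the counting-based unseen-fragment uncertainty metrics.
def unseen_metrics(test_counts, train_vocab):
    """test_counts: {fragment_hash: count} for a test molecule;
    train_vocab: set of fragment hashes observed during training."""
    total = sum(test_counts.values())
    unseen = sum(c for h, c in test_counts.items() if h not in train_vocab)
    return unseen, (unseen / total if total else 0.0)  # count, fraction
```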
We can exploit statistical information in the distribution of graphlet coefficients to adjust predictions when a test molecule has unseen graphlets. We examine this in the context of predicting energies, where each graphlet is associated with a coefficient that can be thought of as the energy contribution of that graphlet. Ignoring unseen graphlets present in a new molecule causes the magnitude of a molecule's energy to be mispredicted. We adjust for this bias by finding the mean coefficient $\bar{\beta}_s$ for each size s = 1, …, N and constructing a histogram $d_N$ counting all unseen fragments of size s, regardless of fragment structure. The adjusted prediction $\hat{y}_{\mathrm{adj}}$ is then written as

$\hat{y}_{\mathrm{adj}} = \beta \cdot c_N + \bar{\beta} \cdot d_N.$ (22)
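A sketch of this adjustment, with hypothetical per-size inputs:

```python
# Sketch of the unseen-fragment adjustment of eqn (22): each unseen fragment
# contributes the mean training coefficient for its size.
def adjusted_prediction(y_hat, unseen_counts_by_size, mean_beta_by_size):
    """unseen_counts_by_size[s]: number of unseen size-s fragments (d_N);
    mean_beta_by_size[s]: mean coefficient over size-s fragments (beta_bar_s)."""
    correction = sum(unseen_counts_by_size[s] * mean_beta_by_size[s]
                     for s in unseen_counts_by_size)
    return y_hat + correction
```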
To examine regression performance, we first consider prediction of roughly 130k atomization energies from the QM9 (ref. 49) dataset computed using Density Functional Theory (DFT) and compare performance against a range of fingerprint-based regression methods applied to this dataset, including one applied by ref. 50. Later, we use splits of this dataset to evaluate our uncertainty quantification and prediction adjustment approaches. We then evaluate our method's performance on solubility prediction using four datasets from ref. 51, with sizes ranging roughly from 400 to 900 molecules, and compare our results to those therein. Finally, we evaluate regression performance on nine drug-discovery-related quantities in datasets with sizes ranging roughly from 700 to 10000 molecules from ref. 52 and compare performance to leaderboards hosted online.53
To evaluate our interpretability projections, we qualitatively examine solubility predictions on the datasets from ref. 51 and, more quantitatively, correlate bond-projected energies to roughly 50000 bond dissociation energies calculated on a set of molecules from ref. 54.
Overall, graphlet-based linear models show stronger performance than both nonlinear models and models constructed with other fingerprints, as seen in Fig. 3, which summarizes our QM9 experiments (additional learning curves including training performance are given in ESI Fig. S1 and S2†). First, considering hierarchical models, performance improves consistently with fragment size to a mean absolute error (MAE) of less than 5 kcal mol−1 for all models. The best of these models uses graphlet fingerprints and attains a test MAE of 1.74 kcal mol−1. Although this is not quite the 1 kcal mol−1 widely considered to be chemical accuracy,62 it is less than the error of the DFT used to calculate these values of ΔHat.50 The RDKit fingerprint-based model follows this performance, with an MAE of 2.15 kcal mol−1. The performance of these two fingerprint variants is likely similar because they capture similar chemical information, i.e., atom-induced subgraphs vs. branched-path fingerprints which can be thought of as edge-induced subgraphs. We hypothesize that the stronger performance of both graphlet and RDKit fingerprints over Morgan fingerprints is because they are a richer representation than the Morgan fingerprints, having many more features at a given size. We further hypothesize that the performance improvement of graphlet fingerprints over RDKit fingerprints is due to the relative redundancy present in RDKit fingerprints: because they are walk based, there are many RDKit fingerprint fragments corresponding to the same exact set of atoms for which there is necessarily only one graphlet.
Fig. 3 Test set performance vs. fragment count by model type, feature type, and hierarchicality on the QM9 ΔHat task. The horizontal axis is shown in number of fragments rather than fingerprint size because the size parameter has different meanings across fingerprint types. The dashed horizontal line at 2.3 kcal mol−1 indicates the accuracy of the DFT calculations that produced these ΔHat values compared to experiment.50 The solid horizontal line gives the common benchmark of “quantum chemical accuracy” at 1 kcal mol−1. The dotted horizontal line at 84.2 kcal mol−1 shows the best MAE attained by fingerprint-based models in ref. 50.
The nonlinear LightGBM models outperform linear models with small fragment counts, but this relationship is reversed as fragment count increases. This is possibly because LightGBM does not capture the additivity of fragment energies reflected in the many body expansion ansatz, which, by contrast, is well-captured by linear models. Instead of assigning an energy to each graphlet, the LightGBM model must examine a large set of features to sift through the combinations which produce good estimates of total energy. This hypothesis is consistent with the rapid increase in error of non-hierarchical LightGBM models with fragment count, which have to consider all fragments at once. Hierarchicality also eliminates feature correlation induced by the inclusion of smaller fragments within large ones, which may explain the trend of slightly better performance among the hierarchical models in general and the much stronger performance of LightGBM using the hierarchical approach.
All models constructed here outperform the best fingerprint-based model performance reported in ref. 50, most by one to two orders of magnitude, regardless of fingerprint type, highlighting the importance of carefully selecting fingerprint parameters. Ref. 50 uses binary ECFP4 (Morgan) fingerprints out to radius 4 folded to a fingerprint length of 1024. At radius 4, we find roughly 500000 Morgan fragments in the training set when using unfolded fingerprints (see ESI Table S1† for the exact number of fragments observed at each size for each fingerprint type). This nearly 500× ratio between unique fragments and fingerprint entries likely explains the limitations of the model from ref. 50.
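A back-of-the-envelope calculation illustrates the crowding (our example, not from ref. 50): with 500000 distinct fragments folded into 1024 bits, essentially every bit is set and each bit conflates hundreds of unrelated fragments.

```python
# Expected occupancy when folding ~500k fragments into 1024 bits.
n_fragments, n_bits = 500_000, 1024
expected_empty = n_bits * (1 - 1 / n_bits) ** n_fragments
print(f"expected unoccupied bits: {expected_empty:.2e}")        # ~0: all bits set
print(f"fragments per bit (mean): {n_fragments / n_bits:.0f}")  # ~488
```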
Fig. 4 Bond-level model interpretability: the projection of a ΔHat model onto bonds along experimental bond dissociation energies (BDE) for ethane, ethene, and ethyne obtained from ref. 63. Energies are in units of kcal mol−1. |
We construct a holdout set of BDEs to evaluate whether the correlation between $\chi^{(2)}$ and BDE is subject to a generalization gap. Half of the 5000 molecules present in both the theoretical BDE dataset and QM9 were held out from training. We then trained a graphlet-based hierarchical linear model on all QM9 molecules except those present in this holdout set (approximately 128k molecules), using a maximum graphlet size of 7.
The bond-projected predictions from these models show reasonable agreement with the theoretically calculated BDEs, especially considering that only single bond dissociations are present. On the held out bonds, we attain a Pearson r of 0.46 over all of the bonds, shown in Fig. 5 (notably, there is very little generalization gap in these correlations, as shown in ESI Fig. S3 and S4†). To test whether this correlation is driven by the relative strengths of single bonds between each element pair (an instance of Simpson's paradox64), we separate the data by element pairs and compute the correlations, shown in Table 1. Within the element pairs, the strength of the relationship between $\chi^{(2)}$ and BDE varies widely, with relatively strong performance for C–C bonds, weak performance for H–O bonds, and moderate performance for the remaining element pairs. Thus, our interpretability scheme recapitulates trends even within most individual bond types. In particular, heavy-atom to heavy-atom bond energies are better correlated with the BDE in comparison to hydrogen-heavy-atom bonds; the variance of the bond explanation for hydrogen atoms is noticeably smaller than the variance of bond explanations for heavy atoms. When interpreting these correlations, it is important to remember that this is a test of empirical correlation between qualitatively similar phenomena; the model was not trained in any way to predict BDEs – rather, it predicts total energies, and the bond-wise interpretation of these predictions is significantly correlated to the BDE. The model is unaware of open-shell molecules, radicals, or ions that are produced in breaking those bonds.
Fig. 5 Relationship between Bond Dissociation Energies (BDEs) computed with DFT and the bond-level interpretations for a model fit to atomization energy on the QM9 dataset. |
Table 1 Pearson correlation between bond-level interpretations $\chi^{(2)}$ and theoretically calculated BDEs, separated by bond type

Bond | n | Pearson r
---|---|---
All | 24695 | 0.4198 |
C–H | 15980 | 0.1414 |
C–C | 4699 | 0.4968 |
C–N | 1064 | 0.3043 |
C–O | 1440 | 0.2266 |
H–N | 990 | 0.2298 |
H–O | 522 | −0.0584 |
Overall, we found that the graphlet fingerprints coupled with linear models predict small molecule solubility in four solvents competitively with nonlinear models from ref. 51, which were trained on expensive DFT- and experiment-based features. Fig. 6 shows test RMSE (log molarity) for each model presented in ref. 51 and for graphlet-based linear and LightGBM models.
Fig. 6 Model performance on the solubility datasets from ref. 51. Root mean squared errors are in units of log molarity. Models from ref. 51 are shown in shades of red in the left-hand side of each panel, models from this work are shown in shades of blue, offset on the right-hand side of each panel. |
Graphlet-fingerprint-based models are competitive on all datasets save for benzene, and are among the best for the water (wide) and water (wide-narrow) sets, while being both inexpensive and easy to interpret based on structural motifs (interpretations of solubility predictions are discussed in the following section, and shown in Fig. 7). Compared to nonlinear LightGBM models, linear models are unexpectedly strong on these tasks. This demonstrates that, surprisingly, our molecular representation coupled with linear models is useful outside the context of predicting extensive properties like energy.
The interesting sub-result of improved performance of the graphlet-based models when moving from the water to water (wide-narrow) suggests a robustness to overfitting. In the former task, only molecules of log-solubility ranging from −4 to −1 are included. In the latter task, the test set is the same, but the training set additionally includes molecules in the wider logS range of −12 to 2. In principle, the test task is statistically identical, but in the wide-narrow version, more information is given for training. Most of the models from ref. 51 nonetheless perform worse on the test task in the wide-narrow version; the new information somehow confounds the models. In contrast, our graphlet approach behaves intuitively – when given more data, it makes strictly better predictions on the same test set.
We note again the relatively low expense of our approach compared to the models in ref. 51; because the latter models rely on features that involve DFT and experimental measurements, applying the model to an arbitrary new chemical can be limited by the speed to calculate, find, or measure these quantities. In contrast, a fingerprinting approach such as graphlet fingerprints can be applied to completely new molecules in timescales far less than a second. For these tasks, there are on the order of hundreds to thousands of graphlet features; precise counts are given in ESI Table S2.†
These tasks form an important and challenging collection of properties: candidate drugs must perform satisfactorily with respect to a host of variables beyond the biochemical mechanism of the drug itself, namely the Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) properties that characterize how the molecule interacts with important aspects of human biochemistry. In ESI Section S5† we briefly overview the individual tasks addressed in this work; for details, please see the TDC documentation.
These properties are contained in datasets of roughly 1000 to 10000 molecules. We followed the train/validation/test splits recommended by the TDC: a challenging generalization evaluation which holds out all molecules built upon a particular scaffold (molecular backbone) and conducts training and hyperparameter optimization on the remaining molecules. Performance is averaged over 5 random training/validation splits. We fit both LightGBM and ridge models with graphlet fingerprints.
Table 2 shows our performance compared to the existing leaderboard entries at the time of writing. A visualization of all of the models' performance for all leaderboard tasks is present in ESI Fig. S6 and S7.† Models using graphlet fingerprints score in the upper half of the leaderboard for seven out of nine of the tasks. Notably, all these high-scoring models used the LightGBM regressor; ridge regression did not perform impressively in these tests and was in the lower half of the leaderboard for five of the nine tasks. This is surprising in the context of ridge regression's strong performance on solubility prediction in Section 3.3, but less surprising in that we expect non-linear, non-extensive models to perform better on biological properties that may not be easy to represent linearly as a sum of all graphlet contributions.
Task | LightGBM perf. | LightGBM rank | Ridge perf. | Ridge rank | Perf. metric |
---|---|---|---|---|---
LD50 | 0.603 | 3/19 | 0.632 | 8/19 | MAE |
AqSol | 0.803 | 5/16 | 1.105 | 15/16 | MAE |
Lipo | 0.519 | 5/17 | 0.516 | 4/17 | MAE |
PPBR | 8.729 | 6/17 | 10.699 | 15/17 | MAE |
Caco2 | 0.316 | 7/19 | 0.306 | 6/19 | MAE |
VDss | 0.500 | 8/17 | 0.448 | 13/17 | Spearman's ρ |
Half life | 0.217 | 12/18 | 0.229 | 11/18 | Spearman's ρ |
CL-Hepa | 0.341 | 13/16 | 0.349 | 12/16 | Spearman's ρ |
CL-Micro | 0.525 | 14/18 | 0.600 | 4/18 | Spearman's ρ |
The explicit uncertainty model is a linear regression mapping the number of fragments of each size to the absolute residuals. This model is fit with non-negative least squares, guaranteeing non-negative residual prediction and giving coefficients with a natural interpretation as the contribution of a single unseen fragment of a given size to the model uncertainty in units of the regression target. The resulting model coefficients are given in ESI Table S3.†
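A sketch of such a model using SciPy's non-negative least squares solver; the array layout is our assumption:

```python
# Sketch of the explicit uncertainty model: non-negative least squares
# mapping per-size unseen-fragment counts to absolute residuals.
import numpy as np
from scipy.optimize import nnls

def fit_uncertainty_model(unseen_counts, abs_residuals):
    """unseen_counts: (n_molecules, N) array; column s-1 holds the number of
    unseen size-s fragments. abs_residuals: (n_molecules,) array |y - y_hat|."""
    coefs, _ = nnls(unseen_counts, abs_residuals)
    return coefs   # coefs[s-1]: uncertainty per unseen size-s fragment

def predict_uncertainty(coefs, unseen_counts):
    return unseen_counts @ coefs   # non-negative by construction
```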
We measure the performance of these uncertainty quantification methods with both correlation coefficients and confidence curves. Confidence curves (CCs) show how the model error changes as data points (here, molecules) with the highest uncertainty are excluded from the test set.65,66 Confidence curves for different UQ metrics are compared quantitatively by comparing their integrals in the so-named AUC metric: the less area under the CC, the better the performance of the UQ metric. To give the AUC a more intuitive meaning, the Area Under the Confidence Oracle (AUCO) metric presented in ref. 65 considers the area between the confidence curve and an oracle curve, that is, the curve obtained by ordering points by decreasing true absolute error, which is the best-case CC for any UQ metric. As AUCO approaches zero, the confidence curve approaches the oracle curve, so smaller AUCO values are better. To provide an even more intuitive functional of the CC, we consider both the oracle and an anti-oracle which discards points at random. This serves as a baseline that any well-performing UQ metric should outperform. Because the anti-oracle throws away points randomly, it has an expected CC equal to the test set MAE. The area between the anti-oracle and oracle (ABAO) thus represents a baseline AUCO. The confidence curve efficiency metric CCeff is then defined as

$CC_{\mathrm{eff}} = 1 - \frac{\mathrm{AUCO}}{\mathrm{ABAO}}.$ (23)
Values of CCeff close to one occur in the best case when the CC approaches the oracle CC, values near zero occur when the UQ metric is no better than random guessing, and negative values occur when the UQ metric is worse than random assignment. In this way, we can think of CCeff as being related to the AUC in the same way that R2 relates to MSE; a perfect CCeff is 1, an uninformative CCeff is 0.
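The following sketch computes confidence curves, AUCO, ABAO, and CCeff as defined in eqn (23); discrete trapezoidal integration is our choice of quadrature:

```python
# Sketch of confidence-curve efficiency (eqn (23)).
import numpy as np

def confidence_curve(abs_err, ranking):
    """MAE of the points remaining after removing the k most-uncertain
    (highest-ranking) points, for k = 0, ..., n-1."""
    kept = abs_err[np.argsort(ranking)[::-1]]   # most uncertain first
    return np.array([kept[k:].mean() for k in range(len(kept))])

def cc_eff(abs_err, uncertainty):
    cc     = confidence_curve(abs_err, uncertainty)   # UQ-metric curve
    oracle = confidence_curve(abs_err, abs_err)       # best-case ordering
    anti   = np.full_like(cc, abs_err.mean())         # expected random baseline
    auco = np.trapz(cc - oracle)                      # area between CC and oracle
    abao = np.trapz(anti - oracle)                    # baseline area
    return 1.0 - auco / abao                          # eqn (23)
```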
All of the proposed measures of uncertainty based on unseen fragments have moderate to strong correlation with absolute residuals, shown in Fig. 9. Confidence curves and CCeff values are shown in Fig. 10. The fraction of unseen fragments performs the strongest under both correlation coefficients and our confidence curve efficiency metric.
We contrast the straightforward interpretability of our approach with complications associated with applying post hoc interpretability methods to models built on molecular fingerprints, which can lead to inconsistencies and confusions. For example, some research23,24 examines SHAP values on the subgraphs corresponding to fingerprint fragments. In this case, in addition to the inconsistency of SHAP discussed in ref. 30, SHAP explanations can include contributions associated with the absence of a particular fragment, thereby explaining properties of molecules based on what they are not. This is reasonable from a statistical perspective, but cannot provide a mechanistic explanation in terms only of the contents of the molecular graph being examined; if an interpretability method entangles the molecule with the training dataset, the resulting interpretation can be unintuitive. By contrast, interpreting linear model coefficients instead focuses only on the fragments that are present in the given molecule and model coefficients for components that are not present are irrelevant to the prediction. We also note that care must be taken to always use unfolded fingerprints when attempting to explain model predictions, or else the one-to-many correspondence between features and molecular fragments67–69 significantly complicates interpretation, if not rendering it fully impossible.
Still, even linear models with large numbers of coefficients can be difficult to interpret, motivating our projection interpretability scheme. In our graphlet models, the typical number of features can easily be in the thousands, and even up to a million features in the case of QM9 with large maximum fragment sizes. Also complicating interpretation of the model is the set of inclusion relationships between the fragments. To combat these issues, we used two techniques. First, we used a hierarchical fitting approach which aims to capture the largest amount of variation using the simplest graphlets, effectively attempting to impose a many-body hierarchy on the learned models. Second, we developed the interpretation projection technique to build atom, bond, or higher-body-order-graphlet level interpretations. We note that a similar interpretation method to our projection scheme is presented in ref. 70, wherein the SHAP attributions of all of the fingerprint fragments containing a given atom are summed, giving an atom-level SHAP contribution; our projections could be considered to be a more general version of this technique. An advantage of our approach used with linear graphlet models is that it is additive in the sense of eqn (21); the explanation breaks the prediction into component parts which add to recapitulate the prediction. Additivity over the feature space is also guaranteed by SHAP, but because this includes the features which are absent from the molecule having nonzero SHAP values as described above, the projection of SHAP values onto atoms or bonds is not additive, that is, the projections do not add to the final prediction.
We also performed uncertainty quantification using the unseen graphlet types not present in training. Our approaches have some similarity to UQ metrics based on the distance of an unseen point to its nearest neighbor(s) in the training set, which have long been applied in molecular property prediction, including with fingerprints71 and more recently with latent neural network representations.72 However, our approach does not require comparison to the training dataset and thus scales to arbitrarily large datasets; the cost of evaluating the UQ metric does not increase with the training set size, which is desirable in general and especially in applications such as molecular screening, which may require very many uncertainty evaluations.
Some similarity may be noted between our work and that of atom-in-molecule-ons (AMONs),73 because each involves analysis of substructures. AMONs constitute a framework for the comparison of 3D configurations; in that lens, they are a composition of a selective (as opposed to exhaustive) fragmentation into subgraphs and molecular similarity kernels.74 3D information about target molecules is typically used in contexts where the target property varies with respect to the input coordinates, for example conformational energy variations; the cheminformatics applications presented in this work are distinct because they do not depend on conformation.
We note some advantages of graphlet fingerprints over other fingerprints, some of which are noted in ref. 9. Graphlet fingerprints may be considered at once more complete than Morgan-like fingerprints and more compact, or less redundant, than RDKit fingerprints. This is visible in the feature counts in Fig. 3, also shown in the ESI, Table S1.† Due to their radial nature, many substructures have no direct representation in Morgan fingerprints. Notably, Morgan fingerprints cannot explicitly represent individual bonds, which carry important chemical and physical meaning. Thus, bond-level interpretations like those in Section 3.2 are impossible with Morgan fingerprints. Likewise, RDKit fingerprints cannot directly represent atoms: paths of length one (bonds) are the smallest fragments in RDKit fingerprints. RDKit fingerprints are also redundant in their representation of fragments in individual molecules when multiple bond paths connect the same set of atoms. For example, in a molecule with a ring containing n atoms, there is precisely one graphlet-based fingerprint fragment containing exactly those atoms, yet RDKit fingerprints will produce n – 1 fingerprint elements containing that same set of atoms, each one missing one bond from the ring. This leads to a many-to-one correspondence between model coefficients and atom subsets, which presents a challenge to directly interpreting these coefficients. This redundancy may also challenge machine learning methods, as the fingerprint vectors will be highly correlated, even when models are fit hierarchically by fragment size. Hierarchical fitting helps to alleviate this redundancy by assigning model contributions to the lowest fragment size possible.
Regarding computational efficiency, we note that Morgan fingerprints are less costly to compute on large molecules with large fragment sizes due to their highly restrictive radial fragment definition. Graphlet fingerprints scale similarly in cost with molecule and fragment size to RDKit fingerprints, though are slightly less costly due to the aforementioned lack of redundancy with respect to subsets of atoms. This is unsurprising, as RDKit fingerprints can be thought of as edge-induced subgraphs rather than node-induced subgraphs.
We note that identifying and counting graphlets has long shown promise in other application areas of machine learning. A kernel based on the number of shared graphlets between graphs has been used to predict functional residues of proteins.75 Due to the potentially combinatorial cost of counting all graphlets on arbitrarily connected graphs, these methods often incorporate random sampling of small graphlets.76 Examining the symmetry relationships between nodes within graphlets has been exploited to understand protein–protein interactions77 and interactions between MicroRNA and small molecules.78 Various spectral methods based on graphlets have been developed79 and applied to problems such as biological network comparison.80,81 Recently, graphlet substructure relationships have been used to improve performance of graph neural networks.45
At the same time, we have shown that the transparent nature of fingerprint techniques comes with many additional advantages. For one, we show that hierarchical linear modeling in the graphlet approach, using a many-body expansion hypothesis, in some circumstances produces a more accurate model which is far more stable with respect to the maximum graphlet size hyperparameter. We also show how graphlet inclusion relationships can be used to assign precise interpretations which decompose the model prediction across the input molecular structure. This was shown to produce reasonable correlation with chemical theory and chemical intuition in the case of both 2-body (bond) and 1-body (atom) projections. Finally, we showed how the interpretability of graphlet-fingerprint-based linear models provides natural methods for uncertainty quantification as well as model adjustments which address distribution shift, in particular, adjustments that estimate the effect of new topological structures arising in test molecules.
Future work could take on a variety of directions. For one, having shown comparable performance to methods from recent literature on a very wide array of tasks, the graphlet featurization approach is suitable for application to new chemical datasets. Methodologically, the uncertainty quantification and interpretability methods discussed in this work are but the tip of the iceberg; a variety of more complex schemes could be explored, and some of them might prove to produce better interpretations or more well-calibrated uncertainty estimates. We believe a comparison of both the interpretability and UQ metrics presented here to alternatives is warranted, along with evaluation of the UQ methods against other frequently used approaches.82 The problem of modeling uncertainty and constructing interpretations when using these features in combination with external features (such as ab initio properties) remains unsolved; in some sense, any feature that is itself a function of the molecular structure cannot present information that is orthogonal to the graphlet histogram, and so a method to project the information gained by models through these features back onto the molecular graph would need to be devised in order to extend the notions of interpretations presented here. In this work, we have concentrated on modeling data whose domain is scalar, that is, where the prediction target for each molecule is a single number. However, graphlet counts can be localized to positions in the graph, and so the graphlet technique could be modified to predict information such as the partial atomic charges within a molecule. Finally, the large number of graphlet types in a fingerprint (up to ≈106 in this work) points to the possibility of using intelligent strategies for pruning the set of features of interest.83
Before concluding, we remind the reader that we have released the code for our approach as an open source package, minervachem, at https://github.com/lanl/minervachem, along with tutorials outlining how to build models using the methods described here. We hope that future developments by ourselves and others will be made available through the library.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00089g
‡ Current address: Analytics, Intelligence, and Technology Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
§ Current address: Department of Computer Science, The University of Chicago, Chicago, IL 60637, USA.
This journal is © The Royal Society of Chemistry 2024 |