Claudio Zeni,*ab Andrea Anelli,cd Aldo Glielmo,be Stefano de Gironcoli b and Kevin Rossi*df
aMicrosoft Research, Cambridge, UK. E-mail: claudiozeni@microsoft.com
bPhysics Area, International School for Advanced Studies, Trieste, Italy
cRoche, Pharma Research and Early Development, Therapeutic Modalities, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Grenzacherstrasse 124, 4070, Basel, Switzerland
dInstitute of Materials, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
eBank of Italy, Italy
fDepartment of Materials Science and Engineering, Delft University of Technology, 2628 CD, Delft, The Netherlands. E-mail: K.R.rossi@tudelft.nl
First published on 15th November 2023
In committee-of-experts strategies, small datasets are extracted from a larger one and used to train multiple models. The predictions of these models are then weighted so that the estimate is dominated by the model(s) most informed in each domain of the data manifold. Here, we show how this divide-and-conquer philosophy provides an avenue for constructing machine learning potentials for atomistic systems which is general across systems of different natures and efficiently scalable by construction. We benchmark this approach on various datasets and demonstrate that divide-and-conquer linear potentials are more accurate than their single-model counterparts, while incurring little to no extra computational cost.
Consequently, a large number of strategies have been developed for constructing fast and accurate MLPs. These include linear7–9 or kernel10,11 methods leveraging a fixed atom-density representation, and deep learning approaches where representations are learned by means of non-linear feed-forward,12 convolution, attention, or message-passing operations.13,14 In particular, the latter paradigm has recently demonstrated state-of-the-art accuracy and robustness in realistic tests.15–20
In spite of their lower accuracy, linear models remain attractive since they are computationally fast in both the training and the prediction stage. It is, however, a matter of debate whether their flexibility suffices to capture the full complexity of interatomic interactions in systems with non-trivial phase diagrams, possibly also presenting very different electronic-structure features across their phase space.
This question also holds, to a certain extent, for any non-specialized MLP, regardless of whether it exploits linear or non-linear approaches. While MLPs are often transferable,21–23 reports in the literature show that specifically tailored training sets yield models that are more accurate in predicting the properties of atomistic configurations mostly similar to the ones in the training set.24–27
In this manuscript, we discuss the development and application of a divide-and-conquer (DC) strategy28,29 to fit accurate linear MLPs. The strategy consists of training a committee of expert models, where each expert is trained on a small, independent, and pre-selected subset of the full training set. The predictions of the specialized members are then combined so that the contributions of the model(s) most likely to be accurate dominate.
Previous reports in the literature also hinted at the benefit of breaking down the problem of fitting a general model for atomistic systems across large and complex training sets. Deringer et al.30 fine-tuned the regularisation coefficients associated with different training points to construct an accurate model leveraging a SOAP representation and a Gaussian process regressor. Mazouin et al.31 and Lemm et al.32 demonstrated that the learning of HOMO–LUMO gaps in QM9 molecules is facilitated when training on subsets that discriminate conformers presenting a ring, a chain, or other characteristic motifs. Cheng et al.33,34 showed that a clustering scheme exploiting chemo-informatics features bypasses the need for human intervention in the discrimination of molecules with different chemistries. They further observed that a regression + unsupervised clustering protocol provides optimal accuracy, transferability, and learning efficiency when applied to the learning of molecular energies in a dataset of thermalized drug-like molecules. Goldsmith et al.35 showed how sub-group discovery enables the unravelling of specific structure–property relationships within sub-populations of a large material-property database. In the domain of coarse-grained force fields, a surface-hopping scheme – where different regions of the conformational space induce the use of different coarse Hamiltonians – has also been successfully developed by Bereau and Rudzinski.36
By testing on community benchmarks, we find that DC linear models consistently outperform their single linear model counterparts regardless of the chemistry of the system considered. This conclusion holds both for the case of small benchmark datasets37 (around 300 global configurations each, comprising bulk cells, surfaces, vacancies, and thermal equilibrium MD runs) as well as for a larger and more complex one30 (around 5000 global configurations comprising liquid, amorphous, polymorphic crystals, and layered structures of bulk phosphorus). The accuracy gain is particularly significant when linear potentials are built under the constraint of modelling low body-order interatomic interactions. The proposed DC approach thus paves the way to a scalable and efficient strategy to enhance the accuracy of fast linear models.
To formalise the DC strategy, we write the relationship between the potential energy E(S) of a system S and the machine learning (ML) function that aims at predicting the latter, under the assumption that it can be decomposed into single atomic energy contributions ε defined for interactions within a cut-off distance, as follows:
E(S) = Σm wm(S) Σi∈S εm(ρi) | (1)
εm(ρi) = Wm·Q(ρi) | (2)
Y = WQ | (3) |
Yd = [Ed, f1x, f1y, f1z,…, fSx, fSy, fSz], | (4) |
Qd = [Σi∈d Q(ρi), −∂Σi∈d Q(ρi)/∂r1x, …, −∂Σi∈d Q(ρi)/∂rSz] | (5)
The computational cost of a force-energy prediction with ridge regression is associated with the calculation of the descriptor Q for the structure d and the single matrix product WQ. For a divide-and-conquer potential comprising M members, the latter step is paid M times, while the cost of the former step remains the same. Since the dominant term is, in both cases, the calculation of the descriptor Q, the cost of performing M matrix products in DC potentials is effectively negligible. The linear potential weights W are found analytically by solving a matrix-inversion problem
W = (QTQ + λI)−1QTY, | (6) |
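The closed-form fit of eqn (6) can be sketched in a few lines; here Q and Y are random stand-ins for the descriptor matrix and the energy/force targets of eqn (3)–(5), not an actual ACE pipeline.

```python
import numpy as np

# Toy sketch of the ridge-regression fit in eqn (6); Q and Y are random
# stand-ins for the descriptor matrix and energy/force targets, not real data.
rng = np.random.default_rng(0)
n_rows, n_feat = 200, 16                      # rows: one energy + 3N force entries per structure
Q = rng.normal(size=(n_rows, n_feat))         # descriptor matrix
W_true = rng.normal(size=n_feat)
Y = Q @ W_true + 0.01 * rng.normal(size=n_rows)

lam = 1e-6                                    # regularisation strength, lambda in eqn (6)
W = np.linalg.solve(Q.T @ Q + lam * np.eye(n_feat), Q.T @ Y)

mae = np.mean(np.abs(Q @ W - Y))              # training MAE, dominated by the added noise
```

Solving the regularised normal equations with `np.linalg.solve`, rather than forming the explicit inverse, is the numerically preferable route to eqn (6).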
dm(S) = ‖(ρ(S) − μm)/σm‖2 | (7)
wm(S) = dm(S)^−4/Σm′ dm′(S)^−4 | (8)
We empirically find that using a fourth power in the denominator of eqn (8) yields smooth transitions between clusters, while not reducing the accuracy of potentials.
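A minimal sketch of this weighting scheme follows; the centroids and scales are made-up stand-ins for the cluster statistics μm and σm that, in practice, come from clustering the training-set representations.

```python
import numpy as np

# Sketch of the DC committee weights of eqn (7) and (8); centroids and scales
# are illustrative stand-ins for the cluster statistics mu_m and sigma_m.
def dc_weights(rho, centroids, scales, power=4):
    # Scaled distance of the global representation rho from each centroid (eqn (7)).
    d = np.linalg.norm((rho - centroids) / scales, axis=1)
    # Inverse fourth-power weighting, normalised to sum to one (eqn (8)).
    inv = 1.0 / np.maximum(d, 1e-12) ** power
    return inv / inv.sum()

centroids = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # hypothetical mu_m
scales = np.ones_like(centroids)                            # hypothetical sigma_m

w = dc_weights(np.array([0.05, 0.0]), centroids, scales)    # a point near the first centroid
```

Near a centroid the corresponding expert receives almost all the weight, while a point equidistant from two centroids blends their predictions smoothly, which is what preserves continuity of the potential energy surface.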
The number of expert models, M, is a key variable in a DC strategy: too large a value of M may lead to overfitting. In practice, we treat M as an additional model hyper-parameter and select the value which maximises the following score:
(9) |
Another free parameter in the DC MLP fit is the regularization term of each expert model. In principle, this hyper-parameter could be optimized for each specialized model. In practice, we observe negligible gain and employ the same regularization term for every expert model within each case study.
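The hyper-parameter selection above can be illustrated on a toy problem; the clustering, the experts, and the plain held-out validation MAE used as the score below are simplified stand-ins, not the score of eqn (9).

```python
import numpy as np

# Toy illustration of selecting the number of experts M: a piecewise-linear
# 1-D "potential", a minimal k-means, one linear expert per cluster, and a
# plain held-out MAE as the selection score (a stand-in, not eqn (9)).
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=400)
y = np.where(x < 0, -2.0 * x, 3.0 * x) + 0.05 * rng.normal(size=400)
x_tr, y_tr, x_va, y_va = x[:300], y[:300], x[300:], y[300:]

def kmeans_1d(pts, M, iters=50):
    c = np.quantile(pts, np.linspace(0.1, 0.9, M))          # simple quantile init
    for _ in range(iters):
        lbl = np.abs(pts[:, None] - c[None, :]).argmin(axis=1)
        c = np.array([pts[lbl == m].mean() if np.any(lbl == m) else c[m]
                      for m in range(M)])
    return c

def validation_mae(M):
    c = kmeans_1d(x_tr, M)
    lbl = np.abs(x_tr[:, None] - c[None, :]).argmin(axis=1)
    # One linear expert (slope + intercept) per cluster.
    experts = [np.polyfit(x_tr[lbl == m], y_tr[lbl == m], 1) for m in range(M)]
    # Hard assignment at prediction time, a simplification of the smooth eqn (8).
    lbl_va = np.abs(x_va[:, None] - c[None, :]).argmin(axis=1)
    pred = np.array([np.polyval(experts[m], xv) for m, xv in zip(lbl_va, x_va)])
    return np.mean(np.abs(pred - y_va))

scores = {M: validation_mae(M) for M in (1, 2, 3)}
best_M = min(scores, key=scores.get)
```

A single linear model cannot resolve the kink at x = 0, whereas two or more experts recover each regime; the selection then picks the smallest committee that captures the heterogeneity of the data.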
While the current implementation performs the evaluation of wm according to a distance criterion, we do not exclude that other approaches, e.g., supervised ones, could provide room for improvements in accuracy. Similarly, we do not exclude that more sophisticated approaches to combine the predictions of the different expert models could further enhance the performance of a DC MLP.
To conclude, we note a parallelism between our DC approach and the one inherent to linear decision trees (where linear models are fitted to the data in each leaf of a decision tree), with the caveat that continuity in our predictions is ensured by the DC model weighting scheme (eqn (1)). By the same token, we highlight that eqn (1) could also be interpreted as a perceptron model. In this view, wm acts as an activation function, which depends on the global structure of the system, while the single-atom contributions are calculated using linear ML units. Furthermore, we observe that one could also consider the DC weight evaluation as a classification step, through which the most suitable linear model (among the set of available ones) is selected, given a certain test point.
To build linear and DC linear models, we represent local atomic environments up to the cut-off radii indicated by Zuo et al.37 (Mo = 5.2 Å, Si = 4.7 Å, Ge = 5.1 Å, Cu = 3.9 Å, Ni = 4.0 Å, and Li = 5.1 Å) and employ a fixed total radial and angular basis set expansion order Nmax + Lmax = 12 for the ACE descriptors. We fix force and energy regularisations at 10−6, while the number of specialised models is optimized for each dataset according to the criterion of eqn (9) and is MMo = 7, MSi = 2, MGe = 3, MCu = 5, MNi = 6, and MLi = 2.
In the left panel of Fig. 2 we report MAEs on energy predictions with a 2-, 3-, or 4-body descriptor (top to bottom graphs). In the right panel of Fig. 2 we present the same information for the force MAEs. We observe that DC models consistently outperform linear ones. The accuracy gain is marginal (a few percent) for systems which were already accurately predicted by the linear model, i.e., Cu, Ni, and Li. The improvement is instead quite sizeable for the two other systems, namely Mo and Ge. Force-and-energy prediction in Si is the only case where a balanced competition between the two approaches is observed: the linear model is slightly more accurate in energy predictions, while the DC model displays a better performance in force predictions.
Fig. 2 Box plot for test error on atomic energies (left column) and forces (right column) yielded using linear models (blue) and DC models (orange) for the Zuo et al.37 dataset. The first row refers to 2-body ACE descriptors, the second to 3-body, and the third to 4-body ones. |
An analysis of the DC accuracy, as a function of the chosen interaction body-order, shows that more significant (relative) accuracy gains take place at lower body-orders. We note that low body-order MLPs are also the fastest to compute and, in turn, DC approaches provide a promising route in the deployment of low-resource, accurate, and interpretable MLPs.
Additionally, the proposed DC approach provides a relative improvement which is larger for energy predictions than for force predictions. We rationalize this trend in light of the fact that we cluster data points according to global features. This, in turn, is likely to facilitate the learning of global properties, such as the system total energy. While not explored in the present manuscript, we note that the DC approach could be modified so as to cluster data based on local atomic features rather than on summary statistics of those quantities.
We fit a linear and a linear DC potential using the ACE representation with a radial cut-off of 5 Å, a maximum basis set size of 12, and considering correlations up to the 5-body order. A regularisation parameter of 10−7 is used for both the linear and the DC model fits. The DC potential fitting finds an optimal number of M = 6 clusters, following the approach detailed in eqn (9).
In Fig. 3, we report the distribution of errors incurred in energy (left panels) and force (right panels) predictions for the linear (top panels) and DC (lower panels) models. A parity plot in the inset shows the relationship between true and predicted values. We observe that, also in the case of this more complex benchmark, the DC accuracy is consistently higher than that of the linear model. In particular, the accuracy gain is, again, mostly observed in the energy fitting (LP MAE = 0.128 eV per atom vs. DC MAE = 0.060 eV per atom), while forces improve by a narrower margin (LP MAE = 0.250 eV Å−1 vs. DC MAE = 0.238 eV Å−1). Crucially, the partitioning of the fitting into smaller models has the largest effect on the tails of the force and energy error distributions, which are less populated for the DC models.
Fig. 3 Error distribution in energy (left panels) and force (right panels) predictions for the phosphorus dataset by Deringer et al.30 incurred by a linear (blue – upper panel) and a DC (orange – lower panel) model. The inset shows parity plots between true and predicted values. Points are colour coded according to the density (yellow = high density and blue = low density) in that region. |
The complexity of the P database and the remarkable errors incurred by the linear and the DC models in certain regions of the phase space motivate a more detailed discussion. To this end, we analyse in Fig. 4 the relationship between the energy errors incurred by the linear and the DC models for each test structure in the P database, the minimum distance of each test point from the DC cluster centroids, and the L2 norm of the DC model weights.
The Fig. 4 left panel shows the kernel density estimate of the distribution of the (DC and LP) errors, where the straight line is a guide to the eye indicating equal accuracy between the predictions of the linear and DC potentials. All points lying above (below) the line correspond to configurations where the DC energy prediction is more (less) accurate than that of the linear model. The plot thus highlights, from an additional perspective, that the DC model is more accurate than the linear one for the majority of the test points.
The Fig. 4 central panel illustrates the relationship between the error incurred in energy prediction on the phosphorus dataset by the DC model and the minimum scaled distance of ρ(S) from the cluster centroids. A (positive) correlation between the two quantities emerges when looking at the behaviour of the 90th percentile of the atomic energy error incurred by the DC model as a function of the minimum distance from the cluster centroids (black line and dots). We deduce that the (scaled) minimum distance from the cluster centroids, minm‖(ρ(S) − μm)/σm‖2, can provide an upper bound to the error incurred on energy predictions, as structures whose representations are far from every model centroid are also likely to be out of the training distribution. This finding is in line with previous observations25 about the interplay between linear model error and the degree to which a test point lies in a well-sampled region of the representation space.
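This distance-based indicator is cheap to compute at prediction time; a hedged sketch follows, where the centroids, scales, and threshold are illustrative (in practice they would come from the trained DC model and a calibration on validation errors).

```python
import numpy as np

# Sketch of the out-of-distribution indicator discussed above: the minimum
# scaled distance from any cluster centroid. Centroids, scales, and the
# threshold are illustrative, not values from the paper.
def min_scaled_distance(rho, centroids, scales):
    return np.linalg.norm((rho - centroids) / scales, axis=1).min()

centroids = np.array([[0.0, 0.0], [2.0, 2.0]])
scales = np.full_like(centroids, 0.5)

d_in = min_scaled_distance(np.array([0.1, 0.1]), centroids, scales)    # near a centroid
d_out = min_scaled_distance(np.array([5.0, -3.0]), centroids, scales)  # far from all centroids
threshold = 2.0   # hypothetical cut-off calibrated on validation errors
```

Structures whose minimum distance exceeds the calibrated threshold could then be flagged as likely to incur large errors, in the spirit of the upper bound discussed above.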
The Fig. 4 right panel displays DC errors as a function of the L2 norm of the DC weights. The latter is generally close to 1, suggesting that, for each prediction, a single expert model largely contributes to the overall DC outcome. While good accuracies are witnessed when a single model dominates the predictions within the DC potential, significant errors may nonetheless be registered even when a (presumed) expert model dominates the committee prediction. We rationalize this observation in terms of the possible detrimental effect of the soft-max regularization of the DC weights (eqn (7) and (8)) on the correlation between DC error and distance from the closest cluster centroid: the closest model will be chosen with high certainty even when the structure is far from every DC model centroid.
To further characterize the working mechanisms of a DC model, we report in Fig. 5 the error incurred by each member m of the DC model against the error incurred by a linear model trained on the full dataset, and colour-code points S according to the weight wm(S) of the expert model m in the overall DC prediction. Fig. 5 shows that sizeable changes in the prediction error take place when different sets of training configurations are considered. In particular, each expert model offers accurate predictions, on average superior to those of the linear model trained on the full dataset, in specific regions of the phase space. These correspond to points where the expert model's relative contribution to the DC prediction is sizeable (i.e., above 0.5). Significantly, there exist configurations for which neither the expert models nor the full linear model are truly equipped to provide an accurate prediction. On these occasions, a single expert model may still be elected as the most specialized one for a given prediction; this behaviour follows from the geometric criterion defined in eqn (7) and (8).
From the analysis of Fig. 4 and 5, we conclude that the functional form in eqn (1) provides a grounded route to weight the predictions of expert models, so as to elect the most accurate ones as the most significant in the DC prediction. Nevertheless, the DC approach offers no guarantee of improved robustness in regions of the phase space which have not been densely sampled during the construction of the training set.
Alternative approaches could otherwise leverage the DC weights' estimate stage to further introduce anomaly detection or one-class classification and readily detect datapoints for which a model trained on the available dataset is more likely to provide unfaithful predictions.
As with all linear and kernel-based methods, the re-training of a DC ensemble of potentials can be performed analytically and efficiently, without costly gradient-based optimisation of parameters. Moreover, since potentials within the DC approach are localized in the space of descriptors, only models for which new incoming data are relevant need to be updated during an active learning loop, thus further reducing the computational resources required.
By the same token, the proposed DC strategy can naturally be evolved to support Δ-learning45,46 schemes. Under the assumption that a fairly accurate and largely transferable force field (see, e.g., ref. 50 and 51) or MLP (see, e.g., ref. 17, 26, 52 and 53) exists, we envision a strategy where the latter is used as a baseline, and a number of expert model corrections, whose learning is efficient by virtue of transfer learning,26 act on the general model to further improve its accuracy.
One could also write a DC correction to an existing baseline energy model as
E(S) = Ebaseline(S) + Σm wm(S) Σi∈S εm(ρi) | (10)
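A minimal sketch of such a baseline-plus-correction scheme follows; the baseline model, the expert corrections, and the data are all toy stand-ins introduced for illustration.

```python
import numpy as np

# Toy sketch of a DC correction on top of a fixed baseline (eqn (10)):
# smoothly weighted linear expert corrections added to a baseline energy.
# Baseline, experts, and data are illustrative stand-ins.
def baseline_energy(rho):
    return 0.5 * rho.sum()        # stand-in for a transferable force field or MLP

def dc_delta_energy(rho, centroids, W_experts, power=4):
    d = np.linalg.norm(rho - centroids, axis=1)
    w = 1.0 / np.maximum(d, 1e-12) ** power
    w /= w.sum()                                     # committee weights, as in eqn (8)
    corr = np.array([Wm @ rho for Wm in W_experts])  # linear expert corrections
    return baseline_energy(rho) + w @ corr

centroids = np.array([[1.0, 0.0], [0.0, 1.0]])             # hypothetical cluster centres
W_experts = [np.array([0.2, -0.1]), np.array([-0.3, 0.4])]  # hypothetical expert weights

E = dc_delta_energy(np.array([1.0, 0.0]), centroids, W_experts)
```

For a structure sitting exactly on a centroid, the corresponding expert correction dominates and the prediction reduces to the baseline plus that single expert's term.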
While the accuracy of DC MLPs leveraging linear models may not be on par with that of models exploiting learned representations, DC models display a substantial accuracy gain over linear potentials in exchange for a negligible computational burden. This result makes them an attractive tool when speed and efficiency in training and prediction are key figures of merit.
At a speculative level, we discuss how the DC approach could be extended to provide uncertainty estimates and could be efficiently incorporated into Δ-learning and active-learning strategies.
By showing that a committee of experts strategy can be successfully leveraged for MLP development, our work opens a new avenue for the design of accurate and scalable ML models for materials.
This journal is © The Royal Society of Chemistry 2024