Matthew D. Witman*a and Peter Schindler*b
aSandia National Laboratories, Livermore, California 94551, USA. E-mail: mwitman@sandia.gov
bNortheastern University, Boston, Massachusetts 02115, USA. E-mail: p.schindler@northeastern.edu
First published on 9th December 2024
Machine learning (ML) models in the materials sciences that are validated by overly simplistic cross-validation (CV) protocols can yield biased performance estimates for downstream modeling or materials screening tasks. This can be particularly counterproductive for applications where the time and cost of failed validation efforts (experimental synthesis, characterization, and testing) are consequential. We propose a set of standardized and increasingly difficult splitting protocols for chemically and structurally motivated CV that can be followed to validate any ML model for materials discovery. Among several benefits, this enables systematic insights into model generalizability, improvability, and uncertainty, provides benchmarks for fair comparison between competing models with access to differing quantities of data, and systematically reduces possible data leakage through increasingly strict splitting protocols. When performing thorough CV investigations across increasingly strict chemical/structural splitting criteria, local vs. global property prediction tasks, small vs. large datasets, and structure- vs. composition-based model architectures, some common threads are observed; however, several marked differences exist across these exemplars, indicating the need for comprehensive analysis to fully understand each model's generalization accuracy and potential for materials discovery. To this end, we provide a general-purpose, featurization-agnostic toolkit, MatFold, to automate the reproducible construction of these CV splits and encourage further community use in model benchmarking.
One approach to constructing OOD test sets is to utilize unsupervised clustering with a chosen materials featurization and then conduct leave-one-cluster-out CV (LOCO-CV). For example, for compositional models of superconducting transition temperatures, LOCO-CV revealed how generalizability and expected accuracy are drastically overestimated due to data leakage in random train/test splits.8 Omee et al. investigated the performance of OOD prediction tasks on MatBench9 datasets (refractive index, shear modulus, and formation energy) utilizing structure-based graph neural network (GNN) models and LOCO-CV (k-means clustering and t-distributed stochastic neighbor embedding).10 Hu et al. similarly utilized LOCO-CV to study how various domain adaptation algorithms improve OOD generalizability in materials property predictions (experimental band gaps and bulk metallic glass formation ability).11
Quantifying distribution shifts in materials databases over time and identifying whether specific samples are OOD have been shown to be critical for developing databases and models with greater robustness and generalizability.12 Whether data points are OOD can be assessed based on their distance to the training data in feature space (e.g., via kernel density estimates2). Data bias arising from uneven coverage of materials families may also be mitigated by entropy-targeted active learning.13
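As an illustrative sketch of this feature-space approach (generic code with an assumed featurization and placeholder bandwidth/threshold choices, not tied to any of the cited studies), OOD-ness can be scored by fitting a kernel density estimate to the training features and flagging query points with anomalously low log-density:

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.preprocessing import StandardScaler

def kde_ood_scores(X_train, X_query, bandwidth=0.5):
    """Log-density of query points under a Gaussian KDE fit to the standardized
    training features; lower scores indicate likely OOD samples."""
    scaler = StandardScaler().fit(X_train)
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(scaler.transform(X_train))
    return kde.score_samples(scaler.transform(X_query))

# Flag query points whose log-density falls below, e.g., the 1st percentile of the
# training-set scores (the threshold choice is arbitrary and problem dependent):
# threshold = np.percentile(kde_ood_scores(X_train, X_train), 1)
# ood_mask = kde_ood_scores(X_train, X_query) < threshold
```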
Alternative methods for defining OOD splits without relying on the feature space include using (i) target property ranges, (ii) time or date thresholds when data was added, or (iii) general materials information, such as structure, chemistry, or prototype/class. Splits based on target-property-sorted data14 can facilitate the discovery of materials with extraordinary target properties7 and have also been used in “k-fold forward CV”.15 Splitting datasets based on when data points were added mimics acquiring new, unseen data that may realistically be considered OOD.14,16,17 Lastly, OOD generalization has recently been studied for formation energy models with structural and chemical hold-outs.18
To further encourage standardized reporting of these types of detailed insights into generalization performance and limitations of ML-based models in the materials sciences, here we provide “MatFold” as a featurization-agnostic programmatic tool for automatically generating CV splits for arbitrary materials datasets and model architectures, such as structure-based19 or composition-based20 models. Specifically, we propose a standardized series of CV splits based on increasingly difficult chemical/structural hold-out criteria, dataset size reduction, nested vs. non-nested splits, and others. By assessing model performance across various combinations of MatFold splitting criteria, one could, for example, more fairly compare the performance of differing approaches with the same modeling objectives. This approach allows for a better understanding of how well models' predictions generalize with increasingly difficult chemical or structural hold-out criteria. Additionally, it can determine the expected model improvement with continued data acquisition and assess whether this improvement depends on the splitting criteria used for OOD generalization. Furthermore, the method evaluates whether nested CV ensembles enhance OOD predictions and quantifies the extent of this improvement. It also examines the reliability of uncertainty estimates derived from nested CV ensembles and whether this reliability varies based on the splitting criteria used for assessing generalization.
To practically demonstrate the utility of insights derived from MatFold, we select ML exemplars from our previous work (modeling vacancy formation energies21 and surface work functions22). These are examples of structure-based ML where data leakage can be very problematic since multiple training examples are derived from the same base crystal structure. For example, many structures may contain vacancy sites that are determined to be unique but are in fact nearly identical because they are only slightly above the symmetry tolerance. Similarly, surfaces with different Miller indices cleaved from the same base crystal structure may be extremely similar. In either exemplar, the expected model error for inference (i.e., materials screening) can vary by factors of 2–3, depending on the splitting criteria. For comparison, we additionally include a standard benchmarking exemplar from MatBench9 (bulk modulus). Through detailed insights into expected model performance in these exemplars and how it compares/differs across various splitting criteria, dataset sizes, and the exemplars themselves, we motivate MatFold as an easy-to-use, lightweight, and open-source tool for the materials ML community to enable the generation of reproducible data splits that deliver greater insights into model generalizability, improvability, and uncertainty.
Based on the user's choices of D, T, K, L, and CK, MatFold can typically create thousands of splits. The feasibility of training this many models depends on the dataset size and modeling approach and may be limited, for example, for recently developed universal ML potentials.23–26 However, dataset and model sizes are often small enough for more specialized ML-based materials discovery models to perform splits across at least some subset of the criteria summarized in Table 1. In the subsequent exemplars based on our previous work (modeling vacancy defect formation energies21 and surface work functions22) and a standard MatBench example (modeling bulk modulus9), we train thousands of models on splits generated by MatFold to obtain improved insights into our models' generalizability and limitations. An overview of the ΔHV, ϕ, and log(KVRH) datasets and the chosen MatFold split protocols for each are listed in Table 2. Model hyperparameters are fixed at the optimal conditions as determined in the respective previous work.9,21,22
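To make the grouped splitting idea concrete, the following is a minimal, generic sketch of a nested (K, L)-fold protocol in which outer folds hold out whole groups (e.g., chemical systems). It is written against scikit-learn's GroupKFold rather than MatFold's own interface, and the DataFrame column names are assumptions for illustration only:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

def nested_group_splits(df, outer_label="chemsys", inner_label=None, K=10, L=10):
    """Yield (k, l, train_idx, val_idx, test_idx) index tuples for a nested
    (K, L)-fold protocol: outer folds hold out whole groups of `outer_label`
    (e.g., chemical systems); inner folds split the remaining data either
    point-by-point (inner_label=None, an unshuffled plain L-fold split) or by
    another grouping label."""
    outer = GroupKFold(n_splits=K)
    for k, (outer_train, test_idx) in enumerate(outer.split(df, groups=df[outer_label])):
        sub = df.iloc[outer_train]
        inner_groups = list(sub[inner_label]) if inner_label else list(range(len(sub)))
        inner = GroupKFold(n_splits=L)
        for l, (train_sub, val_sub) in enumerate(inner.split(sub, groups=inner_groups)):
            yield k, l, outer_train[train_sub], outer_train[val_sub], test_idx
```

Note that element-based hold-outs (CK = Elements) do not form a partition of the dataset, since a structure containing several elements belongs to several groups at once; in that case, leave-one-element-out test sets must be enumerated explicitly, with every structure containing the held-out element removed from the training set.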
| | Vacancy formation energy (ΔHV) | Work function (ϕ) | Bulk modulus log(KVRH) |
|---|---|---|---|
| # Data points | 1670 | 58 […] | 10 987 |
| # Of unique: | | | |
| Structures | 250 | 3716 | 10 987 |
| Compositions | 230 | 3623 | 9321 |
| Chemsys | 114 | 2832 | 6946 |
| Space groups | 35 | 62 | 173 |
| Elements | 18 | 77 | 78 |
| 〈target〉 | 5.8 eV | 3.92 eV | 1.88 |
| σ(target) | 3.5 eV | 0.86 eV | 0.37 |
| Model type | dGNN | RF | {CGCNN, RF} |
| D | {0.1, 0.25, 0.5, 0.75, 1.0} | {0.05, 0.1, 0.25, 0.5, 1.0} | 1.0 |
| T | {None, binary} | None | None |
| S | {K, (K, L)} | {K, (K, L)} | (K, L) |
| CK: | | | |
| Random | K = 10, L = 10 | K = 10, L = 10 | — |
| Structure | K = 10, L = 10 | K = 10, L = 10 | K = 5, L = 5 |
| Composition | K = 10, L = 10 | K = 10, L = 10 | — |
| Chemsys | K = 10, L = 10 | K = 10, L = 10 | — |
| Elements | K = LOO, L = 10 | K = LOO, L = LOO | K = {LOO, 5}, L = 5 |
| PT group | — | K = LOO, L = LOO | — |
| Space group | K = 10, L = 10 | K = 10, L = 10 | K = 5, L = 5 |
| Point group | — | K = LOO, L = LOO | — |
| Crystal sys | — | K = LOO, L = LOO | — |
| CL | Random | CK | Random |
| # Total splits | 7480 | 4046 | 930 |
To evaluate the model performances, we denote the mean absolute error of an outer test set as MAE = (1/Nk) ∑i |p̂i − pi|, where Nk is the number of samples in the outer fold k, p̂i is the model prediction for sample i, and pi the true value. The expected model performance is given as the ensemble average over the set of all K folds, 〈{MAE}K〉. For non-nested CV, i.e., K-fold, p̂i in the kth test set is predicted by a single model trained on the kth train set. For nested CV, i.e., (K, L)-fold, the final prediction is the ensemble average over the set of inner model predictions on the outer test set, 〈{p̂i}L〉. The deviation of this ensemble-averaged prediction from the true value is referred to as the residual, calculated as |pi − 〈{p̂i}L〉|. Importantly, nested CV also yields an uncertainty metric via the standard deviation over the set of inner model predictions on the outer test set, σ({p̂i}L).
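As a concrete illustration of these quantities (a minimal sketch with hypothetical array names, not tied to any particular model), the per-fold metrics of a nested (K, L)-fold protocol can be computed from the matrix of inner-model predictions on one outer test fold:

```python
import numpy as np

def nested_fold_metrics(preds, truth):
    """Metrics for one outer fold of a nested (K, L)-fold CV.

    preds: array of shape (L, N_k) holding the L inner models' predictions on
           the N_k outer test samples; truth: array of shape (N_k,).
    """
    ensemble_mean = preds.mean(axis=0)         # <{p_hat_i}_L>
    residuals = np.abs(truth - ensemble_mean)  # |p_i - <{p_hat_i}_L>|
    sigma = preds.std(axis=0)                  # sigma({p_hat_i}_L), the uncertainty metric
    mae = residuals.mean()                     # MAE of this outer fold
    return mae, residuals, sigma

# The expected performance <{MAE}_K> is then the average of `mae` over all K outer folds.
```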
We note that for datasets with strong imbalances in splitting labels (e.g., an element present in almost the entire dataset vs. another element being present only in a tiny fraction of the dataset) the MAE and its standard deviation may be affected by the random seed during split generation. This can be mitigated in MatFold by specifying a minimum and maximum threshold of split label prevalence that determines whether that label is considered during the CV procedure or is always enforced to be in the training set. For example, if oxygen is present in 90% of structures in the dataset and the user specifies a maximum threshold of 0.9, then oxygen-containing structures will be part of the training set by default during CV.
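A minimal sketch of such a prevalence filter (generic logic with assumed names, not MatFold's actual option names or defaults), here for element labels that can occur in many structures at once:

```python
from collections import Counter

def eligible_holdout_elements(structure_elements, min_frac=0.01, max_frac=0.9):
    """Return elements eligible to serve as hold-out split labels.

    structure_elements: iterable of per-structure element sets. Elements whose
    prevalence lies outside [min_frac, max_frac] (e.g., oxygen present in >90%
    of structures) are excluded from CV and thus always remain in the train set.
    """
    structure_elements = list(structure_elements)
    n = len(structure_elements)
    counts = Counter(el for elems in structure_elements for el in set(elems))
    return [el for el, c in counts.items() if min_frac <= c / n <= max_frac]
```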
Fig. 2 ΔHV test set parity plots for D = 1.0 and various CK criteria for (a) T = None and S = (K) and (b) T = Binary and S = (K, L). (c) Expected MAE for various split criteria and combinations of other MatFold options such as T = {None, Binary} and S = {K, (K, L)}. Quartile box plots of MAE and R2 values are shown in ESI Fig. 5.† (d) Correlation of residual vs. standard deviation of ensemble predictions (purple circles), with R2 shown in the inset. Additionally, within individual bins of σ({ΔĤV}L), the average and standard deviation of residuals in that bin are shown with white circles and red error bars, respectively. The cyan line represents y = x.
Fig. 2a shows density parity plots of all outer test set predictions for CK = {Random, Structure, Composition, Chemsys, SG#, Elements} and D = 1.0, T = None, and S = K-fold, while Fig. 2b shows the same but for T = Binary and S = (K, L)-fold. The color code is on a logarithmic scale with respect to the number of predictions at that grid point. Note that for this dataset, we are able to compute all ΔHV for at least one of each binary oxide in the chemical space of interest, motivating the investigation of automatically assigning binaries to the training data. Immediately noticeable in Fig. 2b is the mitigation of over-fitting for some of the largest outliers observed in Fig. 2a. Additional analysis in the ESI,† applicable only to this exemplar, investigates the CK = Elements parity plots at a more granular level and further reveals insights into the generalization capabilities of dGNN.
Fig. 2c further quantifies the expected model error as a function of T, S, and CK and highlights several key conclusions. The expected MAE is strongly influenced by the splitting criteria and generally increases in the order Random < Structure < Composition < Chemsys < SG# ≪ Elements, where error bars correspond to σ({MAE}K). For this particular dataset and model, using a single training model for inference (blue bars) generally produces an expected MAE ∼10–20% higher than using the ensemble of models from nested CV (green bars) across all CK. From a different perspective, one would overestimate the expected MAEs by ∼10–20% if using the ensemble of non-nested K-fold models to perform inference for materials screening exercises, compared to the MAEs calculated by nested (K, L)-folds. Further assigning all binaries to the train set generally helps but has less of an impact than model ensembling.
More importantly, the choice of CK has a very strong influence on the expected MAE. The goal of this and many other specialty ML models for materials discovery, trained on small- to medium-sized datasets (∼100s–1000s of examples), is to screen properties of structures that represent entirely new compositions, or even chemical systems, outside the training data. For this use case, performing a purely random split introduces substantial data leakage, which leads to a ∼30% underestimation of the expected MAE when, for example, predicting defects in a structure that represents a chemical system unseen in the training data. As an even more extreme example, CK = Elements reveals a ∼2.5 times higher expected MAE than a purely random split, although ensembling can reduce the expected MAE by ∼30% relative to a non-ensemble prediction.
Fig. 2d confirms that the standard deviation of predictions over model ensembles is a useful uncertainty metric29,30 in this modeling application, but with some limitations. The individual residuals for any given test prediction (blue circles) are only very weakly correlated with σ({ΔĤV}L). However, computing the average and standard deviation of residuals within a given bin of σ({ΔĤV}L) (red markers and error bars, respectively) collapses the data onto the y = x parity line (cyan). Therefore, on average a low σ({ΔĤV}L) is correlated with a low residual, but there is a non-negligible probability of individual predictions with very large residuals despite low uncertainty.
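A minimal sketch of this binned calibration check (assumed array names; equal-width bins are an arbitrary choice):

```python
import numpy as np

def binned_calibration(sigma, residuals, n_bins=10):
    """Average residual (and its spread) within equal-width bins of the ensemble
    standard deviation; for a well-calibrated uncertainty metric these bin
    averages should lie close to the y = x parity line."""
    edges = np.linspace(sigma.min(), sigma.max(), n_bins + 1)
    centers, mean_res, std_res = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (sigma >= lo) & (sigma < hi)
        if mask.any():
            centers.append(0.5 * (lo + hi))
            mean_res.append(residuals[mask].mean())
            std_res.append(residuals[mask].std())
    return np.array(centers), np.array(mean_res), np.array(std_res)
```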
The final key insight from the MatFold analysis stems from the dependence of the expected MAE on both CK and D. Fig. 4 plots the expected MAE for D = {0.1, 0.25, 0.5, 0.75, 1.0}, expressed on the x-axis as the number of defect examples in the training data. Data leakage and underestimation of the expected MAE are even more pronounced for the smallest dataset, and the rapid plateauing of the expected MAE with increasing data is potentially indicative of the absolute accuracy limit of the model. For more realistic screening criteria, i.e., CK = {Composition, Structure, Chemsys}, large accuracy gains are and will continue to be obtained with increasing data collection. Interestingly (and perhaps intuitively), for CK = Chemsys the improvement qualitatively appears to saturate before the other criteria, but this can only be confirmed with additional data collection. Finally, CK = Elements reveals that additional data collection does not increase the accuracy during inference on compounds containing unseen elements. In fact, the error intermittently increases slightly with additional data collection because new data may introduce compounds with new test set elements that are even more difficult to extrapolate to from the elements contained in the train set (see ESI† for more details).
We utilize similar split options as for the defect dataset (see Table 2), except we do not automatically assign binary compounds to the training set (i.e., here we use only T = None). Moreover, for the LOO splits we excluded labels that make up less than 5% or more than 40% of the whole dataset to avoid unbalanced test set sizes (the user can specify these thresholds in MatFold). A total of 4046 unique splits were generated. As discussed in the previous section and shown in Fig. 2 for the defect dataset, Fig. 3a and b show density parity plots of all outer test set (D = 1.0) predictions for CK = {Random, Structure, Composition, Chemsys, SG#, Elements} for non-nested S = K-fold and nested S = (K, L)-fold splits, respectively.
Fig. 3 Parity plots of DFT-calculated vs. ML-predicted work functions are shown for (a) K-fold and (b) nested (K, L)-fold splits for different splitting strategies. The color scale is logarithmic w.r.t. the number of structures at that grid point. The corresponding MAEs are displayed in (c) for K-fold and nested (K, L)-fold splits in green and orange, respectively. Quartile box plots of MAE and R2 values are shown in ESI Fig. 7.† The residuals, |ϕ − 〈{ϕ̂}L〉|, are plotted against the ensemble standard deviation σ({ϕ̂}L) in (d).
Fig. 3c summarizes the MAEs for the parity plots displayed in (a) and (b). Unlike for the defect dataset, the MAEs and their standard deviations for the work function dataset are very similar between the non-nested and nested splitting strategies. This likely stems from the GNN-based defect model being more prone to overfitting than the 15-feature RF model; hence, the GNN model benefits more from statistical averaging during nested splitting. As for the defect dataset, the MAEs increase in the order Random < Structure < Composition < Chemsys < Elements. However, an interesting difference is that the SG# split, not the Elements split, exhibits the highest MAE (219 and 149 meV, respectively, for nested splits). Compared to the MAE of the random split, this corresponds to increases of 133% and 59%, respectively. This agrees well with the RF model features being largely composed of elemental properties (e.g., electron affinity) while containing little structural information: the work function model generalizes better outside the elemental training distribution and worse outside the structural training distribution. Among all splits that leave one element out, the MAEs are significantly larger for holding out F, H, O, or Cl (1178, 959, 708, and 657 meV, respectively; cf. periodic table heat map in ESI Fig. 8†). These elements typically exhibit complex chemical behavior that may not be well captured in other chemistries. Compared to the random-split MAE (94 meV), the MAE increases by only ∼17% for structural, compositional, and chemical-system splitting (all three have an MAE of ∼110 meV for nested splits). This surprisingly small increase in MAE may be explained by the work function strongly depending on the element present in the topmost surface layer – hence, as long as an element is present in any chemical system (or composition) in the train set, the RF model is able to learn the elemental trend for the work function and can then extrapolate well to an unseen chemical system. The average MAE increases to 218 meV when holding out entire groups of the periodic table, compared to 149 meV when holding out individual elements. Similarly, the MAEs increase when holding out point groups (227 meV) and crystal systems (272 meV) compared to holding out space groups (219 meV). ESI Fig. 6† displays the parity plots, MAE trends, and residuals for these additional hold-out strategies.
Similar to Fig. 2d, the residuals |ϕ − 〈{ϕ̂}L〉| are plotted against the standard deviation of the work function predictions over model ensembles, σ({ϕ̂}L), in Fig. 3d. Interestingly, the averages of the residuals within a given bin of σ({ϕ̂}L) (red markers) tend to have a slightly greater slope than the y = x parity line (cyan). The overconfidence of this bootstrapped uncertainty metric appears to be typical of tree-based models using hand-engineered features and therefore requires re-calibration3 such that the expectation value of the residual for a given σ bin is closer to parity.
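One simple way to perform such a re-calibration (a sketch of a generic scaling approach with assumed array names, not necessarily the scheme of ref. 3) is to fit a scalar factor on a held-out calibration split so that the rescaled σ matches the residuals on average:

```python
import numpy as np

def fit_sigma_scale(sigma_cal, residual_cal):
    """Least-squares scalar a minimizing ||residual - a * sigma||^2 on a
    calibration split; apply sigma_recal = a * sigma at inference time."""
    return float(np.dot(sigma_cal, residual_cal) / np.dot(sigma_cal, sigma_cal))
```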
Fig. 4b shows the dataset size dependence of the MAEs and their standard deviations for the work function dataset. A roughly linear decrease in the MAEs is observed with a logarithmic increase in the dataset size for the splitting strategies CK = {Random, Structure, Composition, Chemsys}. The standard deviations of the MAEs typically decrease with increasing dataset size. However, for the splitting strategies CK = {SG#, Elements}, the MAEs start to plateau with increasing dataset size, indicating that additional data may no longer improve the RF model's capability to accurately infer OOD samples.
Fig. 5 shows (K = 5, L = 5)-fold test set predictions for CK = {Structure, SG#, Elements} and (K = LOO, L = 5)-fold test predictions for CK = Elements. Fig. 5a and b show the difference in performance between a GNN approach (using a crystal graph convolutional neural network, CGCNN) and an RF approach (using Magpie features), as detailed in MatBench.9 Compared to the CK = Structure baseline, the GNN generalizes much better to unseen structural motifs (CK = SG#) than to unseen elements (CK = Elements), with a 9% vs. 41% MAE increase, respectively, which is qualitatively consistent with observations in the ΔHV exemplar. The RF model shows an even larger 18% MAE increase for CK = SG# compared to its own CK = Structure baseline. For either modeling approach, the CK = Elements (LOO) splits have an 〈{MAE}K〉 that is notably larger than the CK = Structure baseline, while the median test set error is comparable; the difference is driven by a large skew from some elemental test sets that are predicted very poorly.
Fig. 5b demonstrates that inner nested ensembles of the structure-based model (GNN) produce an uncertainty metric that is close to parity with the residual on average (i.e., 〈Residual|σ〉 ≈ σ), while for the composition-based model this metric tends to be overconfident (i.e., 〈Residual|σ〉 > σ). Interestingly, this observation is consistent with the previous exemplars, despite the significant differences between the ΔHV, ϕ, and log(KVRH) datasets and modeling tasks. Nonetheless, the uncertainty metric remains only very weakly correlated with the individual residuals (cf. R2 values stated in the respective figures).
Finally, Fig. 5c presents a parity plot of the residuals from CK = {Element (LOO), SG#} vs. the residuals from CK = Structure for each dataset and model, corresponding to D = 1.0, T = None, and S = (K, L). Notably, there can be strong correlations (R2 > 0.7) between them: many large prediction errors arise simply because the structure itself is OOD with respect to all other structures in the dataset, and not specifically because it is OOD with respect to the stricter splitting criteria. However, the models/datasets with the lowest R2 values arise because the test set residuals for CK = Element (LOO) or CK = SG# are much larger than the residuals for CK = Structure. Here, the stricter splitting criterion enforces a genuinely OOD test prediction, which would otherwise be trivial because the structure is too similar to another in the dataset.
Some intuitive similarities and differences in MatFold CV trends can be observed between different modeling approaches and data domains, as demonstrated in our three exemplars. As expected, in these exemplars purely random splits provide the most biased underestimation of the expected MAE, but the evolution of the expected MAE with increasingly strict splitting criteria is heavily dependent on the modeling approach and data domain. For the GNN predictions of ΔHV (a direct crystal structure input model), the expected MAE on structures with unseen elements is nearly double that on structures with unseen space groups, yet the opposite is true for the RF predictions of ϕ (a hand-engineered feature input model). The ΔHV GNN models therefore generalize better to unseen structural motifs than to unseen chemistry, the exact opposite of the ϕ RF models. Because they are trained on smaller datasets and are even more sensitive to local structural features (e.g., when the vacancy defect site itself corresponds to the held-out element), the expected errors of these local property models showed higher sensitivity to the splitting criteria than a global property model trained on a much larger dataset (the MatBench bulk modulus). This further highlights the need for performing the comprehensive and automated CV splitting analysis enabled by MatFold in any given study to gain a detailed and unbiased perspective on a materials discovery model's generalization limitations.
The ΔHV GNN predictions also benefit substantially from bootstrapped model ensembling to reduce over-fitting and mitigate outliers in test set prediction parity, while no such benefit is observed for the ϕ RF models. Consequently, we observed the need to re-calibrate the bootstrapped uncertainty metric derived for the ϕ RF models, but not for the ΔHV GNN models. This observation also holds for the GNN vs. RF models trained for the same global property log(KVRH) modeling task. It should be noted that re-tuning the hyperparameters during model ensembling could further reduce over-fitting but comes at a large computational cost (e.g., tuning 2 hyperparameters with 10 possible values each would already require training 100 times more models). Finally, in both the ΔHV and ϕ exemplars, we generally observe continued improvement in model performance with more training data for moderately difficult OOD inference (e.g., structure, composition, or chemsys splits). However, for their weakest inference task (Elements for the ΔHV GNN and SG# for the ϕ RF models), neither is likely to improve further with additional data, indicating fundamental limitations of the respective model architectures.
We anticipate that the splitting criteria and other functionality introduced by MatFold will lower the barrier to better and more automated CV of data-driven materials models. Practitioners will gain a better understanding of their expected accuracy for materials discovery under increasingly difficult OOD inference, regardless of their modeling approach, because MatFold CV splits depend only on the materials themselves and are entirely featurization-agnostic. This will also enable deeper insights into materials discovery performance spanning differing modeling approaches and data domains and, if widely adopted, provide more grounded evidence for which modeling approaches may be more appropriate in various materials discovery situations.
Footnote
† Electronic supplementary information (ESI) available: The ΔHV dataset is provided in the supplementary_files_defects.zip. Additional CV analysis showing inference performance for additional hold-out strategies. MAE heatmaps and parity plots for leave-one-element-out splits. Quartile boxplots of model performance metrics. See DOI: https://doi.org/10.1039/d4dd00250d