Daniel
Speckhard
*ab,
Tim
Bechtel
ab,
Luca M.
Ghiringhelli
c,
Martin
Kuban
a,
Santiago
Rigamonti
a and
Claudia
Draxl
ab
aPhysics Department and CSMB, Humboldt-Universität zu Berlin, Zum Großen Windkanal 2, 12489 Berlin, Germany. E-mail: claudia.draxl@physik.hu-berlin.de; dts@physik.huberlin.de; Fax: +49 2093 66361; Tel: +49 2093 66363
bMax Planck Institute for Solid State Research, Heisenbergstraaae 1, 70569 Stuttgart, Germany
cDepartment of Materials Science and Engineering, Friedrich-Alexander Universität Erlangen-Nürnberg, Dr.-Mack-Str. 77, 90762 Fürth, Germany
First published on 11th July 2024
Big data has ushered in a new wave of predictive power using machine-learning models. In this work, we assess what big means in the context of typical materials-science machine-learning problems. This concerns not only data volume, but also data quality and veracity as much as infrastructure issues. With selected examples, we ask (i) how models generalize to similar datasets, (ii) how high-quality datasets can be gathered from heterogenous sources, (iii) how the feature set and complexity of a model can affect expressivity, and (iv) what infrastructure requirements are needed to create larger datasets and train models on them. In sum, we find that big data present unique challenges along very different aspects that should serve to motivate further work.
Artificial intelligence is arguably the most rapidly emerging topic in various research domains, with increasing impact in materials science as well. The success of AI approaches, however, strongly depends on the amount and quality of the underlying training data. In this context, the first obvious question is how large a dataset needs to be in order to provide sufficient information for the research problem to be solved. Are millions of scientific results, as available in international databases, a gold mine? Or do these collections still not provide enough significant data for a specific question? Or can we even learn from small datasets? Are large, multipurpose databases sufficient for training models, or to what extent are dedicated datasets needed for this task? Can errors be controlled when using data from different sources? Can we learn from experimental and theoretical data together, or are even different theoretical or experimental data by themselves too heterogeneous to produce reasonable predictions?
What is useful for a particular learning task may depend heavily on the research question, the quantity to be learned, and the methodology employed. This concerns data quality, data interoperability – especially when data from different sources are brought together – data veracity, and data volume (the last two are part of the “Four V’s of big data”1). In this work, we ask the question: “What does big mean in the realm of materials science data?” There are many issues related to this, demonstrated by a few examples: first, some methods are more data hungry than others. While, for instance, symbolic regressors may lead to reasonable results already for a small data set,2 neural networks (NNs) may require many more data points3 for training to be stable. Second, even with what can be considered big datasets, training models that generalize to similar datasets is not trivial. Third, data quality may have a significant impact. While high-quality data may give one a clear picture from the very beginning, many more noisy data may need to be accumulated for obtaining robust results, to avoid wrong conclusions and/or allow for physical interpretation. Finally, what does big data mean for data infrastructure? What are the requirements for processing and storing big datasets; how computationally intensive is training ML models on them?
In this work, we address such questions and evaluate the performance of different AI models and tools in terms of data volume, diversity and quality, and provide a quantitative analysis. As the overall topic is very wide and diverse, we aim to draw the reader’s attention to it and initiate discussions rather than to provide detailed solutions to all of the related issues. The chosen examples should serve this purpose. In the first one, we ask whether a model obtained from a message-passing neural network (MPNN) can be transferred to a similar dataset. Second, similarity metrics and sorting are applied to understand data quality in terms of the convergence behavior of density-functional theory (DFT) data. Third, the effect of an expanded feature set on the expressivity of cluster expansion is demonstrated. Fourth, the complexity of different model classes is explored in terms of their performance. Finally, infrastructure requirements for creating large materials datasets and for training models with many parameters are quantitatively analyzed.
The AFLOW dataset is chosen as described in ref. 8 and filtered to use only calculations performed with the PBE functional, which contain both bandgap and formation energy. The MP dataset is retrieved from ref. 9 (dataset snapshot denoted as MP-crystals-2018.6.1). After filtering, the two datasets comprise 61898 and 60289 materials, respectively. We proceed by assigning unique identifiers to the structures in the AFLOW and MP datasets. They consist of the chemical formula concatenated with the space group of the crystal, similar to ref. 10, to define which structures are common to both databases and which are unique to either of them. This results in 50652 unique composition–spacegroup pairs in the AFLOW dataset and 54438 in the MP dataset, with 8591 pairs being common to both datasets.
To avoid leakage of information from the training set to the test set, we ensure that the same common structures are present in the training-test splits of both datasets. This means that a model trained on AFLOW data will not be used for inference on MP test data on the same structure it was trained on, and vice versa. For instance, all structures with composition Mg2F4 and spacegroup 136 get the label Mg2F4_136. If there are several of these structures, e.g., with different lattice parameters, all of them get assigned to one split, i.e., training, validation, or test.
We train MPNNs with edge updates, as described in ref. 11, on the two individual datasets as well as the combination thereof. Previously, we have shown that the hyperparameters that define the architecture of the MPNN transfer well to different prediction tasks.8 Following this, we perform no further hyperparameter optimization. The performance of the models is shown in Fig. 1. The plots along the diagonal show model performance when training and test data are from the same database. Let us focus first on AFLOW and MP data (2 × 2 submatrix on the bottom left). Both models trained and evaluated on the individual databases are very good, with mean absolute errors (MAEs) of 30 meV per atom and 23 meV per atom, respectively. However, the errors in the prediction of the other database are an order of magnitude larger. Notably, there is also a clear trend of too large (small) predictions in the MP-model predictions on AFLOW (AFLOW-model predictions on MP) data. This points to a systematic offset between the energies of the two datasets. Indeed, we can partly trace this discrepancy back to the relatively small overlap of the spacegroup–composition pairs in the two databases, which is about 16% only. Further analysis reveals that MP data contain many more materials with lower formation energies than the AFLOW data, as evident from the top panel of Fig. 2. The distribution of the shared structures (bottom panel of Fig. 2) indicates that the latter have overall a smaller impact, i.e., do not lead to a systematic offset. However, we see that there are many more entries of the same structures, e.g., with different lattice parameters, in the AFLOW dataset. For example, Ti2O4 (spacegroup 136) and BaTiO3 (spacegroup 221) occur 127 and 115 times, respectively, while only one of each compound is found in the MP data.
The lack of interoperability of the two datasets is also demonstrated when training a model on the combined dataset (top right panel in Fig. 1). Although there is an improvement over the models transferred to the respective other individual dataset, it performs worse than the models that are trained and tested on either of them (bottom left, center middle). We assign the discrepancy to differences in computational details, such as Brillouin-zone (BZ) sampling, basis-set cutoff, convergence criteria, etc.
Even for what we might consider to be big datasets in materials science, and using a rather simple ML target, this example shows us that the models we train are only valid for the data they were trained on and struggle to generalize. In our specific example, the AFLOW and MP datasets turn out to sample the underlying material space differently, since the MP materials appear biased towards lower formation energies. We conclude that these two databases are not “big” enough in the sense that they are not diverse enough to be able to make predictions across a wider range of diversity. Training on the combined dataset, the predictions are less biased, i.e., the errors are more symmetric with respect to the diagonal. However, the MAE is higher overall as compared to the individual models. This indicates that also differences in computational settings may play a role and may need to be considered as input features.
To illustrate this, we make use of data from the NOMAD data collection.16,17 These calculations were carried out with the DFT code FHI-aims18 as part of a systematic study of the impact of computational parameters on DFT results.19 For our analysis, we use the calculations for hexagonal boron nitride (h-BN), specifically, the DOS of ground-state calculations obtained with different basis-set sizes and k-point samplings at the experimentally determined equilibrium volume. To compare the results of these calculations, we compute the spectral fingerprints20 of the DOS in the energy range between −10 and 10 eV around the valence band maximum. In Fig. 3, it is exemplified how the similarity between two spectra is obtained. The left panels show the DOS obtained by two calculations with different numbers of basis functions (Nb), but otherwise identical settings. They are converted into spectral fingerprints (middle panels), i.e., raster-image representations of the original spectra. The similarity (right panel) is calculated using the Tanimoto coefficient,21 which is the overlap of the areas covered by the individual fingerprints (red and blue) divided by the total area covered by both of them (purple). The similarity matrix shown in Fig. 4 contains all pairwise similarities. The rows and columns are sorted by increasing numerical settings, separately for two xc functionals, the local density approximation (LDA, indices ≤70) and PBE (indices >70). The bottom panel shows the number of k-points (Nkpt) and Nb. For each k-point mesh (plateau in the k-mesh), the basis set is increased in the same way. Additionally, the data are sorted by a set of numerical settings, called “light”, “tight”, and “really tight” in FHI-aims. Finally, consecutive calculations with otherwise identical settings vary by the relativistic approximation employed for core electrons, i.e., “ZORA”, and “atomic ZORA”.18 Sorting the matrix in this way, we see patterns appearing in the figure, which we discuss below. As a guide to the eye, dotted lines inside the matrix indicate sets of calculations with the same number of k-points. To simplify the discussion of the results, individual blocks are labeled with letters.
Focusing first on the convergence of the LDA data (indices i ≤ 70), we observe a clear block structure, where the calculations of the first set, i.e., those with the lowest Nkpt (index i ≤ 23) are most dissimilar to all others (blocks d and f in Fig. 4), and also to each other (block a). However, they are pairwise similar, indicating that the relativistic approximation plays a minor role in the convergence of the DOS. Sets of calculations with medium (24 ≤ i ≤ 47) and high (48 ≤ i ≤ 70) numbers of k-points generally show higher similarity among themselves (blocks b and c). Noticeably, calculations with a low (medium and high) number of basis functions Nb are similar among different BZ samplings (block e), but less similar to calculations with higher (lower) Nb. For PBE calculations i ≥ 71, the convergence behavior is somewhat different: Already calculations with low Nkpt reach medium similarity to calculations with high Nkpt (j and l). Calculations with the same Nb are similar to each other, already for low Nkpt (block g). Similar to the LDA case, PBE calculations with medium Nkpt reach high similarity to those with highest k-point density (block k). In the remaining diagonal blocks of the PBE data, i.e., blocks h and i, it can be seen that the similarity between calculations with the same Nb is high and varies only slightly. Further comparing the similarity of LDA and PBE convergence, i.e., the off-diagonal blocks m and n of the matrix (indices i ≥ 71 on the x-axis and i ≤ 72 on the y-axis), we find a pattern corresponding to calculations with the same number of basis functions. It shows that calculations with different xc functionals are more similar to each other if the same Nb is used.
Our example illustrates the veracity of materials data arising from the large combinatorial space of approximations and numerical parameters employed in DFT codes. While LDA and PBE, both being semi-local functionals, are generally expected to give similar results in terms of the electronic structure, we show here that the convergence behavior with the number of k-points and basis functions is slightly different. The visualization of DOS similarity matrices allows one to understand and quantify differences between DFT data computed with different computational settings. It also enables the creation of rules to group data in order to gather “homogeneous” subgroups within a dataset.
(1) |
(2) |
In this section, we want to explore the expressiveness of the feature spaces in standard and nonlinear CE. Assuming that the intrinsic error in the data to be modeled is small, does our feature space allow us to approximate the ground truth? We use a dataset of 235 oxide perovskites with composition BaSnO3−x, with x < 1 being the number of oxygen vacancies per formula unit. BaSnO3 has significant potential for use in photocatalytic and optoelectronic applications. Its electronic structure demonstrates a complex dependence on the concentration and configuration of oxygen vacancies. The bandgap varies strongly, even among structures with identical concentration and comparable energies of formation. Thus, modeling the bandgap is challenging for linear regression models. For this learning task, the clusters account for specific structural patterns of O vacancies.
As shown in Fig. 5, standard CE models, obtained via orthogonal matching pursuit (OMP) from a pool of p = 27 clusters, yield fit root-mean-squared errors (RMSEs) larger than 250 meV (blue line, top-left panel). The best fit is obtained, as expected, when using all 27 features (marked by a vertical dashed line). For the pool of p = 27 clusters, we explore polynomials of degree d = 2 and d = 3, containing 405 and 4059 features, respectively, obtaining models with up to 200 features using OMP. These are shown by the orange and red lines in the top-left panel of Fig. 5. In addition to the expected monotonic reduction of the error, two notable findings are obtained: first, with 27 features, the nonlinear feature space yields better models than the standard CE with the same number of 27 clusters. This is remarkable because the nonlinear features of the nonlinear CE models are derived solely from the 27 cluster correlations used in the standard CE model. Related to this point, we also observe that for the same number of features, degree 3 always gives smaller error than degree 2. Second, both nonlinear CE models reach a point where no further improvement is obtained upon increasing the number of features. This is observed as a plateau above ∼150 features for d = 3 and above ∼180 features for d = 2. The plateau for d = 2 is reached at a higher level of the RMSE than for d = 3. Thus, under the assumption that the intrinsic error is small, increasing the complexity of the feature space allows the CE to better approach the underlying ground truth. In practice, increasing the complexity of the feature space allows CE to obtain the same accuracy with sparser models as is obtainable with less complex feature spaces. This is important for model selection methods that favor sparsity, such as LASSO. The top-right panel of Fig. 5 shows the quality of the predictions for the whole dataset using the best possible model for standard CE (blue dots) and the best possible model for d = 3 nonlinear CE, based on the same 27 cluster correlations. The improvement of the nonlinear modeling is remarkable, especially for the difficult case of metals (the set of materials with Eg = 0).
Now, the question arises whether the beneficial effect of nonlinear features could be also obtained by adding more cluster correlations, that is, accounting for more structural patterns of vacancies. For this, we have created a pool of 176 clusters. The result of standard CE is shown in the lower left panel of Fig. 5 (blue line). Like before, we see a plateau setting in at about 110 clusters with a RMSE of ∼125 meV, and no further improvement can be achieved upon adding more clusters. The vertical dashed line indicates the model with the largest possible number of clusters in this pool. Again, the second-order nonlinear CE based on this same pool allows for obtaining better fits at all numbers of features. A comparison of the quality of the fits for standard and nonlinear CE models with 150 clusters, respectively, is shown in the lower-right panel of the figure. Similarly to what was obtained for the small pool of 27 clusters, the nonlinear CE yields significantly better predictions.
From this example, we see the feature set can have a strong impact on the expressiveness of the CE model. Including nonlinear terms enables the CE to better fit the underlying ground truth even with a small data set size. This benefit appears irrespective of the number of cluster correlations. This improvement motivates us to try to better define model complexity, which is done in the following section.
(3) |
The SISSO model first combines primary features into generated features using a set of mathematical operations (e.g., +, −, exp(), √,..., and combinations thereof).25,27 The number of features that are selected by the algorithm is called the dimension of the model. We can interpret this in terms of a (symbolic) tree that describes a generated feature, where each split in the tree is an operation and the leaves are the primary features.27 The larger the depth of the tree, the more operations there are in the generated-feature space. The maximum tree depth is called the rung of the model. In essence, the number of selected generated features (model dimension) defines how many coefficients the model has. One can also include a single constant bias term in SISSO. With a larger rung value, the possible secondary features explode combinatorially. Therefore, we can define a descriptor for the complexity in SISSO with a two-dimensional vector, comprising the rung and the dimension of the model (counting the bias term if present as an extra dimension).
How can we define complexity for RFs? RFs are piecewise constant functions. For a regressor, each leaf node in each decision tree learns a constant bias term. For each split in each decision tree, a value of the variable to be split on must be learned. Since the trees are binary, the number of leaf nodes is one greater than the number of splits. Therefore, the number of leaf nodes is an integer that tells us the number of trainable parameters in the model and describes the complexity of the model.
From each of these three model types, we have presented a measure of model complexity. The model complexity for each model type is described with a different number/vector. For NNs, it is a two-dimensional vector of the total number of weights and biases; for SISSO, it is a two-dimensional vector reporting the rung and dimension of the model, while for RFs, it is the number of leaves in the model. The definitions are not unique, but offer an idea of how well the model can learn to approximate a wide variety of functions.
In the second example, an NN and an RF are trained to predict bandgaps on the AFLOW dataset of ref. 8. Features describing the elemental composition of the materials, as described in ref. 29, are fed into both models. For the same parameter count, one can argue that an NN with the ReLU activation function is more complex than the RF, since the RF model is piecewise constant and the NN is piecewise linear. However, both NNs and RFs are universal approximators, meaning that with enough splits/nodes, they can approximate any probability distribution. On the AFLOW bandgap dataset, after cross-validation, the RF outperforms the NN, with test MAEs of 469 meV and 515 meV, respectively. This trend holds when we restrict the total number of parameters of both models to be similar. Once again, model class, not model complexity, is the decisive factor of model performance.
We can conclude from both examples that the number of trainable parameters alone tells us very little about model performance. Rather, we find that for a given dataset, the performance depends on the model class.
C(h) = a × A(h) + b × L(h) + c × N(h) | (4) |
These examples demonstrate that the total parameter count is not sufficiently descriptive of the model’s ability to learn to approximate a wide variety of functions for many model classes. We offer an alternative metric for model complexity as a weighted sum of constant, linear, and nonlinear terms. We conclude, however, that the meaning of complexity is model dependent. This motivates us to therefore only talk about complexity within a single model class. More work is needed to explore these topics more carefully from an information theoretic point of view.
Such workflows are typically executed on supercomputers, which impose limits on users concerning disk space and numbers of files. Storage limits are, e.g., on the order of a few 100 GB in backed-up directories, or a few TB in scratch directories. There are also limits in terms of the number of files that can be stored (often at the most 1 million). Both limits often mean that the users need to compress their data periodically and transfer their files to a storage system outside of the supercomputer network. In this example, the data are transferred to a NOMAD Oasis.35,36 NOMAD Oasis is the same software as in the public NOMAD data infrastructure,37 including, among other tools, parsing, normalizing, electronic notebooks, and the NOMAD AI toolkit.38 The software can be installed locally by any research group and can be configured to their needs. Calculated data, as described above, are stored into an interoperable format.39,40
In summary, calculating big data presents its own unique infrastructure challenges. The workflow example shown here quantitatively demonstrates that performing high-throughput DFT calculations requires sophisticated hardware/software solutions to navigate HPC file and storage restrictions. The NOMAD Oasis is one such solution to this problem, which also allows for making data ready for ML purposes.
(5) |
Fig. 7 shows how the performance of the MPNN architecture described in ref. 11 and trained on AFLOW bandgaps depends on these two parameters. Here, the NAS reward function is the negative of the validation RMSE. Each data point in the figure represents an individual MPNN architecture. The z-axis represents this architecture’s best validation RMSE across different initial learning rates, learning decay rates, batch sizes, dropouts, and readout functions. We see that the validation RMSE as a function of the two architecture parameters is non-convex. This non-convexity is what makes the optimization of the architecture slow and why black-box optimization methods such as reinforcement learning, random search, or evolutionary search are necessary.46 Whether the NAS reward function is non-convex depends, of course, on the dataset and the search space. The mean (median) validation RMSE of a NAS candidate architecture across the entire space, not just these two parameters, is 556 meV (553 meV). The best candidate, which has a latent size of 128 and uses 3 MP steps, has a validation RMSE of 474 meV. This means that the optimal NAS architecture yields an improvement of 80 meV over the mean architecture in our search space, or in relative terms, 14.7%. To give an example, for finding solar cell materials with a bandgap between 1.1 and 1.5 eV, an 80 meV improvement in the model prediction may not be vital but would certainly be desired.
Let us proceed to analyze the computing requirements for a concrete NAS example. The SchNet graph neural network has ca. 85000 trainable parameters for a latent size of 64 and three message-passing steps (when trained on a dataset with 81 different atomic elements).47 On a single NVIDIA V100 GPU, it takes about two hours to train 2000000 steps on the AFLOW bandgap dataset (batch size 32). A further breakdown of the time required per training step shows that the model takes 0.0010 seconds on average to batch the data and 0.0023 seconds to perform a gradient-update step. For the SchNet-derived NAS architecture search space, approximately 2000000 steps are needed to have an idea of whether a candidate architecture is promising or not. This means that to train 2000 architectures, approximately 40000 GPU hours for this dataset are needed, or in other words, more than 4.5 years on a single V100 GPU. Converting this number to dollars, current Amazon Web Services pricing plans fluctuate around 3 USD per NVIDIA V100 GPU hour.48 Thus, a budget of roughly 120000 USD is needed to perform this NAS. If we use a larger model than SchNet, such as PaiNN49 (600000 vs. 85000 parameters), the training time and subsequent GPU and dollar budget would increase significantly. Moreover, considering larger datasets than AFLOW (about 60000 data points, see Section 2), training would also slow down dramatically.
We can also compare CPU versus GPU training for this example. On a single CPU core, the training of 2000000 training steps of the SchNet model takes 10.5 hours, or on average approximately 0.0012 seconds each training step to batch the data, and 0.0177 seconds to take a gradient step and update the parameters. We conclude that CPU training is nearly an order of magnitude slower.
As NAS becomes more ubiquitous in materials science, as it has in other fields, new infrastructure challenges will have to be addressed. Finding the optimal model is often a non-convex task and can belong to large multidimensional search spaces that require black-box optimization methods. For models with a relatively small number of parameters, an MPNN NAS can require a huge computational and therefore monetary budget to be run on medium-sized to large datasets. As datasets and the base models being trained grow larger, so too will the infrastructure requirements.
Efforts to create big datasets will require advanced technical and software solutions. For example, computing only a fraction of the above mentioned datasets causes file and storage issues. In addition, as datasets grow, so does the number of model parameters. MPNN and other variants of graph neural networks have become quite common in materials science. Finding the best MPNN model requires a neural-architecture search over an often non-convex search space. GPUs offer an order of magnitude speedup in training, but using enough GPUs to perform a single NAS search over a large dataset requires a significant monetary budget.
Databases also pose challenges related to data veracity. For instance, data points computed for the same material may differ in the input settings. Sorting or clustering data using similarity metrics can help users better understand the quality of data. It also enables ML practitioners to filter data into homogeneous subsets that contain less variation in the target properties. Multi-fidelity modeling,13 where hetereogeneous data can be used by the model, is not discussed here, but would be an alternative option for dealing with data veracity.
Big data presents an opportunity to use complex models. Complexity is shown to be well defined within the confines of a single model type. A general definition for comparative purposes is, however, lacking. The total trainable parameter count, or the number of constant, linear, or nonlinear terms, can be less important than the model class when predicting performance. Increasing complexity, however, as shown with the example of using nonlinear features in cluster expansion, can aid significantly in describing intricate physics. More research is needed to better define model complexity in a data- and model-independent manner.
In this work, we attempt to shine light on some aspects of big data in materials science. There are certainly many more aspects to be explored. We hope that the issues we illustrate here will motivate further research along these lines.
This journal is © The Royal Society of Chemistry 2025 |