Kareem
Abdelmaqsoud
a,
Muhammed
Shuaibi
b,
Adeesh
Kolluru
a,
Raffaele
Cheula
ac and
John R.
Kitchin
*a
aDepartment of Chemical Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, Pennsylvania 15213, USA. E-mail: jkitchin@andrew.cmu.edu
bMeta Fundamental AI Research, Menlo Park, California 94025, USA
cDepartment of Physics and Astronomy, Aarhus University, Nordre Ringgade 1, 8000 Aarhus C, Denmark
First published on 14th August 2024
Machine learning potentials (MLPs) have greatly accelerated atomistic simulations for material discovery. The Open Catalyst 2020 (OC20) dataset is one of the largest datasets for training MLPs for heterogeneous catalysis. The mean absolute errors (MAE) of the MLPs on the energy target of the dataset have asymptotically approached about 0.2 eV over the past two years with increasingly sophisticated models. The errors were found to be imbalanced between the different material classes with non-metals having the highest errors. In this work, we investigate several potential sources for the imbalanced distribution of errors. We examined material class-specific convergence errors in the density functional theory (DFT) calculations including k-point sampling, plane wave cutoff and smearing width. Significant DFT convergence errors with a mean absolute value of ∼0.15 eV were found on the total energies of non-metals, higher than the other material classes. However, as a result of cancellation of errors, convergence errors on adsorption energies have a mean absolute value of ∼0.05 eV across all material classes. Moreover, we found that the MAEs of the MLPs are not affected by these convergence errors. Second, we show that calculations with surface reconstruction can introduce inconsistencies to the adsorption energy referencing scheme that cannot be fit by the MLPs. Nonmetals and halides were found to have the highest fraction of calculations with surface reconstructions. Removing calculations with surface reconstructions from the validation sets, without re-training, significantly lowers the MAEs by ∼35% and reduces the imbalance of the MAEs. Alternatively, MLPs trained on total energies provide a solution to the surface reconstruction inconsistencies since they eliminate the referencing issue, and have comparable MAEs to MLPs trained on adsorption energies.
The generalization ability of the MLPs is heavily dependent on the size and the diversity of the datasets used for training.13 The Open Catalyst 2020 dataset (OC20) is an example of a large and diverse dataset that is used to train MLPs for heterogeneous catalysis.14 The dataset was gathered by running 1.28 million DFT relaxations with ∼260 M single-point calculations of energy and forces. OC20 spans 55 unique elements that make up unary, binary and ternary materials and 82 adsorbates. The materials in the dataset can be categorized into intermetallics, metalloids, non-metals and halides based on the elements they contain.15 The OC20 dataset presented several tasks for benchmarking the accuracy of the MLPs. The most general task evaluates the ability of an MLP to predict the energy and the per-atom forces of a given material at any configuration along the relaxation trajectory. This task is referred to as structure to energy and forces (S2EF). The energy target is referenced following an adsorption energy referencing scheme according the eqn (1). The DFT total energy of the surface + adsorbate system (Esystem) can be the relaxed or intermediate structures along the relaxation trajectory. The reference energies of the relaxed surface (i.e., slab) without an adsorbate (Eslab) and the energy of the adsorbate in the gas phase (Egas) are subtracted from the system total energy to yield the adsorption energy. Following the referencing scheme used in the OC20 dataset,14 the value of Egas for each adsorbate was computed as a linear combination of N2, H2O, CO, and H2 atomic energies.
Eadsorption = Esystem − Eslab − Egas | (1) |
Fitting results for a series of increasingly sophisticated MLPs13,16–19 are shown in Fig. 1. Initially, the energy MAE of the different model architectures submitted decrease significantly by over 50% from OC20's initial release. However, the energy MAE of the models appears to plateau at ∼0.2 eV for the past two years. The target accuracy is 0.1 eV which is the estimated error level of the DFT calculations.20 One hypothesis that attempts to explain this plateauing behavior is that there are non-systematic errors that come from the DFT calculations used to generate the dataset that prevent the models from achieving lower errors. As shown by Kolluru et al.,15 the ML errors on nonmetals are the highest while the errors on intermetallics are the lowest. Nonmetals are thus hypothesized to have a larger error distribution compared to intermetallics and other material classes. In this work, we investigate the DFT settings and the adsorption energy referencing scheme used in the dataset as possible sources of this error imbalance.
Fig. 1 Mean absolute errors (MAEs) of the different MLPs submitted for evaluation on the structure to energy and forces (S2EF) task of the OC20 dataset.21 The MAEs are computed between the predicted energy of a given material at any configuration along the structure relaxation trajectory and the corresponding DFT energy. The reported values are the average MAEs across the in-domain and out-of-domain OC20 test sets. The MAEs appear to plateau at ∼0.2 eV for the past two years, signaling an inherent error distribution that can not be fit by the MLPs. |
While the DFT settings are often curated on a system-by-system basis, the scale and diversity of OC20 make such a task infeasible. As a result, OC20 uses a conservative set of settings with appropriate trade-offs between accuracy and efficiency. This work investigates the choice of the DFT settings used to generate the OC20 dataset as a possible source of this error imbalance. The OC20 dataset was run at the same DFT settings of all material classes. Although these settings might lead to converged calculations for intermetallics, they might not lead to converged calculations for nonmetals. The non-convergence of the calculations of the nonmetals with respect to the DFT settings could lead to non-systematic errors that are difficult to learn by the ML models, leading to inconsistently higher errors for nonmetals.
Beyond DFT convergence errors, DFT calculations with surface reconstruction can introduce inconsistencies in the adsorption energy referencing scheme used in the dataset that cannot be fit by the MLPs. The presence of an adsorbate on the surface during a relaxation can induce surface reconstruction that might not be observed in the case of the relaxing the surface without the adsorbate. To make an accurate adsorption energy calculation, the corresponding relaxed slab reference needs to be identical or similar to that of the relaxed adsorbate + slab. Therefore, whenever there is surface reconstruction, the slab reference is no longer a consistent reference, an ill-posed problem for the MLP. Thus, we hypothesize that removing these surface reconstructions would remove inconsistencies from the dataset and thus reduce ML errors.
In this work, we first show that the DFT convergence errors on total energies are significant and non-metals have total energy convergence errors with MAE of ∼0.15 eV. However, due to the cancellation of errors, we find the convergence errors on the adsorption energies are less than 0.05 eV across all material classes. Therefore, the OC20 dataset adsorption energies are converged with respect to the k-points sampling, plane-wave energy cutoff and smearing width DFT settings. Second, we show that removing the systems with surface reconstructions from the validation sets without re-training significantly decreases the ML energy MAEs by ∼35%. Lastly, we show that the total energy models can offer an alternative to adsorption energy models since they eliminate the referencing issue of calculations with surface reconstruction and reduce the imbalance of the MAEs between the different material classes. Moreover, due to cancellation of ML errors, total energy models have comparable MAEs to adsorption energy models.
In the Open Catalyst 2020 (OC20) dataset, the KPOINTS mesh was set to be inversely proportional to the unit cell size and is calculated based using eqn (2). The symbol a0 denotes the norm of the first lattice vector of the unit cell, and b0 denotes the norm of the second lattice vector of the unit cell. The method calculates the mesh size by dividing a predetermined multiplier by the norm of the unit cell vectors along the reciprocal lattice directions, effectively adjusting the k-point density according to the dimensions of the structure. The default value of this parameter is 40 which leads to (4 × 4 × 1) grid for a (4 × 4) copper (100) surface for instance. To test the convergence of the KPOINTS setting, we calculate the total energy at the current OC20 value of 40 and a much higher value of 80. Using a multiplier value of 80 leads to (8 × 8 × 1) grid for the copper (100) surface. To determine the convergence errors over KPOINTS, the total energy was calculated at a multiplier value of 40 and was subtracted from the total energy value at a multiplier of 80. Other multiplier values were not explored because there was not a significant difference in the energies calculated between the 40 and 80 multiplier values.
(2) |
To determine the convergence errors over the ENCUT setting, the total energies of the hundred systems were calculated at ENCUT values of 400, 450, 500, and 550 eV. These total energies were then compared to the total energies calculated at an ENCUT value of 350 eV which is the setting used in the OC20 dataset. To calculate the convergence errors over the SIGMA, the total energies calculated at SIGMA values of 0.15, 0.1, 0.05, and 0.01 eV were compared to the total energies calculated at a SIGMA value of 0.2 eV which is the OC20 setting. It was found that decreasing the SIGMA value leads to calculations that are not converged electronically. Therefore, we increased the number of electronic steps to 120 steps whereas the default value is 60 steps. These studies are used to determine the tighter KPOINTS, ENCUT and SIGMA DFT settings. We refer to these settings as the “tighter settings” in this paper.
To test the effect of using the tighter settings on a larger scale, we ran an experiment on the OC20-200k training set and the OC20-30k in-domain validation set. These two sets contain 200k and 30k structures that represent the distribution of materials in the full OC20 training and in-domain validation set respectively. These are S2EF sets which not only include the initial and the final structures but also include intermediate structures along the DFT relaxation trajectory. DFT single-point calculations were run at the tighter DFT settings on these structures. The relaxed slab energies at the tighter settings were obtained by taking the already relaxed OC20 slabs and relaxing them again at the tighter settings to avoid the computational cost of relaxing the slabs from scratch. The adsorbate reference energies in the gas phase were calculated from the atomic energies calculated at the tighter settings. DFT calculations that did not reach electronic convergence at the tighter settings were removed from this experiment. The difference in energy and forces between the data at the original OC20 settings and the tighter settings were quantified as convergence errors. To test the effect of using tighter DFT settings on the MLP MAEs, we compare the MAEs of 1) a Gemnet-OC13 model trained and evaluated on data at the original OC20 settings, 2) a Gemnet-OC model trained and evaluated on data at the tighter settings.
The DFT relaxation calculations were determined to be anomalous using an anomaly detection method developed in the AdsorbML paper.26 This method classifies an anomalous DFT relaxation calculation as having one or multiple of the following anomalies: adsorbate-induced surface reconstruction, adsorbate dissociation and adsorbate desorption. In this section, we focus only on the surface reconstruction detection method. The adsorbate dissociation and desorption anomalies do not affect the adsorption energy referencing and thus are not expected to affect the MAEs as shown in Section 4 of the ESI.† A slab + adsorbate system is determined to have an adsorbate-induced surface reconstruction based on the connectivity of the atoms in the system after being relaxed with and without an adsorbate. A connectivity matrix is constructed for the relaxed slab without an adsorbate. Another connectivity matrix is constructed for the relaxed adsorbate-slab configuration after removing the adsorbate. Atoms are considered connected if there is any overlap of the atomic radii with a small cushion. This cushion is a hyperparameter and was chosen to be a 1.5 multiplier of the covalent radii. If the connectivity matrices of the relaxed adslab surface and the relaxed slab surface are different, the system is flagged as a reconstructed surface.
After detecting the surface reconstructions, we run ML experiments to show the effect of removing these reconstructions from the validation set. We focus the ML experiments on three MLPs: GemNet-OC-large,13 eSCN16 and Equiformer-V2,17 the current top performing models on the OC20 dataset. We randomly selected 30k systems from each of the in-domain (ID) and out-of-domain (OOD) OC20 validation sets. The ID validation set is sampled from the same distribution as the training set. The OOD validation sets contain element compositions that were not seen by the MLPs in the training set. There are three OOD validation sets: unseen adsorbates (OOD Ads), unseen element compositions for catalysts (OOD Cat), and unseen both adsorbates and catalysts (OOD Both). To determine the effect of removing surface reconstructions, we compare the validation energy MAEs of pre-trained MLPs before and after removing the surface reconstructions from the validation sets. All pre-trained ML models used in this work are publicly available at the Open Catalyst Project (OCP) GitHub repository.
To investigate the effect of removing the surface reconstructions from both training and validation sets, we trained two models from scratch with and without the surface reconstructions. This experiment was done on the OC20-200k training set. After removing the surface reconstructions from this split, the size of this split becomes 156k structures. To enable meaningful comparisons, we randomly selected 156k data points from the 200k split to train a model on data that includes surface reconstructions. The performance of the two models trained on data with and without surface reconstructions is evaluated on the OC20-30k in-domain validation set after removing the surface-reconstructed systems.
Furthermore, we compare adsorption energy metrics using a total energy model and a baseline adsorption energy model on 25k in-domain and out-of-domain validation sets. These validation sets are the OC20 IS2RE (initial structure to relaxed energy) benchmark validation sets. For this comparison, we relax both the adsorbate + slab and slab structures using the total energy model. On the other hand, we can only relax the adsorbate + slab structure using the adsorption energy model since it assumes you have access to the DFT relaxed structure of the slab to be used for referencing. The structures were relaxed using ML until all per-atom forces were less than 0.03 eV Å−1 or up to 300 relaxation steps. For the total energy model, the adsorption energies are calculated using eqn (3). The total energy of the relaxed structure of adsorbate + slab structure (EMLsystem) and the energy of the relaxed structure of the slab (EMLslab) are both predicted using the total energy model. The energy of adsorbate in the gas-phase, (Egas), was calculated as a linear combination of N2, H2O, CO, and H2 atomic energies. The MAEs are calculated between the ML predicted adsorption energy energy and the DFT adsorption energy.
Eadsorption = EMLsystem − EMLslab − Egas | (3) |
The convergence errors over SIGMA were calculated by subtracting the total energy calculated at SIGMA values of 0.2, 0.15, 0.1, and 0.05 from the total energy at SIGMA of 0.01 eV as shown in Fig. 3b. The convergence errors with respect to SIGMA are smaller than the ENCUT convergence errors, yet they are significant (up to 0.15 eV). In choosing a SIGMA value, a trade-off was observed where decreasing the SIGMA value reduces the convergence errors but increases the number of calculations that are not converged electronically. Calculations that did not reach electronic convergence were removed from this study. A SIGMA value of 0.1 eV was chosen because it reduced the error distribution while only causing 2.5% of the calculations to be not converged electronically. SIGMA of 0.05 eV on the other hand led to 6% of the calculations being not converged electronically. Thus, the converged DFT settings are k-points multiplier = 40, ENCUT = 500 eV, SIGMA = 0.1 eV.
After determining the converged DFT settings, an experiment was run to determine the magnitude of the convergence errors on a large subset of the dataset. The convergence error results shown in Fig. 4 are computed based on the OC20-200k training split. Systems with convergence errors that lie outside the whiskers of the box plots with energy differences that lie in the top and bottom 5% of the energy difference distribution. These systems are considered outliers and are described in Section 2 of the ESI.† The magnitude of the convergence errors shown before in Fig. 3 is larger because we selected the nonmetallic systems that are expected to have the largest convergence errors. On the other hand, in Fig. 4, we compute the convergence errors on a larger and more representative sample of the full dataset, not just non-metals. The convergence errors in the slab + adsorbate total energies are significant (>0.1 eV) and nonmetals have the highest convergence errors as shown in Fig. 4a. The convergence errors in the slab energies are also significant and nonmetals have the highest convergence errors as shown in Fig. 4b. The slab energies at the tighter settings were computed by taking the relaxed slabs at the original settings and relaxing them again at the tighter settings. This leads to systematically negative slab energy differences (Etighter − EOC20) as shown in Fig. 4b. Due to cancellation of errors, the magnitudes of the convergence errors on the adsorption energy are significantly smaller than those on total energies as shown in Fig. 4c. Moreover, nonmetals have comparable convergence errors in the adsorption energies to other material classes. The adsorption energy differences are shifted to be more positive because of the negative systematic shift of the slab energies at the tighter settings. Overall, the adsorption energies appear to be mostly converged within 0.1 eV with respect to the KPOINTS, ENCUT and SIGMA.
Interestingly, we found that the energy MAE of a model trained on data with the tighter DFT settings is comparable to the MAE of a model trained on the data with the OC20 settings. As shown in Fig. 5a, the total energy MAEs are not significantly affected by using tighter DFT settings. The reduction of halides MAEs is larger than other material classes, but since the percentage of halides in the dataset is small (1.4%), the overall MAEs are not affected significantly. As shown in Fig. 5b, for all the material classes, the adsorption energy MAEs do not change significantly as a result of using the tighter DFT settings. Therefore, the ML errors are not affected by the DFT convergence errors. Moreover, for MLPs trained on the data with the tighter DFT settings, nonmetals and halides still have inconsistently higher errors compared to other material classes. Therefore, it can be concluded that the DFT convergence errors are not the cause of non-metals having inconsistently higher MAEs than other material classes.
The results of this section could be useful to future efforts focusing on generating large-scale DFT datasets. The OC20 dataset, for example, uses a conservative set of settings with appropriate trade-offs between accuracy and efficiency. We show that the cancellation of DFT errors reduces the effect of the convergence errors on the adsorption energy target of the OC20 dataset. However, one should not completely disregard the convergence errors in generating a DFT dataset assuming that the cancellation of errors would always lead to accurate energies. Instead, the effect of convergence errors should be quantified on a sufficiently large and representative subset of the dataset to determine the appropriate DFT settings needed. It is also important to mention that other settings besides the 3 DFT settings studied here could cause convergence issues. One example is the VASP ISMEAR parameter which determines how the partial occupancies are set for each orbital. The Methfessel–Paxton scheme was used in the OC20 dataset. This scheme could cause issues for semiconductors and insulators in the dataset. Therefore, if the MLPs trained on the OC20 dataset are used to study semiconductor or insulator systems, one should investigate the convergence errors of these systems with respect to the ISMEAR parameter.
Table 1 shows the effect of removing the calculations with surface reconstructions from the validation sets, without retraining, on the MAEs. Consistently across all the three model architectures investigated, removing the surface reconstructions significantly reduces the MAEs. This finding supports the hypothesis that the surface reconstructions introduce inconsistencies in the adsorption energy referencing scheme used in the OC20 dataset which cannot be fit by the MLPs. Moreover, the forces MAEs did not change significantly as a result of removing the surface reconstructions as shown in Table S1.† This is expected since these reconstructions only affect the referencing of the energy whereas the forces are not referenced.
OC20-S2EF validation energy MAE [eV] | ||||||
---|---|---|---|---|---|---|
ID | OOD ads | OOD cat | OOD both | Average | ||
GemNet-OC | With reconstructions | 0.164 | 0.191 | 0.286 | 0.353 | 0.248 |
Without reconstructions | 0.100 | 0.136 | 0.168 | 0.234 | 0.160 | |
eSCN | With reconstructions | 0.169 | 0.213 | 0.255 | 0.344 | 0.245 |
Without reconstructions | 0.097 | 0.153 | 0.142 | 0.231 | 0.156 | |
EqV2 | With reconstructions | 0.159 | 0.172 | 0.257 | 0.317 | 0.226 |
Without reconstructions | 0.090 | 0.112 | 0.146 | 0.200 | 0.137 |
Next, we ran an experiment that tested the effect of removing the surface reconstructions from both training and validation sets. A Gemnet-OC model trained on a dataset that includes surface reconstructions has an energy MAE of 0.31 eV whereas a model trained on a dataset without surface reconstructions has an MAE of 0.29 eV when both are evaluated on an in-domain validation set without surface reconstructions. Since there is no significant reduction (∼0.02 eV) in the model errors as a result of removing the surface reconstructions from the training set, retraining the models on datasets without surface reconstructions might not be necessary. This result is similar to what has been observed by Vita et al.27 where they showed that models trained on a dataset with added noise to the energy and force labels have comparable errors to models trained on non-noisy data given the models are trained on large datasets and evaluated on a non-noisy test set. In our case, this added noise is represented by the added reconstruction energy to the adsorption energy label. The models have comparable errors whether they are trained on data with or without this added noise since they are trained on a large dataset and evaluated on a non-noisy dataset.
Removing calculations with surface reconstructions is not an ideal solution since these calculations are valid DFT calculations. It is only the adsorption energy referencing for these systems that is ill-posed. Training models to predict the total energy instead of the adsorption energy eliminates the referencing issue of the calculations with surface reconstructions. Using a model trained on total energy, we show that removing the calculations with surface reconstructions does not affect the total energy MAEs, as shown in Fig. S3.† Therefore, MLPs trained on total energies instead of adsorption energies resolve the issue of calculations with surface reconstructions.
OC20-IS2RE validation energy MAE [eV] | ||||||
---|---|---|---|---|---|---|
ID | OOD ads | OOD cat | OOD both | Average | ||
Total energy model | Adsorbate + slab energy | 0.344 | 0.371 | 0.630 | 0.637 | 0.495 |
Slab energy | 0.206 | 0.207 | 0.530 | 0.541 | 0.371 | |
Adsorption energy | 0.384 | 0.408 | 0.384 | 0.351 | 0.382 | |
Adsorption energy model | Adsorption energy | 0.334 | 0.364 | 0.385 | 0.354 | 0.359 |
Moreover, we find that the adsorption energy MAEs of the total energy model are slightly higher, but still comparable to the MAEs of the adsorption energy model. It is worth re-emphasizing that we relaxed both the adsorbate + slab and the slab structures using the total energy model. Whereas we relaxed the adsorbate + slab structure only using the adsorption energy model and relaxed the slab structure using DFT. Therefore, although the total energy model has slightly higher MAEs, it saves the cost of relaxing the slab structure using DFT. Even after removing calculations with surface reconstructions, we also show that the total energy model has comparable MAEs to the adsorption energy model as shown in Table S3.† The MAEs in Table 2 can be reduced significantly by using an active learning framework such as the FINETUNA29 framework which accelerates the DFT relaxations accurately by leveraging pre-trained MLPs on the OC20 dataset. The framework was found to reduce the number of DFT single-point calculations during a relaxation by 91% while maintaining an accuracy threshold of 0.02 eV in 93% of cases.29 Besides the magnitude of the MAEs, the total energy model shows a more uniform distribution of MAEs across the different classes as shown in Fig. S4.† Therefore, besides resolving the issue with calculations with surface reconstructions, total energy models reduce the MAE imbalance between the different material classes in the OC20 dataset.
This work highlights the importance of considering the cancellation of DFT errors in building large datasets for heterogeneous catalysis applications. It also shows the importance of having a consistent reference scheme for the performance of MLPs. Similarly to DFT, total energy models show cancellation of errors of the adsorbate + slab and slab energy predictions. Before using an adsorption energy model, one should be careful about predicting the energy of systems with surface reconstructions as they can hurt the model performance. Alternatively, total energy models are more robust models to surface reconstructions that still work on par with existing adsorption energy models, albeit with slightly worse performance. Total energy models, however, do provide access to predictions on clean slabs, a limitation of current adsorption energy models.
The findings of this work can be useful in investigating other large datasets for heterogeneous catalysis such as the Open Catalyst 2022 (OC22) dataset.25 Although the convergence errors with respect to the DFT settings were not found to be a significant issue for the OC20 adsorption energies, they can cause significant issues for the OC22 dataset. The selection of Hubbard U corrections and the magnetic moment initialization can cause significant inconsistencies in the OC22 dataset. An interesting future direction is to explore the effect of changing these DFT settings on the accuracy of the MLPs trained on the OC22 dataset.
Footnote |
† Electronic supplementary information (ESI) available: Additional data supporting the conclusions. See DOI: https://doi.org/10.1039/d4cy00615a |
This journal is © The Royal Society of Chemistry 2024 |