Shuyi Jia,a Akaash R. Parthasarathy,a Rui Feng,a Guojing Cong,b Chao Zhanga and Victor Fung*a
aComputational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA. E-mail: victorfung@gatech.edu
bOak Ridge National Laboratory, Oak Ridge, TN, USA
First published on 21st February 2024
While pre-training has tremendously transformed many fields in deep learning, its application to three-dimensional crystal structures and materials science remains limited and under-explored. In particular, devising a general pre-training objective which is transferable to many potential downstream tasks remains challenging. In this paper, we demonstrate the benefits of pre-training graph neural networks (GNNs) with the objective of implicitly learning an approximate force field via denoising, or explicitly via supervised learning on energy, force, or stress labels. For implicit learning of the force field, we find there are significant benefits to training the model on the derivatives of the output, rather than on the output itself. We further show that explicit training of the force field using labelled data yields an even greater benefit than implicit training, and similarly benefits from a derivative-based training objective. We find that overall, the best pre-training performance can be achieved by explicitly learning the full combination of energy, force, and stress labels using output derivatives. This pre-training approach is advantageous as it leverages readily available forces from non-equilibrium structures produced during ab initio calculations, enabling the usage of significantly larger datasets for pre-training than using only equilibrium structures in denoising. We demonstrate the effectiveness of this approach on a wide range of materials property benchmarks across many materials systems and properties. These results suggest exciting future opportunities for scaling up pre-training on GNNs to build foundational models in materials science.
Recent advancements in pre-training in molecular and materials science can be primarily categorized into two learning paradigms: transfer learning and self-supervised learning. In the case of transfer learning, neural network models are first trained on large datasets with specific target properties before being finetuned on downstream tasks.14,15 In comparison, pre-training via self-supervised learning does not rely on explicit labels, but uses surrogate tasks to generate its own training data. Such pre-training tasks include context prediction and attribute masking.16,17 Additionally, a subset of self-supervised learning known as contrastive learning has emerged recently as a popular framework for pre-training. Specifically, contrastive learning relies on learning representations by contrasting self-generated positive and negative samples given an input graph. By doing so, the model can acquire a more discriminative and generalizable representation, thereby yielding remarkable performance in molecular and materials property prediction.18–21
Recently, a novel pre-training technique for 3D structures based on denoising achieved state-of-the-art (SOTA) performance on multiple molecular benchmarks.22 Specifically, the goal of pre-training, referred to as the denoising objective, is to predict the amount of i.i.d. noise added to the spatial coordinates of 3D molecular structures at equilibrium. Relying on the connection between denoising autoencoders and score-matching,23–25 it can also be shown that such a denoising objective is equivalent to implicitly learning an approximate force field. The empirical success of pre-training via denoising indicates that implicitly learning a force field does translate to learning better and meaningful representations for downstream tasks. However, the question persists as to whether explicitly learning the force field would lead to comparable, if not superior, performance. Additionally, a drawback of pre-training via denoising is the requirement that the upstream dataset consist solely of equilibrium structures, i.e. structures at energy minima.
Inspired by the denoising approach,22 in this work, we focus on the problem of pre-training GNNs with the objective of learning an approximate force field, which can be learnt in two ways, implicitly and explicitly. Pre-training via denoising is an example of the former, and our approach—derivative-based pre-training with forces—illustrates the explicit method. In the context of our work, derivative-based specifically denotes the process of obtaining model predictions by differentiating model outputs with respect to atomic positions. Concretely, in our approach, we optimize a GNN to directly minimize the loss between model derivatives and forces on 3D structures. In practice, additional graph-level objectives such as energies and stress are also incorporated to learn more meaningful representations during pre-training. It is also worth noting that while the original pre-training via denoising adopts a node-level noise prediction head, it can be made into a derivative-based form by equating noise to model derivatives with respect to atomic positions.26
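To make the derivative-based formulation concrete, the following minimal sketch (ours, not the authors' released code) shows how force predictions can be obtained by differentiating a scalar GNN output with respect to atomic positions via automatic differentiation; the `model(pos, batch)` interface and the loss weights are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def energy_and_forces(model, pos, batch):
    """Sketch: forces as the negative derivative of a scalar, per-structure model
    output with respect to atomic positions. `model`, `pos` and `batch` are
    illustrative placeholders, not the interface used in the paper."""
    pos = pos.clone().requires_grad_(True)       # track gradients w.r.t. coordinates
    energy = model(pos, batch)                   # shape [n_structures]
    forces = -torch.autograd.grad(
        energy.sum(), pos, create_graph=True     # keep the graph so the force loss is trainable
    )[0]                                         # shape [n_atoms, 3]
    return energy, forces

# Pre-training then minimizes a weighted combination of graph- and node-level terms, e.g.
# loss = w_e * F.mse_loss(energy, e_dft) + w_f * F.mse_loss(forces, f_dft) + w_s * F.mse_loss(stress, s_dft)
```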
The motivation behind our work is as follows. First, in materials chemistry, the majority of computational datasets available are acquired through ab initio calculations like density functional theory (DFT). A noteworthy aspect is that during the generation of equilibrium structures via DFT, numerous non-equilibrium structures with forces are also produced. This means that forces can be regarded as readily available labels, eliminating the strict requirement of equilibrium structures for pre-training through denoising. Second, the absence of constraints posed by pre-training methods which use only equilibrium structures allows us to capitalize on significantly larger datasets that include forces, presenting a valuable and exciting opportunity for scaling up pre-training to build foundational models in materials science. Additionally, while the learning of a force field with interatomic forces and additional attributes such as energies and stress is an established approach,13,27 its application as a pre-training strategy remains largely under-explored. Importantly, prior research on interatomic potentials predominantly focuses on tasks such as structure relaxation and dynamics simulations. In contrast, our approach distinctly investigates the advantages of pre-training with forces for downstream target property prediction.
Our contributions can be summarized as follows. Firstly, we establish that derivative-based pre-training via denoising outperforms its non-derivative-based counterpart, which relies on a prediction head, in downstream property prediction tasks. Secondly, we demonstrate the consistently better performance of our pre-training approach (derivative-based pre-training with forces and additional objectives) across an extensive array of materials property benchmarks when compared to both denoising variants. These findings underscore the advantages of explicit learning over implicit learning of the underlying approximate force field.
Concretely, the denoising objective requires us to first perturb each structure by adding i.i.d. Gaussian noise to each atom's coordinates. In other words, we generate a noisy copy of S, denoted by:
$\tilde{S}:\ \tilde{\mathbf{x}}_i = \mathbf{x}_i + \sigma\boldsymbol{\epsilon}_i, \qquad \boldsymbol{\epsilon}_i \sim \mathcal{N}(\mathbf{0}, I_3), \qquad i = 1, \ldots, n$ (4)
where $\mathbf{x}_i$ and $\tilde{\mathbf{x}}_i$ denote the original and perturbed coordinates of atom $i$, and $\sigma$ controls the noise scale.
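As an illustration of the two pre-training-via-denoising variants evaluated below, the sketch here (a paraphrase under assumed interfaces: a scalar-output `model`, a hypothetical `model.noise_head`, and an arbitrary noise scale) contrasts prediction-head denoising with its derivative-based counterpart.

```python
import torch
import torch.nn.functional as F

SIGMA = 0.1  # illustrative noise scale, not necessarily the value used in the paper

def denoising_loss(model, pos, batch, derivative_based=True):
    """Sketch of one denoising step on equilibrium structures: the target is the
    i.i.d. Gaussian noise added to the atomic coordinates."""
    noise = SIGMA * torch.randn_like(pos)            # epsilon_i ~ N(0, sigma^2 I)
    noisy_pos = (pos + noise).requires_grad_(True)   # perturbed copy of the structure

    if derivative_based:
        # Derivative-based denoising: the noise prediction is the derivative of the
        # scalar model output with respect to the perturbed positions.
        out = model(noisy_pos, batch)                # shape [n_structures]
        noise_pred = torch.autograd.grad(out.sum(), noisy_pos, create_graph=True)[0]
    else:
        # Prediction-head denoising: a node-level head outputs the noise directly.
        noise_pred = model.noise_head(noisy_pos, batch)  # hypothetical head, shape [n_atoms, 3]

    return F.mse_loss(noise_pred, noise)
```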
Fig. 1 provides an overview of our pre-training strategy.
The finetuning performance of the pre-training strategies is tested on a wide variety of properties including those in the MatBench suite.31 Specifically, we have chosen 8 MatBench tasks, covering exfoliation energy, frequency at the last phonon PhDOS peak, refractive index, shear modulus, bulk modulus, formation energy, and band gap. We have also included a dataset containing two-dimensional materials (2D materials) with work function as the target property,32 a dataset containing crystal structures of metal–organic frameworks (MOFs) with band gap as the target property,33 and a dataset containing metal alloy surfaces (surface) with adsorption energy as the target property.34
An overview of the finetuning datasets used is provided in Table 1.
Dataset | # Structures | Property | Unit
---|---|---|---
JDFT | 636 | Exfoliation energy | meV per atom
Phonons | 1265 | Freq. at last phonon PhDOS peak | cm⁻¹
Dielectric | 4764 | Refractive index | —
(log) GVRH | 10987 | Shear modulus | GPa
(log) KVRH | 10987 | Bulk modulus | GPa
Perovskite | 18928 | Formation energy | eV per atom
MP form | 132752 | Formation energy | eV per atom
MP gap | 106113 | Band gap | eV
2D | 3814 | Work function | eV
MOF | 13058 | Band gap | eV
Surface | 37334 | Adsorption energy | eV
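For reference, the MatBench tasks listed in Table 1 can be retrieved programmatically, for instance with matminer's dataset loader. The snippet below is a usage sketch under the standard MatBench dataset keys, not a description of the data pipeline used in this work.

```python
# Assumes `pip install matminer`; dataset keys follow the standard MatBench naming.
from matminer.datasets import load_dataset

matbench_tasks = [
    "matbench_jdft2d",       # exfoliation energy
    "matbench_phonons",      # freq. at last phonon PhDOS peak
    "matbench_dielectric",   # refractive index
    "matbench_log_gvrh",     # shear modulus (log10)
    "matbench_log_kvrh",     # bulk modulus (log10)
    "matbench_perovskites",  # formation energy
    "matbench_mp_e_form",    # formation energy
    "matbench_mp_gap",       # band gap
]

for name in matbench_tasks:
    df = load_dataset(name)  # pandas DataFrame with pymatgen Structure objects and the target column
    print(name, len(df))
```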
The finetuning results of both pre-training strategies on CGCNN28 are shown in Table 2. We first observe that both derivative-based denoising and prediction head denoising fail to beat the baseline with negative average percentage improvements of −6.98% and −7.71% respectively. However, if we exclude the anomalously high MAE on the phonons dataset, the average percentage improvements in MAE for the two variants of denoising become 3.13% and −6.84% respectively. This suggests that derivative-based denoising outperforms its non-derivative-based counterpart in terms of fine-tuning performance. A similar trend is evident in the results obtained using the TorchMD-Net model, as detailed in the subsequent paragraphs. It should be highlighted that prediction head denoising experiences a significant loss plateau early in the pre-training phase, which might have led to dissimilar atomic representations under principal component analysis compared to other pre-training strategies and variants (see ESI Appendix D†).
Pre-training strategy | Ratio (energy:force:stress) | JDFT | Phonons | Dielectric | GVRH | KVRH | Perovskites | 2D | MOF | Surface | MP gap | MP form | Avg. % impr.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Baseline | — | 62.6 | 59.1 | 0.406 | 0.105 | 0.0736 | 0.0437 | 0.263 | 0.343 | 0.0852 | 0.230 | 0.0417 | —
Forces | 1:0:0 | 46.5 | 57.6 | 0.380 | 0.098 | 0.0728 | 0.0365 | 0.212 | 0.297 | 0.0743 | 0.218 | 0.0380 | 10.8%
Forces | 0:1:0 | 50.5 | 63.7 | 0.338 | 0.111 | 0.0942 | 0.0401 | 0.240 | 0.319 | 0.0809 | 0.230 | 0.0443 | 1.58%
Forces | 0:1:1 | 56.4 | 67.9 | 0.377 | 0.103 | 0.0777 | 0.0411 | 0.232 | 0.311 | 0.0800 | 0.226 | 0.0406 | 3.28%
Forces | 1:1:1 | 48.9 | 56.4 | 0.248 | 0.0945 | 0.0709 | 0.0370 | 0.213 | 0.296 | 0.0741 | 0.210 | 0.0383 | 14.3%
Forces | 1:500:500 | 45.5 | 60.7 | 0.324 | 0.0962 | 0.0720 | 0.0366 | 0.209 | 0.291 | 0.0747 | 0.212 | 0.0392 | 12.1%
Derivative-based denoising | — | 46.5 | 123.0 | 0.386 | 0.0947 | 0.0743 | 0.0389 | 0.212 | 0.473 | 0.0883 | 0.227 | 0.0410 | −6.98% (3.13%)
Prediction head denoising | — | 59.0 | 68.8 | 0.376 | 0.109 | 0.0933 | 0.0437 | 0.278 | 0.417 | 0.0965 | 0.228 | 0.0464 | −7.71% (−6.84%)
Best % impr. | — | 27.3% | 4.57% | 38.9% | 10.0% | 3.67% | 16.5% | 20.5% | 15.2% | 13.0% | 8.70% | 8.87% | —
Next, we observe that derivative-based pre-training with forces and additional objectives yields superior performance across all downstream datasets. Specifically, the average percentage improvement in MAE in comparison to the baseline ranges from 1.58% (ratio 0:1:0) to as high as 14.3% (ratio 1:1:1). In addition, we note that our pre-training strategy not only consistently outperforms the baseline but also surpasses both variants of pre-training via denoising. Interestingly, the ratios 1:1:1 and 1:500:500 almost always yield the best results, doing so for 9 of the 11 datasets tested. In comparison, pre-training with forces alone, e.g. 0:1:0, struggles to outperform the baseline on datasets like phonons, GVRH, and KVRH.
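For clarity, the "Avg. % impr." column appears to correspond to the per-dataset relative reduction in MAE with respect to the baseline, averaged over the 11 datasets; the short check below (our reconstruction, not the authors' evaluation script) reproduces the 14.3% figure for the 1:1:1 ratio from the Table 2 entries.

```python
def avg_pct_improvement(baseline_mae, model_mae):
    """Average over datasets of the relative MAE reduction versus the baseline."""
    pcts = [100.0 * (b - m) / b for b, m in zip(baseline_mae, model_mae)]
    return sum(pcts) / len(pcts)

# Baseline and 1:1:1 rows of Table 2 (CGCNN), in dataset order.
baseline  = [62.6, 59.1, 0.406, 0.105, 0.0736, 0.0437, 0.263, 0.343, 0.0852, 0.230, 0.0417]
ratio_111 = [48.9, 56.4, 0.248, 0.0945, 0.0709, 0.0370, 0.213, 0.296, 0.0741, 0.210, 0.0383]
print(round(avg_pct_improvement(baseline, ratio_111), 1))  # ~14.3, matching the reported 14.3%
```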
To show that derivative-based pre-training with forces is model-agnostic and can be beneficial beyond CGCNN, we applied the same pre-training strategies to the TorchMD-Net architecture.29 TorchMD-Net is an equivariant transformer whose layers maintain per-atom scalar and vector features that are updated by a self-attention mechanism. Similar to CGCNN, we first obtain a scalar output from the model before auto-differentiating with respect to positions to obtain force or noise predictions for derivative-based approaches.
In Table 3, we evaluate the finetuning performance of derivative-based pre-training with forces and pre-training via denoising on the TorchMD-Net architecture. First, we observe that both variants of denoising beat the baseline, with derivative-based denoising emerging as the better option, achieving an average percentage improvement in MAE of 23.5%. Similar to what we have observed in the case of CGCNN, derivative-based pre-training with forces significantly improves over the baseline for an average percentage improvement in MAE ranging from 15.1% (ratio 1:0:0) to 25.1% (ratio 1:500:500). We also observe that our pre-training strategy performs better than or equal to pre-training via denoising for 10 out of 11 datasets. The only exception is the JDFT dataset, where the MAE for derivative-based denoising is lower than the best performing ratio of pre-training with forces by 0.4.
Pre-training strategy | Ratio (energy:force:stress) | JDFT | Phonons | Dielectric | GVRH | KVRH | Perovskites | 2D | MOF | Surface | MP gap | MP form | Avg. % impr.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Baseline | — | 55.7 | 117 | 0.415 | 0.107 | 0.0840 | 0.0440 | 0.287 | 0.265 | 0.0774 | 0.231 | 0.0351 | —
Forces | 1:0:0 | 57.6 | 108 | 0.364 | 0.0830 | 0.0616 | 0.0416 | 0.190 | 0.260 | 0.0579 | 0.195 | 0.0287 | 15.1%
Forces | 0:1:0 | 51.0 | 111 | 0.320 | 0.0805 | 0.0596 | 0.0338 | 0.186 | 0.239 | 0.0528 | 0.164 | 0.0240 | 22.8%
Forces | 0:1:1 | 47.4 | 111 | 0.317 | 0.0800 | 0.0597 | 0.0355 | 0.180 | 0.238 | 0.0518 | 0.168 | 0.0250 | 23.1%
Forces | 1:1:1 | 50.9 | 105 | 0.306 | 0.0798 | 0.0608 | 0.0395 | 0.189 | 0.245 | 0.0531 | 0.194 | 0.0261 | 20.3%
Forces | 1:500:500 | 39.0 | 109 | 0.272 | 0.0759 | 0.0569 | 0.0356 | 0.185 | 0.246 | 0.0540 | 0.179 | 0.0250 | 25.1%
Derivative-based denoising | — | 38.6 | 111 | 0.331 | 0.0770 | 0.0572 | 0.0371 | 0.180 | 0.241 | 0.0534 | 0.177 | 0.0259 | 23.5%
Prediction head denoising | — | 46.0 | 128 | 0.353 | 0.0829 | 0.0592 | 0.0368 | 0.194 | 0.242 | 0.0536 | 0.179 | 0.0272 | 18.9%
Best % impr. | — | 30.7% | 10.3% | 34.5% | 29.1% | 32.3% | 19.3% | 37.3% | 10.2% | 33.1% | 29.0% | 31.6% | —
Furthermore, it is worth noting that when the TorchMD-Net model is pre-trained with forces alone (ratio 0:1:0), we observe a substantial average percentage improvement of 22.8% on the fine-tuning datasets. This improvement is significantly higher compared to the previously achieved 1.58% with CGCNN. Considering results from both CGCNN and TorchMD-Net, it becomes evident that the most effective pre-training strategy involves a combination of node-level and graph-level objectives. This is substantiated by the consistently strong performance observed with ratios such as 1:1:1 and 1:500:500.
Pre-training strategy | Ratio (energy:force:stress) | JDFT | Phonons | Dielectric | GVRH | KVRH | Perovskites | 2D | MOF | Surface | MP gap | MP form | Avg. % impr.
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Baseline | — | 62.3 | 59.5 | 0.355 | 0.0947 | 0.0681 | 0.0437 | 0.254 | 0.315 | 0.0821 | 0.223 | 0.0402 | —
Forces | 1:0:0 | 42.2 | 54.6 | 0.313 | 0.0921 | 0.0677 | 0.0364 | 0.208 | 0.299 | 0.0752 | 0.215 | 0.0395 | 9.94%
Forces | 0:1:0 | 42.3 | 51.0 | 0.332 | 0.102 | 0.0746 | 0.0403 | 0.249 | 0.302 | 0.0758 | 0.225 | 0.0363 | 6.00%
Forces | 0:1:1 | 49.5 | 49.5 | 0.360 | 0.0993 | 0.0738 | 0.0400 | 0.267 | 0.301 | 0.0753 | 0.217 | 0.0345 | 5.06%
Forces | 1:1:1 | 46.9 | 41.7 | 0.255 | 0.0887 | 0.0659 | 0.0360 | 0.204 | 0.295 | 0.0730 | 0.212 | 0.0353 | 14.9%
Forces | 1:500:500 | 39.6 | 44.4 | 0.309 | 0.0932 | 0.0715 | 0.0357 | 0.205 | 0.289 | 0.0694 | 0.210 | 0.0322 | 14.4%
Derivative-based denoising | — | 41.2 | 55.1 | 0.287 | 0.0873 | 0.0661 | 0.0370 | 0.185 | 0.291 | 0.0739 | 0.212 | 0.0326 | 14.1%
Best % impr. | — | 36.4% | 29.9% | 28.2% | 7.81% | 3.23% | 18.3% | 27.2% | 8.25% | 15.5% | 5.83% | 19.9% | —
We observe a general increase in performance over the baseline when global mean pooling is used. Derivative-based pre-training with forces performs best for 9 of 11 datasets, with an average percentage improvement in MAE ranging from 5.06% (ratio 0:1:1) to 14.9% (ratio 1:1:1). Compared with the range of 1.58% to 14.3% obtained when add pooling is applied to our pre-training strategy, both the lower and upper bounds improve. Further, derivative-based denoising with mean pooling beats the baseline by 14.1% on average, a marked improvement given that it fails to beat the baseline with add pooling; this gain stems largely from the substantial decrease in MAE for phonons and MP formation energy. More importantly, these results show that derivative-based pre-training with forces and derivative-based denoising can be beneficial regardless of the type of pooling used in downstream tasks.
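For readers less familiar with the two readouts, the toy snippet below (illustrative only, using PyTorch Geometric's pooling utilities as an assumed implementation detail) shows the difference between add and mean pooling of per-atom features into a graph-level representation.

```python
import torch
from torch_geometric.nn import global_add_pool, global_mean_pool

# Toy example: 5 atoms belonging to 2 crystal graphs in one batch.
node_feats = torch.randn(5, 16)           # per-atom embeddings from the GNN
batch = torch.tensor([0, 0, 0, 1, 1])     # graph assignment for each atom

summed   = global_add_pool(node_feats, batch)   # [2, 16]; readout grows with the number of atoms
averaged = global_mean_pool(node_feats, batch)  # [2, 16]; readout is independent of system size
```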
In Table 5, we evaluate the 0:1:0 and 1:500:500 variants pre-trained for 5 different numbers of epochs: 12, 25, 50, 100, and 200. In general, we observe that the finetuning MAE decreases as the number of pre-training epochs increases. This trend is most evident for ratio 1:500:500: finetuning results obtained from the model pre-trained for 200 epochs are the best for 9 of 11 datasets. The same decrease in MAE with an increasing number of pre-training epochs is observed for ratio 0:1:0, albeit less consistently. This observation underscores the direct relationship between the level of convergence achieved by the pre-trained model and the subsequent improvement in downstream finetuning performance.
Ratio | # Epochs | JDFT | Phonons | Dielectric | GVRH | KVRH | Perovskites | 2D | MOF | Surface | MP gap | MP form
---|---|---|---|---|---|---|---|---|---|---|---|---
0:1:0 | 12 | 46.6 | 77.3 | 0.345 | 0.106 | 0.0759 | 0.0426 | 0.255 | 0.327 | 0.0823 | 0.229 | 0.0408
0:1:0 | 25 | 54.3 | 70.3 | 0.319 | 0.103 | 0.0749 | 0.0417 | 0.249 | 0.328 | 0.0816 | 0.227 | 0.0417
0:1:0 | 50 | 48.7 | 72.9 | 0.365 | 0.106 | 0.0749 | 0.0406 | 0.245 | 0.357 | 0.0801 | 0.230 | 0.0414
0:1:0 | 100 | 50.5 | 63.7 | 0.338 | 0.111 | 0.0942 | 0.0401 | 0.240 | 0.319 | 0.0809 | 0.230 | 0.0443
0:1:0 | 200 | 52.3 | 66.7 | 0.368 | 0.107 | 0.0778 | 0.0399 | 0.233 | 0.320 | 0.0776 | 0.227 | 0.0394
1:500:500 | 12 | 46.7 | 59.8 | 0.343 | 0.116 | 0.0761 | 0.0392 | 0.241 | 0.311 | 0.0799 | 0.229 | 0.0411
1:500:500 | 25 | 49.0 | 65.8 | 0.324 | 0.102 | 0.0790 | 0.0383 | 0.220 | 0.305 | 0.0802 | 0.228 | 0.0407
1:500:500 | 50 | 46.4 | 78.0 | 0.346 | 0.102 | 0.0727 | 0.0369 | 0.213 | 0.299 | 0.0762 | 0.220 | 0.0387
1:500:500 | 100 | 45.5 | 85.8 | 0.324 | 0.0962 | 0.0720 | 0.0366 | 0.208 | 0.291 | 0.0747 | 0.212 | 0.0392
1:500:500 | 200 | 38.1 | 59.9 | 0.320 | 0.0949 | 0.0708 | 0.0351 | 0.200 | 0.287 | 0.0730 | 0.215 | 0.0343
Overall, we observe that derivative-based pre-training with forces consistently outperforms both variants of the denoising approach, highlighting the effectiveness of explicitly learning a force field for downstream property prediction tasks. Beyond its superior finetuning performance, another advantage of pre-training with forces is its compatibility with non-equilibrium structures during training, thereby allowing the use of diverse datasets. However, a limitation of this approach is its dependence on datasets with forces as labels, which could pose challenges in obtaining such labeled data for specific domains or applications. On the other hand, pre-training via denoising, being self-supervised, is in theory applicable to a broader range of datasets. In practice, however, a significant drawback of the denoising approach is its requirement that training structures be at equilibrium, a requirement that may be prohibitively expensive to satisfy and that limits the approach's generalizability to non-equilibrium structures.
In a concurrent study similar to this work, Shoghi et al.37 explore the effectiveness of pre-training with forces for atomic property prediction. Notably, their methodology involves supervised pre-training with forces and energies, utilizing an expansive pre-training dataset comprising 120 million samples. The models used by Shoghi et al.37 demonstrate the remarkable advantages of pre-training in line with our observations; however, we note several differences between their work and ours. First, in their study, Shoghi et al.37 employ individual prediction heads for each pre-training dataset, whereas we adopt a derivative-based approach wherein the scalar model outputs are differentiated with respect to atomic coordinates to generate force predictions. Second, although the denoising approach is mentioned in the study by Shoghi et al.,37 a systematic and direct comparison between the two approaches was not performed. In contrast, our investigation systematically evaluates both pre-training strategies and reveals that explicit learning from force labels, as opposed to the implicit learning of denoising, proves to be a more effective pre-training strategy. Furthermore, we also observe that derivative-based denoising performs better than prediction head denoising. Lastly, our pre-training dataset consists of 190 thousand samples, representing a mere 0.158% of the 120 million samples used by Shoghi et al.37,38 Despite this substantial difference in dataset size, our finetuning performance exhibits notable improvement, demonstrating the effectiveness of our approach even when the pre-training dataset is limited.
In summary, we introduced a derivative-based pre-training strategy for graph neural networks (GNNs) based on explicit learning of an approximate force field coupled with additional objectives such as energies and stress on 3D crystal structures. We demonstrated that this pre-training strategy is model-agnostic and significantly improves the downstream finetuning performance across a diverse collection of datasets with different materials systems and target properties. This technique enables us to utilize forces that are readily obtainable during ab initio calculations as labels, thereby unlocking the capability to utilize much larger datasets during pre-training. Our work thus introduces exciting opportunities in the future to scale up pre-training to build foundational models within the field of materials science.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00214d
This journal is © The Royal Society of Chemistry 2024 |