Kangming Li *a, Brian DeCost b, Kamal Choudhary b and Jason Hattrick-Simpers acde
aDepartment of Materials Science and Engineering, University of Toronto, 27 King's College Cir, Toronto, ON, Canada. E-mail: kangming.li@utoronto.ca; jason.hattrick.simpers@utoronto.ca
bMaterial Measurement Laboratory, National Institute of Standards and Technology, 100 Bureau Dr, Gaithersburg, MD, USA
cAcceleration Consortium, University of Toronto, 27 King's College Cir, Toronto, ON, Canada
dVector Institute for Artificial Intelligence, 661 University Ave, Toronto, ON, Canada
eSchwartz Reisman Institute for Technology and Society, 101 College St, Toronto, ON, Canada
First published on 30th April 2024
The use of machine learning has become increasingly popular in materials science as data-driven materials discovery becomes the new paradigm. Reproducibility of findings is paramount for promoting transparency and accountability in research and for building trust in the scientific community. Here we conduct a reproducibility analysis of the work by K. Choudhary and B. DeCost [npj Comput. Mater., 7, 2021, 185], in which a new graph neural network architecture was developed with improved performance on multiple atomistic prediction tasks. We examine the reproducibility of the model performance on 29 regression tasks and of an ablation analysis of the graph neural network layers. We find that the reproduced results generally exhibit good quantitative agreement with the initial study, despite minor disparities in model performance and training efficiency that may result from factors such as hardware differences and the stochasticity involved in model training and data splits. The ease of conducting these reproducibility experiments confirms the great benefits of the open data and code practices to which the initial work adhered. We also discuss further enhancements to reproducible practices, such as code and data archiving and providing the data identifiers used in dataset splits.
We present a case study aiming to reproduce the main results of K. Choudhary and B. DeCost centering on the development of the Atomistic Line Graph Neural Network (ALIGNN).19 Among the variety of machine learning architectures, Graph Neural Networks (GNNs) have demonstrated state-of-the-art performance in capturing complex atomistic relationships and predicting material properties.20–25 As one of the first GNN architectures to account for many-body interactions, ALIGNN performs message passing on both the interatomic bond graph and its line graph corresponding to bond angles. This explicit inclusion of angle information was demonstrated to improve performance on multiple atomistic prediction tasks.19 While a number of advanced architectures (in particular equivariant GNNs)26–29 have been proposed with improved performance in some cases,30 it remains an open question whether equivariant GNNs have a substantial and systematic advantage over invariant ones.31 In addition, recent benchmarks show that the ALIGNN model remains competitive with other leading GNNs in terms of accuracy and robustness.32–34 As ALIGNN is often used as a representative GNN in many ML studies,32–36 we consider it an important target for reproducibility assessment.
The original ALIGNN study evaluated the model's performance on 52 crystal and molecular properties across the JARVIS-DFT,37 Materials Project,38 and QM9 databases,39 supplemented by an ablation analysis of ALIGNN models trained on formation energy and band gap data from the JARVIS-DFT database. Here we focus on evaluating the reproducibility of (1) the model performance on all 29 regression tasks from the JARVIS-DFT database, and (2) the ablation analysis using models trained on the formation energy data from the JARVIS-DFT database.
The remainder of this paper is organized as follows: Section 2 details the procedure for setting up the model training. Sections 3 and 4 present the reproducibility analysis of the model performance and of the ablation analysis, respectively. Finally, Section 5 discusses the observed challenges and provides suggestions for better reproducibility.
While no obvious warnings or errors occurred during the installation process, subsequent ALIGNN training with GPU support encountered errors linked to the installed version of dgl, the Deep Graph Library (DGL)40 used for the model's implementation. This issue might be attributable to the deprecated DGL package name recommended in the installation guide. On the other hand, updating dgl to a newer version induced installation errors due to dependency conflicts between alignn and dgl involving the library used by alignn for configuration parsing and validation; moreover, the training failure persisted. Upon further examination, we found that the order of package installation was a critical factor: successful ALIGNN training could only be achieved when an updated version of dgl was installed prior to alignn, and not vice versa. This sequence still generated dependency conflicts, but these did not impede the ALIGNN training. The exact cause of this delicate dependence on the installation order is not entirely clear. However, our hypothesis is that the alignn package installs the CPU-only version of dgl, so that the package manager skips installation of the explicitly requested GPU-enabled version of dgl without providing a clear warning.
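Because the failure mode was silent at install time and surfaced only as a runtime error during GPU training, a quick post-installation check can save debugging effort. Below is a minimal sanity-check sketch (not part of the alignn codebase), assuming a CUDA-capable machine: it builds a tiny DGL graph and moves it to the GPU, which fails if a CPU-only DGL build was installed.

```python
# Minimal post-installation sanity check (a sketch, not from the alignn
# codebase): verify that a GPU-enabled DGL build is active before launching
# ALIGNN training on a CUDA machine.
import torch
import dgl

assert torch.cuda.is_available(), "PyTorch does not see a CUDA device"

g = dgl.graph(([0, 1], [1, 2]))  # tiny 3-node test graph
g = g.to("cuda")                 # raises an error if DGL is a CPU-only build
print("DGL test graph placed on:", g.device)
```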
We chose to use the current version 2023.5.1 of the alignn package in our reproducibility study, rather than the specific revision from the original ALIGNN study, due to ongoing updates in the codebase. Regardless, the resulting changes in training efficiency and model performance are anticipated to be minimal, as there has been no major update to the relevant components.
The interpretation of the retrieved data is generally straightforward with the provided property labels, albeit not always intuitively so. For instance, the retrieved data include properties labeled “magmom_oszicar” and “magmom_outcar”, and it is unclear which one corresponds to the magnetic moment data discussed in the paper. Nonetheless, the mean absolute deviation (MAD) of the data was provided in the original paper, which can be used to disambiguate the property labels.
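As an illustration, the MAD values can be computed directly from the retrieved data and matched against those reported in the original paper. The sketch below assumes the jarvis-tools package, the dft_3d dataset name, and that missing values are encoded as “na”.

```python
# Sketch: compute the mean absolute deviation (MAD) of two ambiguous property
# labels so they can be matched against the MAD values in the original paper.
# Assumes the jarvis-tools package; "na" marks missing entries in JARVIS-DFT.
import numpy as np
from jarvis.db.figshare import data

entries = data("dft_3d")  # downloads the JARVIS-DFT 3D dataset

for label in ["magmom_oszicar", "magmom_outcar"]:
    values = np.array([e[label] for e in entries if e[label] != "na"], dtype=float)
    mad = np.mean(np.abs(values - values.mean()))
    print(f"{label}: {len(values)} entries, MAD = {mad:.3f}")
```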
To quantify the agreement with the original study, we compute the percentage deviation of the reproduced mean absolute error (MAE) from the originally reported value:

$$\text{MAE deviation} = \frac{\text{MAE}_{\text{reproduced}} - \text{MAE}_{\text{original}}}{\text{MAE}_{\text{original}}} \times 100\% \qquad (1)$$
Fig. 1 presents the minimum, maximum, mean, and standard deviation of the MAE deviations for the selected properties, along with the corresponding total number of entries. Two general observations emerge from these results. First, the variability in the reproduced MAE is more pronounced for properties with fewer entries, with the exception of the models for predicting p-type Seebeck coefficients, which exhibit a relatively large MAE variance despite having over 20k entries. Since the MAE variance stems from the random seeds used for model parameter initialization and for random batch construction in stochastic gradient optimization, this suggests that model performance on smaller datasets is more susceptible to parameter initialization and other sources of training variation. This is expected, because training on smaller datasets falls into the high-variance regime and poses challenges for converging towards a model with consistent predictive behavior. Furthermore, we anticipate that the variance could increase further if different data splits (even with the same split ratio) were employed. Second, the model performance reported in the original ALIGNN paper is reasonably well reproduced in this study. As can be seen from Fig. 1, the original MAE values fall within the ranges of the reproduced MAEs for 19 out of the 29 tasks. For the remaining 10 tasks, the minimum absolute MAE deviation is no more than 5%, meaning that at least one of our independent runs matches the original MAE to within 5%.
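For concreteness, the deviation statistics shown in Fig. 1 can be computed from repeated training runs as sketched below; the re-run MAE values here are illustrative placeholders rather than results from this study.

```python
# Sketch: summarize the relative MAE deviation (eqn (1)) over repeated
# training runs for a single property. Re-run values are placeholders.
import numpy as np

def mae_deviation(mae_reproduced, mae_original):
    """Percentage deviation of a reproduced MAE from the original value."""
    return 100.0 * (mae_reproduced - mae_original) / mae_original

mae_original = 0.033                 # e.g. the original formation energy MAE
mae_reruns = [0.033, 0.034, 0.032]   # hypothetical MAEs from independent runs
devs = [mae_deviation(m, mae_original) for m in mae_reruns]
print(np.min(devs), np.max(devs), np.mean(devs), np.std(devs))
```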
First, we note that the model performance on various properties in the previous section was obtained with 4 ALIGNN and 4 GCN layers. In particular, for the formation energy prediction, we obtained the same MAE (0.033 eV per atom) as in the original paper. We use this value as the baseline to normalize the MAEs obtained with different numbers of ALIGNN and GCN layers in the ablation analysis. The resulting normalized MAEs from the original paper and this work are shown in Table 1. Overall, similar effects of the layers on the model performance are reproduced, with a maximum deviation of 6% from the original value when using 1 ALIGNN and 1 GCN layer. Such a deviation is expected to be within the error bar of the performance, since here we performed only a single model training run with a fixed random seed for each layer combination due to the high training cost.
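A rough sketch of how such a layer sweep can be configured with the alignn package is shown below. The nested config fields (model.alignn_layers and model.gcn_layers) follow the alignn codebase at the time of this study; the file names are hypothetical placeholders.

```python
# Sketch of the ablation sweep over ALIGNN/GCN layer counts. Config field
# names follow the alignn package; file names are hypothetical placeholders.
import copy
import json

with open("config_formation_energy.json") as f:  # hypothetical base config
    base = json.load(f)

for n_alignn in range(5):
    for n_gcn in range(5):
        cfg = copy.deepcopy(base)
        cfg["model"]["alignn_layers"] = n_alignn
        cfg["model"]["gcn_layers"] = n_gcn
        out_name = f"config_alignn{n_alignn}_gcn{n_gcn}.json"
        with open(out_name, "w") as f:
            json.dump(cfg, f, indent=2)
        # each config is then passed to the alignn training entry point
        # (e.g. the train_folder.py script shipped with the repository)
```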
Table 1 Test MAE for JARVIS-DFT formation energy prediction with varying numbers of ALIGNN and GCN layers, normalized by the MAE of the 4 ALIGNN + 4 GCN configuration (top: original paper; bottom: this work)

Original paper | GCN-0 | GCN-1 | GCN-2 | GCN-3 | GCN-4 |
---|---|---|---|---|---|
ALIGNN-0 | 13.48 | 1.97 | 1.52 | 1.36 | 1.33 |
ALIGNN-1 | 1.94 | 1.24 | 1.12 | 1.09 | 1.12 |
ALIGNN-2 | 1.18 | 1.09 | 1.03 | 1.03 | 1.03 |
ALIGNN-3 | 1.09 | 1.03 | 1.00 | 1.03 | 1.03 |
ALIGNN-4 | 1.03 | 1.03 | 1.03 | 1.03 | 1.00 |
This work | GCN-0 | GCN-1 | GCN-2 | GCN-3 | GCN-4 |
---|---|---|---|---|---|
ALIGNN-0 | 13.66 | 1.95 | 1.48 | 1.30 | 1.27 |
ALIGNN-1 | 1.95 | 1.16 | 1.11 | 1.09 | 1.09 |
ALIGNN-2 | 1.22 | 1.06 | 1.08 | 1.06 | 1.04 |
ALIGNN-3 | 1.07 | 1.03 | 1.02 | 1.03 | 1.02 |
ALIGNN-4 | 1.04 | 1.02 | 1.00 | 1.01 | 1.00 |
The original ALIGNN paper also documented the training time per epoch as a function of the number of layers. Since we did not use the same hardware as the original paper, a direct comparison of training times is not meaningful. Instead, we normalize the training time per epoch with respect to that of the 4 ALIGNN + 4 GCN layer configuration and compare the normalized training costs, as shown in Table 2. Compared to the effect of the number of layers on model performance, reproducing the effect on training time proves more challenging. For instance, according to the original paper, using no ALIGNN or GCN layers requires only 11% of the training cost of using 4 ALIGNN and 4 GCN layers, whereas in this work the same configuration is found to require 19% of the baseline cost. In other words, the relative deviation of the reproduced effect of layers on the training cost can be as large as roughly 70%. This is not surprising, because training time is sensitive to hardware and operating system configuration, and different layer combinations may induce varying system loads that lead to discrepancies in training efficiency. Training efficiency for deep learning workloads in particular can be sensitive to memory bandwidth and the ability of the data-loading pipeline to saturate the GPU.
Table 2 Training time per epoch with varying numbers of ALIGNN and GCN layers, normalized by that of the 4 ALIGNN + 4 GCN configuration (top: original paper; bottom: this work)

Original paper | GCN-0 | GCN-1 | GCN-2 | GCN-3 | GCN-4 |
---|---|---|---|---|---|
ALIGNN-0 | 0.11 | 0.22 | 0.22 | 0.28 | 0.33 |
ALIGNN-1 | 0.33 | 0.50 | 0.56 | 0.56 | 0.56 |
ALIGNN-2 | 0.56 | 0.67 | 0.63 | 0.72 | 0.67 |
ALIGNN-3 | 0.67 | 0.78 | 0.78 | 0.83 | 0.78 |
ALIGNN-4 | 0.83 | 0.89 | 0.94 | 0.94 | 1.00 |
This work | GCN-0 | GCN-1 | GCN-2 | GCN-3 | GCN-4 |
---|---|---|---|---|---|
ALIGNN-0 | 0.19 | 0.24 | 0.28 | 0.31 | 0.35 |
ALIGNN-1 | 0.33 | 0.41 | 0.44 | 0.47 | 0.49 |
ALIGNN-2 | 0.51 | 0.59 | 0.60 | 0.63 | 0.64 |
ALIGNN-3 | 0.68 | 0.77 | 0.79 | 0.82 | 0.84 |
ALIGNN-4 | 0.83 | 0.95 | 1.03 | 0.98 | 1.00 |
With the above results for model performance and training time, we can construct an accuracy-cost Pareto plot, as shown in Fig. 2. Deviations between the original and reproduced Pareto fronts occur when the normalized training time is below 0.6. On the other hand, the layer configurations on the Pareto fronts are largely the same: the original Pareto front includes (0,1)-(0,2)-(0,3)-(0,4)-(1,1)-(1,2)-(1,4)-(1,3)-(2,2), with the first and second numbers indicating the numbers of ALIGNN and GCN layers, respectively; the reproduced Pareto front includes (0,1)-(0,2)-(0,3)-(0,4)-(1,1)-(1,2)-(1,3)-(1,4)-(2,1). The sources of deviation include the uncertainty in model performance related to model parameter initialization, and the potential sensitivity of training efficiency to differences in memory bandwidth relative to the compute capability of the different GPUs. These factors may influence the performance ranking, the training cost ranking, and ultimately the Pareto fronts.
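The Pareto fronts themselves can be extracted from Tables 1 and 2 with a simple dominance check, as sketched below: a configuration is kept if no other configuration is both at least as fast and at least as accurate, with a strict improvement in at least one of the two criteria.

```python
# Sketch: extract the accuracy-cost Pareto front from (normalized training
# time, normalized MAE) pairs, as used for Fig. 2.
def pareto_front(points):
    front = []
    for t_i, e_i in points:
        dominated = any(
            t_j <= t_i and e_j <= e_i and (t_j < t_i or e_j < e_i)
            for t_j, e_j in points
            if (t_j, e_j) != (t_i, e_i)
        )
        if not dominated:
            front.append((t_i, e_i))
    return sorted(front)

# Example with four (time, MAE) pairs taken from the "This work" tables,
# i.e. the (1,1), (2,1), (2,2) and (4,4) layer configurations:
print(pareto_front([(0.41, 1.16), (0.59, 1.06), (0.60, 1.08), (1.00, 1.00)]))
# -> [(0.41, 1.16), (0.59, 1.06), (1.0, 1.0)]; the (2,2) point is dominated
```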
Fig. 2 ALIGNN accuracy-cost ablation analysis for the JARVIS-DFT formation energy models. The values of the plotted normalized MAE and training time per epoch are given in Tables 1 and 2, respectively. The numbers in square brackets (or parentheses) indicate the corresponding numbers of ALIGNN and GCN layers for the reproduced (or original) points. The solid lines indicate the original and reproduced Pareto fronts.
Nonetheless, minor disparities in model performance and training times emerge. The variations observed in model performance likely stem from the variance inherent in the random initialization of model parameters and the stochastic training process. The discrepancies in training time, on the other hand, are likely attributable to external factors such as hardware configurations and operating system differences. In addition, we cannot completely dismiss the potential influence of updates to the codebase on these discrepancies. Indeed, in a practical setting it is recommended to re-optimize certain training pipeline hyperparameters, such as the batch size, to maximize training throughput on the available hardware. Typically this also requires re-tuning the learning rate and regularization hyperparameters, which works against the goal of using the same hyperparameters as in the original work.
From the perspective of reproducibility, it would be advantageous to make available a snapshot of the code and data versions utilized in any published computational research. Primarily, this would eliminate the potential influence of codebase updates when determining the root causes of any discrepancies between the original and reproduced results. For instance, the Zenodo repository supports automated GitHub code release archiving that can be referenced with digital object identifiers (DOIs). As an example, the snapshot of the GitHub code used in this work is archived at https://zenodo.org/records/10042567 with a DOI of 10.5281/zenodo.10042567.
Additionally, providing such snapshots would simplify the task of correctly setting up the package installation. Indeed, the installation guide is typically updated less frequently than the codebase, and compatibility checks against newer versions of dependencies are conducted less regularly. This can make the installation process more susceptible to unforeseen issues, which may be hard to solve for new users. For this work, we found that achieving a smooth, working installation was more challenging than reproducing the ALIGNN results; the latter is straightforward by simply following the ALIGNN tutorial.
Another good practice would be to provide the data identifiers used in the training-validation-test splits, which guarantees that exactly the same data splits are used in model training and evaluation. While the effects of different random splits may be small, especially for large datasets, using the same data splits removes this source of uncertainty and enhances reproducibility, and is common practice in community benchmarks (due to the need for fair model comparison).22,24,25 Future ML work on existing datasets could use the same data splits as those benchmarks to avoid this additional reporting task. Alternatively, it is straightforward to generate such outputs in the ML pipeline and include them as part of the ESI,† as is done in this work.
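A minimal sketch of how such an output could be generated is given below; the jid key follows JARVIS-DFT entries, while the split fractions, seed, and file name are illustrative choices.

```python
# Sketch: record the material identifiers used in a train/validation/test
# split so the exact same split can be reused later. The "jid" key follows
# JARVIS-DFT; fractions, seed, and file name are illustrative.
import json
import random

def split_ids(ids, train_frac=0.8, val_frac=0.1, seed=123):
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = int(train_frac * len(ids))
    n_val = int(val_frac * len(ids))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

# e.g. with the JARVIS-DFT entries retrieved earlier:
# splits = split_ids(e["jid"] for e in entries)
# with open("dataset_split_ids.json", "w") as f:
#     json.dump(splits, f)
```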
Another challenge surfaced in reproducing the ablation analysis, when model training failed if the number of ALIGNN layers was set to zero. This failure was traced back to an inappropriate data type check in the latest code. Although the remedy in this case required merely a one-line correction, it could prove challenging for users unfamiliar with the code or uncertain about potential side effects of making such modifications. Providing a snapshot of the code version would therefore bolster the reproducibility study by mitigating such unforeseen issues that can emerge during codebase updates.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00064a |