Comment on “A simple constrained machine learning model for predicting high-pressure-hydrogen-compressor materials” by Hattrick-Simpers, et al., Molecular Systems Design & Engineering, 2018, 3, 509
Abstract
In this short comment we present a reproducibility study for our recent manuscript, “A simple constrained machine learning model for predicting high-pressure-hydrogen-compressor materials” by Hattrick-Simpers et al., Mol. Syst. Des. Eng., 2018, 3, 509, using a suite of open-source materials data science tools. The principal goal of this study is to enable the interested reader to reproduce our previous machine learning model with minimal effort and then to make predictions on the holdout set used in that manuscript. In transcribing our model from the Java-based Magpie/Weka framework to the Python-based Matminer/scikit-learn framework, we noticed an unexpected discrepancy between the predictions of the two platforms. To compare the performance of nominally equivalent random forest regression models across the two platforms, we trained and evaluated 50 replicate models on each platform, fitting each replicate to a random 90% subset of the full hydride training set. On the holdout set, the Magpie/Weka models showed a somewhat higher mean absolute error (5.6 ± 0.4) than the Matminer/scikit-learn models (4.2 ± 0.4), although the validation statistics of the two platforms were within error of one another. Fully analyzing the ultimate source of the variance in these predictions is beyond the scope of this comment, but we speculate that part of it arises from differences in how Magpie treats duplicate compositions in the training set and/or from differences between the random forest implementations in Weka and scikit-learn.
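The replicate protocol described above can be illustrated with a minimal scikit-learn sketch. It is not the code used in this study: the function name `replicate_holdout_mae`, the number of trees, the random seeds, and the synthetic placeholder data are all assumptions made for illustration, and the Matminer featurization step and the hydride data set are omitted entirely. The sketch only shows the general pattern of fitting each replicate to a random 90% subset of the training set and scoring it on a fixed holdout set.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def replicate_holdout_mae(X_train, y_train, X_holdout, y_holdout,
                          n_replicates=50, subset_fraction=0.9, seed=0):
    """Fit one random forest per replicate on a random subset of the
    training set and return the holdout MAE of every replicate.

    Hypothetical helper: the name and the hyperparameters are assumptions.
    """
    rng = np.random.default_rng(seed)
    n_samples = X_train.shape[0]
    n_subset = int(round(subset_fraction * n_samples))
    maes = []
    for _ in range(n_replicates):
        # Draw a random subset of the training rows without replacement.
        idx = rng.choice(n_samples, size=n_subset, replace=False)
        # n_estimators=50 is an assumed value, not taken from the manuscript.
        model = RandomForestRegressor(
            n_estimators=50,
            random_state=int(rng.integers(2**31 - 1)),
        )
        model.fit(X_train[idx], y_train[idx])
        # Score every replicate on the same fixed holdout set.
        maes.append(mean_absolute_error(y_holdout, model.predict(X_holdout)))
    return np.array(maes)


# Synthetic placeholder data standing in for featurized compositions.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + data_rng.normal(scale=0.5, size=200)
X_train, y_train = X[:150], y[:150]
X_holdout, y_holdout = X[150:], y[150:]

# Only 5 replicates here to keep the example fast; the study used 50.
maes = replicate_holdout_mae(X_train, y_train, X_holdout, y_holdout,
                             n_replicates=5)
print(f"holdout MAE: {maes.mean():.2f} ± {maes.std(ddof=1):.2f}")
```

Reporting the mean and sample standard deviation of the per-replicate errors gives a summary of the same form as the values quoted above. The spread reflects both the random subsampling and the randomness internal to the forest.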