Christopher K. H. Borg,*a Eric S. Muckley,a Clara Nyby,a James E. Saal,a Logan Ward,b Apurva Mehta c and Bryce Meredig a
aCitrine Informatics, Inc., Redwood City, CA, USA. E-mail: jsaal@citrine.io
bArgonne National Laboratory, Lemont, IL, USA
cSLAC National Accelerator Laboratory, Menlo Park, CA, USA
First published on 2nd February 2023
The predictive capabilities of machine learning (ML) models used in materials discovery are typically measured using simple statistics such as the root-mean-square error (RMSE) or the coefficient of determination (r2) between ML-predicted materials property values and their known values. A tempting assumption is that models with low error should be effective at guiding materials discovery, and conversely, models with high error should give poor discovery performance. However, we observe that no clear connection exists between a “static” quantity averaged across an entire training set, such as RMSE, and an ML property model's ability to dynamically guide the iterative (and often extrapolative) discovery of novel materials with targeted properties. In this work, we simulate a sequential learning (SL)-guided materials discovery process and demonstrate a decoupling between traditional model error metrics and model performance in guiding materials discoveries. We show that model performance in materials discovery depends strongly on (1) the target range within the property distribution (e.g., whether a 1st or 10th decile material is desired); (2) the incorporation of uncertainty estimates in the SL acquisition function; (3) whether the scientist is interested in one discovery or many targets; and (4) how many SL iterations are allowed. To overcome the limitations of static metrics and robustly capture SL performance, we recommend metrics such as Discovery Yield (DY), a measure of how many high-performing materials were discovered during SL, and Discovery Probability (DP), a measure of likelihood of discovering high-performing materials at any point in the SL process.
Materials discovery offers unique challenges as an ML application area, as evidenced by the extensive works compiled in this virtual issue of ACS Digital Discovery.18 For example, materials of interest often exhibit properties with extreme values, requiring ML models to extrapolate to new regions of property space. This challenge, and methods to address it, have been discussed previously.19,20 Another challenge is representing a material suitably for input to an ML algorithm, either by incorporating domain knowledge or by learning representations from data. Chemical composition-based features21,22 have become widely used in materials discovery, but it is likely that further headroom exists for optimization of materials representations. Finally, many materials informatics applications suffer from a lack of data. While there have been many large-scale data collection efforts,23–25 researchers often extract data by hand to build informative training sets,26–32 which is a highly time-consuming process. These unique challenges motivate the need for greater insight into the potential for success in a given SL-driven materials discovery effort.
Typically, the performance of ML is measured by the improvement in predictive capability of a model, using accuracy metrics such as root-mean-square error (RMSE) and the coefficient of determination r2. While these metrics provide robust estimates of predictive capability against a defined test set, their connection to the ability of SL to identify new, high-performing materials is unclear. In recent studies, the number of experiments necessary to identify a high-performing material has been used as a metric for monitoring SL performance.8,16,33 Modeling benchmark datasets and tools, such as Olympus,34 MatBench,35 and DiSCoVeR,36 have started to standardize assessment of model and dataset performance. Notably, a recent study by Rohr et al.37 considers additional metrics that quantify SL performance relative to a benchmark case (typically random search). Rohr et al. focus their study on identifying top-1% materials from high-throughput electrochemical data, and subsequent research expands on this work to compare performance across a variety of SL frameworks and datasets.38,39 Here, we build upon these works by investigating SL performance metrics for different design problems and specific targets within those design problems. We compare our approach to Rohr et al. in more detail in Section 2.3.
In this work, we explore the topic of SL performance metrics in more detail, generalizing conclusions across multiple datasets and design target ranges. In particular, we identify a decoupling between traditional model error metrics and a model's ability to select high-performance materials. We examine three SL performance metrics: Discovery Acceleration Factor (DAFn), the average number of SL iterations required to identify n materials in a target range; Discovery Yield (DY(i)), the number of materials in a target range identified after i SL iterations (normalized by the total number of materials in the target range); and Discovery Probability (DP(i)), the average number of targets found at a given SL iteration, i. Each metric focuses on the ability of an SL strategy to identify high-performance materials, rather than the error associated with model predictions. We then demonstrate the use of these metrics with a simulated SL pipeline using a commonly available band gap database.35 Next, we focus on the challenge of ML-driven design of thermoelectric (TE) materials. Fundamental TE properties (Seebeck coefficient, electrical conductivity, and thermal conductivity) were extracted from the Starrydata platform40 and used to compute TE figures of merit (ZT and σE0).40 The same simulated SL pipeline was used for these new datasets, and performance metrics were compared to identify the optimal design strategies for TE materials discovery from existing data. We then compare the SL performance metrics and traditional model error metrics to identify general trends across multiple materials domains and compare these results to prior work.
To initialize simulated sequential learning, the SL pipeline must be supplied with a set of user-defined SL configuration parameters and a dataset containing a set of inputs (typically chemical composition) and a property to be optimized. For a given dataset, chemical compositions were featurized with the Magpie elemental feature set21 implemented via the element-property featurizer in Matminer.41 For all discovery tasks, random forests were employed with uncertainty estimates (obtained by calculating the standard deviation in predicted values across estimators) and paired with an acquisition function to enable the selection of a candidate from the design space, as detailed below.
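The uncertainty estimate and acquisition functions described above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the function names and the standard expected-improvement form are our own, since the paper's exact implementation is not reproduced in this excerpt:

```python
import numpy as np
from math import erf
from sklearn.ensemble import RandomForestRegressor

def rf_predict_with_uncertainty(model, X):
    # Mean over trees is the RF prediction; the standard deviation across
    # trees serves as the uncertainty estimate described in the text.
    per_tree = np.stack([tree.predict(X) for tree in model.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)

def expected_improvement(mu, sigma, best):
    # Standard EI for maximization (hypothetical form; the excerpt does not
    # spell out the paper's acquisition functions explicitly).
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return (mu - best) * cdf + sigma * pdf

# EV ("expected value") ranks candidates by mu alone; MU ("maximum
# uncertainty") ranks by sigma alone; RS picks uniformly at random.
```

Each acquisition function maps a candidate's (mu, sigma) pair to a score, and the top-scoring unmeasured candidate is selected at each SL iteration.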
In this work, the following sequential learning configuration was used: ntest = 10% of the dataset, n0 = 50, niter = 100, nbatch = 1, ntrials = 100. This configuration was chosen to align with traditional materials discovery problems. For example, initial training sets were limited to 50 points, as many experimental studies wishing to employ ML are expected to have at least this number of datapoints. In our SL workflow we set nbatch = 1 because experiments are often performed one at a time; however, this parameter is adjustable to values greater than 1. By design, the first step in the SL process is the selection of the training set (as opposed to selecting training set points and then conducting a round of SL); therefore, an SL workflow with niter = 100 has a single step to select training set points followed by 99 rounds of SL.
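The simulated SL loop described above can be sketched on a hypothetical toy dataset. All names and data here are illustrative; for brevity this sketch uses an EV-style acquisition only and omits the 10% holdout set:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy stand-in for a featurized materials dataset: 500 candidates,
# 5 features, one synthetic property to maximize.
X = rng.normal(size=(500, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)
target = set(np.argsort(y)[-50:].tolist())   # top decile as the target range

# n0 = 50 initial training points drawn from outside the target range.
non_target = [i for i in range(500) if i not in target]
train = set(rng.choice(non_target, size=50, replace=False).tolist())

found_at = []                                # iterations where a target was acquired
for it in range(20):                         # niter (100 in the paper), nbatch = 1
    pool = [i for i in range(500) if i not in train]
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(X[sorted(train)], y[sorted(train)])
    mu = rf.predict(X[pool])
    pick = pool[int(np.argmax(mu))]          # EV-style: acquire max predicted value
    if pick in target:
        found_at.append(it)
    train.add(pick)
```

Repeating this loop ntrials times with different random seeds yields the trial statistics from which the discovery metrics below are computed.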
(1) Discovery acceleration factor (DAFn): the average number of SL iterations required to identify n compounds in the target range, relative to random search. For example, if RS takes 10 iterations to identify a target compound and EV takes 5 iterations, then DAF1(EV) = 2, indicating that EV identifies a target compound twice as fast as RS. The numbers of RS iterations for DAF1, DAF3, and DAF5 were estimated to be 10, 30, and 50, respectively, as each target range comprises one decile of the dataset. Rohr et al.37 proposed “acceleration factors” to refer to the number of iterations needed to achieve a particular SL performance goal (e.g., a desired DY or DP value), normalized by the number of iterations needed for RS to accomplish the same goal. The Acceleration Factor (AF) is also more broadly defined as the “reduction of required budget (e.g. in time, iterations, or some other consumed resource) between an agent (model + acquisition function) and a benchmark case (e.g. random selection) to reach a particular fraction of ideal candidates”.38 In our case, DAFn is specific to the number of iterations needed to find n target materials relative to random search.
The equation for DAFn is given by:
\[ \mathrm{DAF}_n = \frac{n_{\text{iter}}^{\text{RS}}(n)}{n_{\text{iter}}^{\text{AF}}(n)} \tag{1} \]
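As a concrete check of this definition, a trivial sketch using the worked example from the text:

```python
def daf(iters_rs, iters_af):
    # DAF_n: iterations for random search to find n targets, divided by
    # iterations for the acquisition function to find the same n targets.
    return iters_rs / iters_af

# Worked example from the text: RS needs 10 iterations, EV needs 5.
print(daf(10, 5))   # 2.0 -> EV finds a target twice as fast as RS
# For decile targets, expected RS iterations for n = 1, 3, 5 are 10, 30, 50.
```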
(2) Discovery yield (DY(i)): the number of compounds in the target range identified after a set number of iterations, divided by the total number of compounds in the target range. For example, the band gap 10th decile range comprises 193 compounds (after removing the holdout set). On average, after 20 iterations, DYi=20(EI) = 0.07 ± 0.02, indicating that ≈7% of targets were discovered. This metric represents, for a given number of experiments, how many high-performing compounds a researcher could expect to find. Rohr et al.37 proposed allALMi as the “fraction of the top percentile catalysts that have been measured by cycle i”. We interpret allALMi to be an equivalent figure of merit to DY(i), but applied to identifying top-1% materials (rather than a decile window).
The equation for DY(i) is given by:
\[ \mathrm{DY}(i) = \frac{N_{\text{found}}(i)}{N_{\text{target}}} \tag{2} \]
(3) Discovery probability (DP(i)): the likelihood of finding a target at a given SL iteration. For example, as shown in Fig. 4, after one iteration of using EI to identify 10th decile band gap materials, DP(1)(EI) = 0.4 (i.e., 40 out of 100 trials identified a target compound after 1 SL iteration). In contrast, after 99 iterations, DP(99)(EI) = 0.6 (i.e., 60 out of 100 trials identified a target compound after 99 SL iterations). This is meant to estimate the likelihood of identifying a target compound at every point in the SL process, independent of previous SL cycles. In contrast to DY(i), DP(i) may increase or decrease with increasing SL iterations.
The equation for DP(i) is given by:
\[ \mathrm{DP}(i) = \frac{n_{\text{trials,found}}(i)}{n_{\text{trials}}} \tag{3} \]
Consequently, DP(i) is also the derivative of DY(i) with respect to iterations when correcting for the total number of targets in the dataset. The equation is given by:
\[ \mathrm{DP}(i) = N_{\text{target}}\,\frac{\mathrm{d}\,\mathrm{DY}(i)}{\mathrm{d}i} \tag{4} \]
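Assuming trial results are logged as a boolean hit matrix (True when the compound acquired at a given iteration lies in the target range), DY(i), DP(i), and the derivative relation in eqn (4) can be sketched as follows (function names are our own):

```python
import numpy as np

def discovery_yield(hits, n_targets):
    # hits: boolean array of shape (n_trials, n_iter). Cumulative count of
    # targets found, averaged over trials and normalized by the target count.
    return hits.cumsum(axis=1).mean(axis=0) / n_targets

def discovery_probability(hits):
    # Fraction of trials that acquired a target at each iteration.
    return hits.astype(float).mean(axis=0)

# Check eqn (4): DP(i) equals N_target times the per-iteration change in DY.
hits = np.random.default_rng(1).random((100, 30)) < 0.2
dy = discovery_yield(hits, n_targets=193)
dp = discovery_probability(hits)
assert np.allclose(np.diff(dy) * 193, dp[1:])
```

Note that DY(i) is monotonically non-decreasing by construction, while DP(i) may rise or fall from one iteration to the next, consistent with the text.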
(4) Root-mean-square error (RMSE): the standard deviation of the residuals calculated from the difference between predicted and actual values of compounds in the holdout set (i.e., compounds removed from the dataset before the SL process is initiated). The RMSE is given by
\[ \mathrm{RMSE} = \sqrt{\frac{1}{N_{\text{holdout}}}\sum_{j=1}^{N_{\text{holdout}}}\left(\hat{y}_j - y_j\right)^2} \tag{5} \]
(5) Non-dimensional model error (NDME): the RMSE normalized by the error of guessing the mean value of the training set for each holdout test point. NDME provides a means to directly compare model accuracy across different properties. This guess-the-mean error (GTME) is defined as
\[ \mathrm{GTME} = \mathrm{std}(y_{\text{holdout set}}) \tag{6} \]
\[ \mathrm{NDME} = \frac{\mathrm{RMSE}}{\mathrm{GTME}} \tag{7} \]
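Eqns (5)–(7) translate directly into code. A sketch: per eqn (6), GTME is the population standard deviation of the holdout values, which equals the RMSE obtained by always guessing their mean:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Eqn (5): root-mean-square error over the holdout set.
    return float(np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)))

def ndme(y_true, y_pred):
    # Eqns (6) and (7): GTME is the holdout-set standard deviation (ddof=0),
    # so a model that always guesses the mean scores NDME = 1 exactly.
    gtme = float(np.std(y_true))
    return rmse(y_true, y_pred) / gtme

y = np.array([1.0, 2.0, 3.0, 4.0])
print(ndme(y, np.full(4, y.mean())))   # 1.0: guessing the mean every time
```

Values of NDME below 1 indicate the model outperforms the guess-the-mean baseline, which is what makes NDME comparable across properties with different units and scales.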
1 − NDME2 is approximately equivalent to the coefficient of determination (r2). However, an important difference between the two metrics is that NDME is normalized by the variance of a sample of a dataset (in this case, the holdout set), whereas r2 is normalized by the variance of the entire dataset.
Properties for thermoelectric materials were extracted from the Starrydata2 thermoelectrics database.40 New records are uploaded daily; at the time of extraction (August 9th, 2021), data from 34738 samples were retrieved. Due to the nature of how thermoelectric properties are reported in the literature, a TE data extraction and property calculation pipeline was developed for this work. This pipeline can be used to extract any number of TE properties available in Starrydata2 at arbitrary temperatures. For sparse records, properties were calculated when possible (e.g., if a given record was missing ZT but the Seebeck coefficient, electrical conductivity, and thermal conductivity were given, ZT was calculated from those extracted values). For all properties, if there was more than one reported value for a particular composition, we took the mean value to be the ground truth. Additionally, we focus on “111-type” compounds, defined as materials containing at least 3 metallic elements in ratios close to 1:1:1. This resulted in dataset sizes of 626, 618, and 705 for ZT, κtotal, and σE0, respectively. More details of the TE data extraction and property calculation pipeline are given in the ESI.†
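For records missing ZT, the standard definition ZT = S²σT/κ can be applied to the extracted values. A sketch of such a back-calculation (σE0 is omitted because its definition is not given in this excerpt, and the function name is our own):

```python
def zt(seebeck, elec_cond, thermal_cond, temperature):
    # ZT = S^2 * sigma * T / kappa (dimensionless), with S in V/K,
    # sigma in S/m, kappa in W/(m K), and T in K.
    return seebeck ** 2 * elec_cond * temperature / thermal_cond

# e.g. S = 200 uV/K, sigma = 1e5 S/m, kappa = 2 W/(m K), T = 300 K
print(zt(200e-6, 1e5, 2.0, 300.0))   # ≈ 0.6
```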
Fig. 2 (a) Histogram of experimental band gaps of crystalline compounds from the MatBench benchmark dataset,35 with the training and target ranges highlighted for the 10th decile target range. Note: this is the distribution of points after holding out a random 10% of the dataset to be used as a test set (n = 2154 − 215 = 1939). (b) Distribution of training set sourced from points outside the target range (n = 50). (c) The resulting candidate pool of points not selected for training and points in the target range (n = 1889).
The discovery acceleration factors (DAFn) to find 1, 3, and 5 compounds were recorded for every combination of decile target range and acquisition function. The results are shown in Fig. 3, where DAFn is presented as a heatmap such that DAFn > 1 (faster than RS) ranges from white to blue and DAFn < 1 (slower than RS) ranges from white to red. DAFn = 1 implies that the acquisition function identifies n targets at a rate equal to that of RS. Our results in Fig. 3 demonstrate that the performance of each acquisition function is strongly dependent on the target decile within the property distribution. For example, EI performs particularly well compared to RS when targeting multiple very high and very low band gap values (DAF5 = 5.6 and 5.0 for 1st and 10th deciles, respectively). However, its advantage relative to RS is much smaller when targeting intermediate band gap values (DAF5 = 2.9 and 3.0 for 5th and 6th deciles, respectively). Fig. 3 also shows that, as one might anticipate, the performance advantage of ML-guided discovery increases for extreme values in the property distribution after finding the first such extreme value. This increase is clearest at the high and low extreme value regions, 1st and 10th decile materials, where EI and EV see a 2-fold increase in performance from identifying a single target material (≈2× faster than RS) to identifying 5 target materials (≈4× faster than RS). Interestingly, despite these two target ranges having different distributions in the dataset (as shown in Fig. 3a), there is a similar acceleration to reach 3 or 5 target compounds, implying that the property value width of a target decile range is not correlated to difficulty in identifying materials in that range. MU, regardless of the target decile, is the slowest to find target compounds and, in deciles below the 8th, is slower than RS, implying a design strategy based only on uncertainty estimates would be unsuccessful.
However, EI appears to be quicker than EV across all target ranges, indicating that the uncertainty-aided acquisition function can generally hasten discovery activities, at least for this dataset.
While Fig. 3 illustrates the performance of sequential learning in discovering a target number of compounds, we now consider an alternative scenario: a fixed number of sequential learning iterations. In Fig. 4, we set niter = 100, and calculate performance metrics across the 100 iterations for the 1st and 10th decile target ranges. This approach simulates the situation where a certain number of experiments can be performed and the optimum design strategy needs to be identified. Interestingly, we find that no single acquisition function performs best in all cases and, for some targets (e.g., 1st decile materials), strategies that identify many target materials do not significantly reduce model error (relative to less performant strategies). In Fig. 4, the Discovery Yield (DYi, i = 1–100) is shown for 1st decile (panel a) and 10th decile (panel d) compounds, providing insight into the relative success in discovering target compounds using different acquisition functions. Panels (b) and (e) show the NDME as a function of SL iteration. For each acquisition function, the number of target compounds found increases with SL iteration and the NDME decreases with SL iteration. However, the rate at which these performance improvements occur depends on the acquisition function, the target range, and SL iteration number. Therefore, the rate of successful target compound identification (represented by the Discovery Probability (DPi, i = 1–100) in panels (c) and (f)) is plotted against NDME to directly compare the relative improvements to materials discovery and model accuracy. For both target ranges, EI and EV find the most target compounds (i.e., have the largest DY100 and DP100 values) of the four acquisition functions. However, for model improvement, the best acquisition function depends on the target range, with MU as the most efficient at reducing NDME for the 1st decile, but EI and EV demonstrating superior performance for the 10th decile. 
For the 1st decile, these performance metrics indicate that a decision must be made between ability to discover target compounds and model accuracy, as no single acquisition function does best in improving both. In contrast, for the 10th decile, EI happens to outperform the other functions along both dimensions. These results highlight the importance of assessing SL performance with metrics related to discovery of target compounds, rather than just model accuracy, as the two sets of metrics are not necessarily correlated with one another.
Histograms of the 111-type thermoelectric properties (ZT, κtotal, and log(σE0)) in our dataset are shown in Fig. 5a–c, illustrating a unique distribution for each property. In general, small values of ZT are most plentiful; κtotal exhibits a broad peak at small and intermediate values; and log(σE0) has a peak at large values with negative skew. The discovery acceleration factors to identify 1, 3, or 5 target materials for the three thermoelectric properties are also shown in Fig. 5. Notably, we observe that it is more difficult to identify high-performing materials (large ZT, large log(σE0), and small κtotal) than low-performing materials across all three properties. EI and EV are almost always faster at identifying target materials than RS, with notable exceptions being large log(σE0) and small κtotal. This may indicate the difficulty of finding new high-performance thermoelectric materials using these design metrics. As the number of targets increases, EI and EV tend to improve relative to RS, although a few exceptions to this trend are apparent. For example, when targeting κtotal 1st decile materials, only MU exhibits improvement over RS when identifying 3 or 5 target materials, offering greater performance than EV or EI. This is particularly interesting, as MU does not perform well for high-performing ZT or log(σE0) targets. While EI appears to be a strong default acquisition function across properties, our results here suggest that an opportunity exists to tailor acquisition functions to specific problems, or to tune acquisition functions on the fly during SL processes.
The results of simulating SL to find high-performing thermoelectric materials (10th decile ZT, 1st decile κtotal, and 10th decile log(σE0)) for 100 iterations are shown in Fig. 6. For ZT, EI and MU appear to be the highest-performing design strategies, with better ability to improve both model accuracy and target identification over EV, indicating a need for model uncertainty information in ZT design. In contrast, a choice must be made for log(σE0) design to either identify high-performance materials (with EI) or improve model accuracy (with MU). κtotal has the unique behavior that MU performs best for both improving model accuracy and finding target compounds, indicating a particular difficulty in modeling thermal conductivity. Amongst the three properties, κtotal has the lowest Discovery Yield and highest NDMEs. A further observation is that, across many use cases we examine here, EV is not the most performant acquisition function. However, an important use case for EV is apparent in Fig. 5: discovery in early SL iterations. If resource constraints limit a materials discovery effort to only a few candidate evaluations, EV may be the appropriate choice of acquisition function for certain tasks (e.g., finding low thermal conductivity materials).
Fig. 7 Discovery probability plots for TE properties in datasets with varying metadata (marker size increases with SL iteration number): (a–c) all 111-type samples (as presented in Fig. 6), (d–f) bulk-only samples, (g–i) sample form added as an input to ML model training. The error bars (denoted by the transparent colored bands) represent the standard deviation between trials (ntrials = 100).
The results of running the SL workflow with different amounts of metadata are presented in Fig. 7, along with the original workflow for comparison. The most notable differences are seen between the bulk samples and all samples. For κtotal, EI results in the largest DP(i) when conducting the SL workflow on only bulk samples, while MU results in the largest DP(i) when conducting the SL workflow on all samples or when including sample form as an input. This finding correlates with experimental observations,50 which identified a microstructural effect on thermal conductivity for 111-type compounds, and suggests that high-performance SL for κtotal is only possible with a high-quality description of microstructure. Further, this illustrates that microstructural differences have a direct effect on SL outcomes and performance, highlighting the importance of integrating domain knowledge into ML material representations. Additionally, we note that EV performs relatively poorly in both workflows, indicating the benefits of including uncertainty estimates with ML predictions. For log(σE0), EI is less performant when conducting the SL workflow on only bulk samples, suggesting that changing the shape of the target property distribution can have consequences for performance. It may also be important to examine the behavior of SL for different quantities of training data. Including “sample form” as an input did not change the SL results meaningfully when compared to a composition-only model. This is likely because a large number of samples have no “sample form” label: over 50% of samples had empty or unknown entries for this label.
Our results demonstrate that SL performance depends sensitively on the target range within the property distribution, how many SL iterations are allowed, the number of targets to discover, and the incorporation of uncertainty estimates in the SL acquisition functions. The nature of the design problem has a large effect on the possibility of SL success. The heatmaps in Fig. 3 and 5 indicate a large variation in DAFn to find targets for a given property amongst the different target ranges. For instance, finding one material within the 10th decile range of ZT values is much more difficult than finding one in the 9th decile range. This is likely dependent on the distribution of materials property values in the dataset, particularly in the feature space where the models are trained. Such heatmaps can be used to optimize model architecture and design problems for the particular distribution of a given dataset. For instance, it appears particularly difficult to identify 1st decile κtotal materials (Fig. 5 middle column), at least without knowledge of microstructure (as suggested in Fig. 7). Thus, one possible screening strategy could be to filter out materials predicted to lie in the 2nd to 10th deciles and focus on the remaining candidates.
Simulated sequential learning on existing data can inform the selection of an acquisition function for ongoing SL. Based on DAFn heatmaps (such as those in Fig. 3 and 5), the difficulty of a particular target range can be assessed. DY(i) and DP(i) plots, such as those in Fig. 4 and 6, can be used to estimate the trade-off between materials discovery and model accuracy at early and later iterations of SL. The optimal discovery approach depends on the particular materials design goal at hand, and no single acquisition function necessarily performs best in all cases, although EI appears to be a robust default choice.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2dd00113f
This journal is © The Royal Society of Chemistry 2023 |