Quinn M. Gallagher and Michael A. Webb*
Chemical and Biological Engineering, Princeton University, Princeton, NJ 08544, USA. E-mail: mawebb@princeton.edu
First published on 3rd December 2024
Active learning and design–build–test–learn strategies are increasingly employed to accelerate materials discovery and characterization. Many data-driven materials design campaigns require that materials are synthesizable, stable, soluble, recyclable, or non-toxic. Resources are wasted when recommended materials do not satisfy these constraints. Acquiring this knowledge during the design campaign is inefficient, and many materials constraints transcend specific design objectives. However, there is no consensus on the most data-efficient algorithm for classifying whether a material satisfies a constraint. To address this gap, we comprehensively compare the performance of 100 strategies for classifying chemical and materials behavior. Performance is assessed across 31 classification tasks sourced from the literature in chemical and materials science. From these results, we recommend best practices for building data-efficient classifiers, showing that neural network- and random forest-based active learning algorithms are the most efficient across tasks. We also show that classification task complexity can be quantified by task metafeatures, most notably the noise-to-signal ratio. These metafeatures are then used to rationalize the data efficiency of different molecular representations and the impact of domain size on task complexity. Overall, this work provides a comprehensive survey of data-efficient classification strategies, identifies attributes of top-performing strategies, and suggests avenues for further study.
Materials optimization often targets a constrained domain. Consequently, resources can be wasted on candidates unsuitable for further characterization. Common constraints on materials domains include synthesizability, unwanted phase behavior, instability, and toxicity. For example, when surveying a polymer library for enzyme-stabilizing candidates, Tamasi et al. encountered phase-separating or aggregating polymers unsuitable for physical assays with the target enzyme.9 Likewise, Körbel et al. surveyed 1276 hybrid organic–inorganic halide perovskites of the form A⁺B²⁺X₃⁻, of which only 203 compounds were considered sufficiently stable for further density functional theory calculations.16 An et al. sought peptide sequences that would form condensed phases with disparate dynamical properties,7 yet no phase-separating systems were identified in an initial survey of 1266 peptides listed in the DisProt database.17 Ideally, such behavior would be known or predicted from the outset and incorporated into the data-selection process for any given design campaign. Additionally, knowledge of materials classification can be applied across varied design objectives. Therefore, a viable strategy is to allocate a portion of the resource budget to accurately classifying viability within a materials domain, avoiding wasted resources on unsuitable candidates. To maximize resource use, it is desirable to use a data-selection strategy and classification algorithm that achieves the highest accuracy with the fewest measurements.
Numerous and varied classification schemes can be found across the literature. Terayama et al. used uncertainty-based active learning to build phase diagrams of H2O, glass-ceramic glazes, block copolymers, and more18–20 using label propagation, a semi-supervised machine learning model.21 Citing the computational expense of the label propagation algorithm, Telleria-Allika et al. used a random forest-based active learning scheme to build magnetic and covalency phase diagrams for few-electron Hooke atoms and helium dimers.22 Dai and Glotzer used active learning based on a Gaussian process least-squares classifier and a novel acquisition function to learn the phase diagrams of active Brownian particles and quasi-crystals.23 Hickman et al. used Gaussian processes to simultaneously classify viability and optimize performance for several materials design tasks, including small molecule drugs and perovskites.24 Focusing on the low-data regime, Bhat and Kitchin used heuristics, rather than active learning, to identify classification boundaries in several engineering problems, asserting that active learning would be ineffective in their low-data limit.25 Other works have continued the trend of applying novel active learning schemes to custom design tasks.26–30 The diversity of considered tasks and proposed algorithms indicates no consensus on what constitutes an optimal approach or how to select reasonable strategies.
Here, we investigate the performance of various algorithms across a set of 31 classification tasks primarily sourced from chemical and materials science. From these results, we identify algorithms that perform optimally and the attributes that lead to maximum data efficiency. We also explore approaches to building classification algorithms that are robust to task variation. To explain algorithm performance across tasks, we demonstrate that metafeatures (i.e., properties of classification tasks) predict an algorithm's performance, with a few metafeatures strongly correlating with classification accuracy regardless of algorithm choice. Additional metafeature analysis demonstrates why a limited set of physico-chemical descriptors can outperform common high-dimensional representations and also highlights the influence of domain size on task complexity. Through this study, we identify best practices for selecting data-efficient classification algorithms and explain why these practices improve performance.
Name | Size | Dim. | Domain | Label | Ref. |
---|---|---|---|---|---|
bace | 1513 | 10 | Small molecules | Inhibition of human β-secretase 1 | 31 |
bear | 1800 | 4 | 3D-printed structures | High mechanical toughness | 32 |
clintox | 1480 | 10 | Small molecules | FDA approval | 31 |
diblock | 5376 | 3 | Diblock copolymers | Lamellar phase | 33 |
electro | 285 | 4 | Electrocatalysts | High stability | 34 |
esol | 1128 | 10 | Small molecules | Low aqueous solubility | 31 |
free | 642 | 10 | Small molecules | Low hydration free energy | 31 |
glotzer_pf | 10 000 | 2 | ABP phase diagram (constant PF) | Phase separating | 23 |
glotzer_xa | 10 000 | 2 | ABP phase diagram (constant xA) | Phase separating | 23 |
hiv | 7215 | 10 | Small molecules | Active HIV inhibitors | 31 |
hplc | 1385 | 5 | HPLC process parameters | Low photodegradation | 35 |
lipo | 4200 | 10 | Small molecules | Low lipophilicity | 31 |
muv | 5000 | 10 | Small molecules | Toxicity | 31 |
oer | 2121 | 6 | OER catalysts | Low overpotential | 36 |
oxidation | 1275 | 2 | Ternary alloys | Oxidation susceptibility | 25 |
perovskite | 1276 | 14 | Perovskites | Stability | 24 |
polygel | 9856 | 9 | Polymethacrylates | Predicted solubility | |
polysol | 6524 | 11 | Common polymers and solvents | Solubility | 37 |
princeton | 6390 | 2 | Princeton “P” | Inside the “P” | |
qm9_cv | 6695 | 10 | Small molecules | Low CV | 38 |
qm9_gap | 6695 | 10 | Small molecules | Low band gap | 38 |
qm9_r2 | 6695 | 10 | Small molecules | Low spatial extent | 38 |
qm9_u0 | 6695 | 10 | Small molecules | Low internal energy at 0 K | 38 |
qm9_zpve | 6695 | 10 | Small molecules | Low ZPVE | 38 |
robeson | 353 | 10 | Linear homopolymer membranes | Above the 1999 Robeson bound | 39 |
shower | 625 | 2 | Flow rates | Satisfactory temperature | 25 |
toporg | 1342 | 8 | Polymer topologies | Low radius of gyration | 40 |
tox21 | 7831 | 10 | Small molecules | Toxicity | 31 |
vdw | 625 | 2 | Thermodynamic conditions | Phase separation | 25 |
water_hp | 625 | 2 | Thermodynamic conditions (high P) | Ice | 19 |
water_lp | 625 | 2 | Thermodynamic conditions (low P) | Liquid water | 19 |
Some tasks require a molecular representation. For these tasks, molecules are represented by the ten most informative physico-chemical features calculated by the Mordred descriptor calculator41 for the given property. Descriptors are selected by training a logistic regression model with an L1 penalty on the full dataset, with molecules represented by all available Mordred descriptors, and keeping the ten descriptors with the largest absolute coefficients. This scheme emulates molecular design campaigns that use a set of expert-informed features as a molecular representation.42 Graph-based43 and physics-informed structural44 representations are not considered in this study; the impact of other molecular representations (e.g., fingerprints and larger descriptor sets) is examined in Section 4.7.
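A minimal sketch of this selection step is shown below, assuming a matrix X_all of all Mordred descriptors and binary labels y for the full dataset; the standardization, regularization strength C, and function name are illustrative choices rather than details of the original workflow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def select_top_descriptors(X_all: np.ndarray, y: np.ndarray, n_keep: int = 10) -> np.ndarray:
    """Return indices of the n_keep descriptors with the largest absolute
    coefficients from an L1-regularized logistic regression fit on the full dataset."""
    X_std = StandardScaler().fit_transform(X_all)  # put descriptors on a common scale
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0, max_iter=5000)
    clf.fit(X_std, y)
    return np.argsort(np.abs(clf.coef_.ravel()))[::-1][:n_keep]
```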
Some tasks are prepared from datasets with continuous properties. For these datasets, the task is to classify elements of the domain with property values below the 20th percentile of the property distribution. Some tasks, like those derived from QM9,38 are taken from large datasets that would be too computationally intensive for exhaustive consideration in our high-throughput survey. If such datasets have continuous properties as labels, we subsample the dataset with stratification to preserve the property distribution. If such datasets have discrete properties as labels, we subsample the dataset such that the minority class is approximately 20% of the observations. While the high-throughput survey is restricted to datasets with domain sizes of 10 000 or fewer, the impact of domain size is further examined in Section 4.8.
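The preparation rules for a continuous-property dataset can be sketched as follows; the number of stratification bins and the helper name are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def prepare_continuous_task(X, y, max_size=10_000, seed=0):
    """Binarize a continuous-property dataset at the 20th percentile, first
    subsampling large domains with stratification on property quantiles."""
    if len(y) > max_size:
        # ten quantile bins preserve the shape of the property distribution
        edges = np.quantile(y, np.linspace(0, 1, 11)[1:-1])
        bins = np.digitize(y, edges)
        X, _, y, _ = train_test_split(
            X, y, train_size=max_size, stratify=bins, random_state=seed
        )
    labels = (y < np.percentile(y, 20)).astype(int)  # class 1 = "low" property
    return X, labels
```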
While the random sampler chooses points at random, non-random samplers choose points that optimize a specific metric. Maximin sampling, also called furthest-point sampling, sequentially selects the point that maximizes the minimum Euclidean distance to all previously chosen points. A medoids sampler chooses the centroids produced by the k-medoids algorithm, which selects a set of points that minimizes the average squared Euclidean distance from each point in the domain to its nearest point in the sample. A max entropy sampler, a method created by Paiva,48 chooses a maximally informative set of points by sequentially selecting the point in the domain x* that solves
[Equations (1) and (2) not reproduced.]
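For reference, the maximin (furthest-point) sampler described above admits a short greedy implementation; the random choice of the first point is an assumption.

```python
import numpy as np

def maximin_sample(X: np.ndarray, n_points: int, seed: int = 0):
    """Greedy furthest-point sampling over the rows of X."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]               # arbitrary starting point
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < n_points:
        nxt = int(np.argmax(min_dist))                   # furthest from the current set
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return selected
```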
The BKDE model is inspired by the Gryffin52 and Phoenics53 algorithms. The kernel density of each point is measured using the outputs of a Bayesian autoencoder fit to the training data. Specifically, the kernel density at point x due to a measured point xk can be written as:
[Equations (3) and (4) not reproduced.]
For BKDE-based active learning algorithms, a custom batch-selection scheme is used due to the computational expense of refitting BKDE to new data. We define ρ̃k(x) = ρk(x)/ρk,max as the normalized kernel density, so that ρ̃k(x) ∈ [0, 1]. ρ̃k(x) represents the influence of point xk on every point x in the domain with a value between 0 and 1. Before batch selection, the uncertainties of every point in the domain are computed, denoted u0(x). Batch selection begins by selecting the point with the highest uncertainty, denoted x1. When this point is selected, ρ̃1(x) is computed. The uncertainties are then recomputed by reducing their magnitude by a factor proportional to the influence of x1 at each point, producing a new uncertainty function u1(x) = u0(x) × (1 − ρ̃1(x)). Consequently, uncertain points uninfluenced by x1 remain uncertain, while those near x1 are less likely to be chosen. The point x2 that maximizes u1(x) is then chosen, and the process is repeated until the desired batch size is reached. This method allows a diverse batch of points to be selected by BKDE-based active learning algorithms without retraining the model for each acquired point.
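A minimal sketch of this batch-selection scheme is given below, assuming a precomputed uncertainty vector u0 over the domain and a hypothetical `influence` helper that returns the normalized kernel density ρ̃ of a selected point evaluated over the whole domain.

```python
import numpy as np

def bkde_select_batch(u0: np.ndarray, X: np.ndarray, batch_size: int, influence):
    """Greedily pick the most uncertain points, damping uncertainty near each
    newly selected point by its normalized kernel density (influence)."""
    u = u0.copy()                                # u0(x): initial model uncertainties
    batch = []
    for _ in range(batch_size):
        u_masked = u.copy()
        u_masked[batch] = -np.inf                # never reselect an acquired point
        idx = int(np.argmax(u_masked))           # most uncertain remaining point
        batch.append(idx)
        u = u * (1.0 - influence(X, idx))        # u_{i+1}(x) = u_i(x) * (1 - rho_tilde(x))
    return batch
```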
[Equation (5) not reproduced.]
For any given task, what differentiates “good” from “bad” Macro F1 scores can be ambiguous. Inspired by the use of random selection as a baseline in optimization literature,55 we define a new metric, ξ, as the number of randomly selected points a nearest neighbor classifier requires to achieve the same Macro F1 score as the specified algorithm. We further define ξmax as the maximum ξ achieved by any algorithm on the task. Then, ξ/ξmax describes how close an algorithm is to the best performance on a given task. Metrics like ξ and ξ/ξmax quantify efficiency in terms of resources saved by employing a given algorithm compared to a naive approach.
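As an illustration, ξ can be read off a precomputed random-baseline learning curve; the array layout and capping behavior below are assumptions.

```python
import numpy as np

def xi(algorithm_f1: float, baseline_f1: np.ndarray) -> int:
    """Number of randomly selected points the nearest-neighbor baseline needs
    to match algorithm_f1, where baseline_f1[n] is its Macro F1 after n points
    (averaged over random seeds)."""
    reached = np.nonzero(baseline_f1 >= algorithm_f1)[0]
    # cap at the largest budget evaluated if the baseline never catches up
    return int(reached[0]) if reached.size else len(baseline_f1)
```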
Predictive metafeatures for each algorithm are identified by fitting a linear regression model of metafeatures to the algorithm's 31 Macro F1 scores across all tasks. A minimal set of predictive metafeatures common to all algorithms is determined using sequential feature addition. Sequential feature addition starts by constructing linear models of individual metafeatures for all algorithms. The metafeature ψ1 that results in the lowest mean absolute error (MAE) is added to the set of selected metafeatures, with MAE computed via leave-one-out cross-validation. The process is repeated with combinations of ψ1 and additional metafeatures, adding the metafeature ψ2 that results in the lowest MAE. This iterative process continues until MAE decreases by less than 1%. The final set of metafeatures {ψi} is used to build maximally predictive linear models of algorithm performance across tasks. BKDE-based algorithms are excluded from this analysis due to the inability of metafeatures to predict their performance.
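A sketch of this sequential-feature-addition loop is given below for a single algorithm (the study aggregates the selection over all algorithms); M is a tasks-by-metafeatures matrix, f1 holds the algorithm's 31 Macro F1 scores, and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

def sequential_feature_addition(M: np.ndarray, f1: np.ndarray, tol: float = 0.01):
    """Greedily add metafeatures while leave-one-out MAE improves by >= tol (1%)."""
    remaining, selected = list(range(M.shape[1])), []
    best_mae = np.inf
    while remaining:
        scores = {
            j: -cross_val_score(
                LinearRegression(), M[:, selected + [j]], f1,
                cv=LeaveOneOut(), scoring="neg_mean_absolute_error",
            ).mean()
            for j in remaining
        }
        j_best = min(scores, key=scores.get)
        if best_mae - scores[j_best] < tol * best_mae:   # <1% improvement: stop
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_mae = scores[j_best]
    return selected
```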
Fig. 3A shows the 20 highest-performing algorithms, by accuracy, as measured by average relative Macro F1 score across all tasks for ten rounds of active learning. NN- and RF-based active learning algorithms are the most accurate classifiers regardless of sampler choice, representing 10 out of the top 11 algorithms. Most variants of XGB-based active learning algorithms are also present in the top 20, along with a few GP- and SV-based active learning algorithms. Space-filling algorithms are notably missing from the top performers, suggesting the value of iterative data acquisition, which is further analyzed in Section 4.3. The choice of sampler does not clearly affect performance of these algorithms, with roughly equal representation of all samplers. The presence of all NN- and RF-based active learning algorithms in the top 20 suggests that choice of model is more important than choice of sampler. While the results are statistically robust, we note that the top 20 most accurate algorithms differ in relative Macro F1 scores by at most ca. 0.04; the practical implication of such a difference would require additional external evaluation.
To better characterize the data efficiency of algorithms, we consider 〈ξ/ξmax〉, the performance relative to a naive algorithm that achieves equivalent accuracy. For example, on a given task, an algorithm may achieve a Macro F1 score of 0.7, while the best algorithm achieves a Macro F1 score of 0.8. The naive method may require 1000 measurements to reach the score of the first algorithm and 2000 measurements to match the best algorithm. In this case, the value of ξ/ξmax is given by 1000/2000 = 0.5. This process can be repeated for each random seed of every algorithm applied to each task, enabling estimation of 〈ξ/ξmax〉 across all tasks.
Fig. 3B ranks the top 20 algorithms by 〈ξ/ξmax〉 for all tasks after ten rounds of active learning. Fig. 3B shows that NN-based active learning algorithms are the clear top performers, regardless of sampler, when using this metric. RF-based active learning algorithms follow closely behind, followed by a variety of Gaussian process-based active learning methods. Compared to Fig. 3A, the metric in Fig. 3B provides greater stratification in algorithm performance for high values of Macro F1. As F1 scores tend to 1.0, more and more points are required by a naive algorithm to improve its accuracy, which is reflected only by small increases in Macro F1 score. When appropriately weighting the relative “effort” required for getting a high-resolution understanding of the task, NN-based active learning algorithms emerge as a consistent top performer. However, the maximum value of 〈ξ/ξmax〉 achieved by any algorithm is less than 0.8, indicating that even the top-performing algorithms are not necessarily optimal for many tasks.
The ordering in Fig. 3 reflects an average across all tasks and specifically follows after ten rounds of active learning. Variants of Fig. 3 for different subsets of tasks and fewer points selected are available in the ESI (see Section S2†). We find that performance varies depending on the dimensionality of the tasks. Fig. S1† shows that when only low-dimensional tasks (d ≤ 8) are considered, NN-based active learning algorithms greatly outperform all alternatives. Fig. S2† shows that when higher dimensional tasks are considered (d > 8), tree-based algorithms perform better, and there is not a clear advantage to using either NN- or RF-based active learning algorithms.
Fig. S3–S5† show how the results in Fig. 3 change when fewer points are selected. At only three rounds of active learning, space-filling algorithms with a variety of models are present in the top 20 algorithms (Fig. S3†). The top space-filling algorithm, which uses the medoids sampler and neural network model, remains in the top 20 until five rounds of active learning (Fig. S4†), closely followed by GP-based space-filling algorithms. NN-based active learning algorithms are the top-performing algorithms for three rounds of active learning onwards, while RF-based active learning algorithms do not emerge as the clear second best choice until five rounds of active learning. Results are mostly consistent with Fig. 3 for seven rounds of active learning (Fig. S5†). Therefore, the results of Fig. 3 are consistent for many rounds of active learning, but when few batches have been selected, NN-based active learning algorithms are optimal.
From these results, we suggest using NN- or RF-based active learning algorithms for building accurate classification models on domains with a limited experimental budget, with RFs appearing preferable for higher-dimensional tasks. This guidance seemingly runs counter to conventional wisdom regarding the relative ineffectiveness of neural networks in low-data regimes and the common use of Gaussian processes for AL/BO. It may be interesting to consider whether prior studies (such as ref. 18, 20 and 23–25) might have been more data-efficient had they opted for a different strategy.
Fig. 4 Summary of suboptimal algorithms. Algorithms are considered suboptimal if ξ/ξmax < 0.9 on every task. Data is stratified by algorithm type and model choice.
The poor performance of BKDE-based algorithms may be attributed to several factors. First, unlike GPs and SVs, BKDE does not use training labels when estimating kernel densities, reducing predictive accuracy. Second, BKDE relies on a Bayesian autoencoder to estimate kernel densities, which can be inaccurate with limited training data. Third, BKDE's kernel density estimates rapidly decay to zero with distance, leading to high uncertainties across much of the task domain. This causes BKDE-based active learning algorithms to fail in prioritizing points near classification boundaries, reducing accuracy. Consequently, BKDE-based active learning and space-filling algorithms perform similarly across tasks.
The poor performance of LP-based algorithms is likely due to two reasons. First, LP models assign classes to unlabeled points based on neighboring labeled points defined by Euclidean distance. Unlike anisotropic GPs, XGBs, RFs, and NNs, LP models do not have a mechanism to ignore irrelevant features. Second, LP models assign high uncertainties to points near an identified classification boundary but not to points far from those already chosen. As a result, LP models can miss classification boundaries not initially discovered by the sampler. This likely explains why space-filling LP algorithms outperform active learning LP algorithms. Other methods address this issue by explicitly increasing the uncertainty of distant points (e.g., GPs) or using model ensembles to encourage uncertainties in less sampled regions (e.g., RFs, NNs, XGBs).
Fig. 5 shows that active learning does not always outperform space filling, especially with few rounds of active learning. In the first round, active learning outperforms space filling in less than 50% of cases, suggesting little initial benefit. This fraction increases to about 65% by round 10. To avoid misleading results from poorly performing models, we also analyze the top-performing models from Fig. 3B. In this case, active learning outperforms space filling about 50% of the time in the first round, increasing to nearly 80% by round 10. While active learning generally outperforms space filling, these results indicate that an arbitrary active learning scheme may not always surpass its space-filling variant on a given task.
Fig. 5 Controlled comparison of active learning (AL) and space filling (SF). Data corresponds to the fraction of instances that AL outperforms SF when using the same sampler, model, and seed. Data is also stratified for different rounds of AL. For a given round of AL, the AL algorithm is compared to a SF algorithm that has selected the same number of points. The fractions considering all sampler, model, and seed choices are shown in blue. The fractions considering only the top 20 sampler, model, and seed choices in Fig. 3B include the blue and orange bars. Statistics are aggregated over all tasks.
The results in Fig. 5 depend on the tasks considered. Some tasks (e.g., princeton, tox21, electro) deviate from aforementioned trends (Fig. S6–S8†). In these cases, space-filling algorithms consistently outperform active learning algorithms. We note that these are also among the most difficult classification tasks, as indicated by the mean performance of all algorithms. When all algorithms struggle, the performance gap between active learning and space filling is less meaningful. Additionally, datasets like princeton have complex classification boundaries that benefit from allocating more experimental budget to exploring the task domain rather than refining an already discovered boundary.
Based on these results, we recommend using active learning algorithms for data-efficient classifiers but acknowledge that factors such as (i) the number of active learning rounds and (ii) the expected complexity of the classification task can influence the relative performance of active learning and space filling. Determining the optimal choice of active learning, space filling, or combinations thereof is left for future work.
Fig. 6 shows the impact of sampler choice on performance for both active learning and space-filling algorithms. For active learning algorithms (Fig. 6A), sampler choice has a minor effect, with maximin and medoids samplers providing a slight improvement over random sampling, though the difference diminishes with increasing rounds of active learning. Vendi and max entropy samplers perform similarly to or worse than random sampling, with max entropy showing lower performance overall. In contrast, for space-filling algorithms (Fig. 6B), the medoids sampler consistently outperforms other options at all training set sizes, while maximin and Vendi samplers are only modestly better than random sampling. Here, the max entropy sampler also performs worse than random selection. Notably, sampler choice has a more sustained influence in space-filling algorithms than in active learning, where additional rounds of selection reduce the initial impact of the sampler. These results suggest that while active learning reduces the dependency on the initial sampling strategy over time, for space-filling algorithms, the medoids sampler provides the most robust performance across training set sizes. Thus, the medoids sampler is recommended for both active learning algorithms with limited rounds and for space-filling algorithms generally.
Fig. 7A shows the relative performance based on 〈ξ/ξmax〉 of NN-, RF-, and ensemble-based active learning algorithms across all tasks. While ensembles rank among the top-performing algorithms, they do not consistently outperform NN-based active learning algorithms. However, the results in Fig. 7 are task-dependent, suggesting that ensemble-based active learning may be beneficial for certain types of tasks.
To determine if ensemble-based active learning algorithms outperform NN- and RF-based algorithms on specific tasks, we analyze two task sets. Fig. 7B shows tasks where NN-based algorithms are the top performers (n = 9). Fig. 7C shows tasks where RF-based algorithms excel (n = 10). Ensemble schemes generally outperform individual models on tasks for which they are not optimal. This effect is strongest for tasks where NN-based algorithms excel and less pronounced for RF-based tasks. Thus, using ensemble-based active learning may mitigate the risk of selecting a suboptimal model for any given task.
Fig. 8 shows that a limited set of task metafeatures identified by sequential feature addition can reasonably predict task complexity. Fig. 8A shows results of using just four metafeatures (noise-to-signal ratio,56 maximum weighted distance between two points in the task domain,59 maximum mutual information between features and labels, and the performance of the linear discriminant classifier) to predict the accuracy of all algorithms across all tasks. To reduce the influence of poorly performing algorithms, the same analysis is performed using just the top 20 algorithms. This yields Fig. 8B, which uses noise-to-signal ratio, the average mutual information between features and labels, and the performance of the naive Bayes classifier. In both cases, simple linear models based on these few features capture the data well.
The particular metafeatures selected resonate with intuition. Noise-to-signal ratio is the most predictive metafeature of algorithm performance across all tasks. Linear models using only this ratio achieve an MAE of 11.260% and R2 = 0.619 for all algorithms, and an MAE of 8.370% and R2 = 0.696 for top-performing algorithms. Tasks with low noise-to-signal ratios require fewer measurements because each measurement provides valuable information about labels. Related to the noise-to-signal ratio, mutual information and the performance of the naive Bayes classifier indicate how useful individual features are for predicting labels. When features are individually predictive of labels, less data is required for accurate predictions than for tasks where features are uninformative. The maximum weighted distance between point pairs59 likely identifies outliers in the task domain, which require more measurements to account for their influence. The performance of the linear discriminant classifier indicates the linear separability of a task. Linearly separable tasks have simple classification boundaries, requiring less data for accurate prediction. In simple and expected terms, less data is needed to train accurate models for tasks with informative features, few outliers, and linearly separable classes.
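For illustration, the two most prominent metafeatures can be estimated as follows; the noise-to-signal definition used here (mean attribute entropy minus mean mutual information, divided by mean mutual information, on discretized features) is a common metalearning convention and is an assumption rather than the exact implementation of ref. 56.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def _entropy(v: np.ndarray, bins: int = 10) -> float:
    """Shannon entropy (nats) of a feature after histogram discretization."""
    counts, _ = np.histogram(v, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def noise_to_signal(X: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Return (noise-to-signal ratio, mean mutual information) for a task."""
    mi = mutual_info_classif(X, y, random_state=0)   # per-feature MI with labels
    mean_mi = float(mi.mean())
    mean_h = float(np.mean([_entropy(X[:, j]) for j in range(X.shape[1])]))
    return (mean_h - mean_mi) / mean_mi, mean_mi
```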
Fig. 9A shows the performance of active learning algorithms, grouped by model type, for different molecular feature sets after ten rounds of active learning. Increasing the number of Mordred descriptors from 10 to 20 or 100 generally improves performance. However, using all available Mordred descriptors often leads to only slight gains or even notable performance drops compared to using just 10 descriptors. Replacing 10 Mordred descriptors with 1024 bit Morgan fingerprints significantly reduces performance for all algorithms. Tree-based models (i.e., RF and XGB) exhibit smaller declines in performance with larger feature sets than neural networks or kernel-based methods, consistent with the observed superiority of tree-based models over deep learning models on high-dimensional, tabular datasets.62
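For context, the two representation families compared above can be generated with common cheminformatics tooling; this sketch assumes SMILES inputs and standard RDKit/Mordred calls rather than the authors' exact pipeline.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from mordred import Calculator, descriptors

def _fp_to_array(fp, n_bits: int = 1024) -> np.ndarray:
    arr = np.zeros((n_bits,))
    DataStructs.ConvertToNumpyArray(fp, arr)  # copy the 0/1 bits into a NumPy array
    return arr

def featurize(smiles_list):
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    # 1024-bit Morgan fingerprints (radius 2)
    fps = np.vstack([
        _fp_to_array(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024))
        for m in mols
    ])
    # full table of 2D Mordred descriptors; the descriptor selection described
    # in Section 3 would then reduce this to the 10-100 most informative columns
    mordred_df = Calculator(descriptors, ignore_3D=True).pandas(mols)
    return fps, mordred_df
```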
To elucidate the results of Fig. 9A, we examine distributions of noise-to-signal ratio and average mutual information values obtained across tasks for each representation (Fig. 9B). As the number of Mordred descriptors increases, the noise-to-signal ratio distribution shifts upward, indicating more uncorrelated variation in the feature vectors relative to the property of interest. In contrast, Morgan fingerprints exhibit a consistently higher noise-to-signal ratio than Mordred descriptors. Conversely, the average mutual information decreases as more Mordred descriptors are added, with Morgan fingerprints displaying markedly low mutual information across all tasks, as expected for binary fingerprints. These trends suggest that larger feature sets introduce more noise, complicating model training by requiring more complex functions to predict outputs accurately. Consequently, data efficiency declines as models must process additional, often irrelevant, information. These findings support the use of minimal feature sets highly correlated with the property of interest to maximize data efficiency in classification tasks, aligning with recent work highlighting the effectiveness of using fewer features for molecular property prediction.63
Fig. 10 shows the distribution of performance changes when domain size increases. For all tasks, the performance of each algorithm (measured by Macro F1) on the smaller domain is within the standard error of its performance on the larger domain. The symmetric shape and small bandwidth of the distribution indicate that domain size does not systematically affect algorithm performance. These results suggest that in a low-data regime, where training sets are much smaller than the domain size, domain size has little impact on the data efficiency of algorithms. We speculate this is because task complexity influences algorithm performance more than domain size. To show this, we compute metafeatures predictive of algorithm performance (as determined in Section 4.6) for both small and large versions of the tasks (Table 2). The datasets exhibit minimal changes in noise-to-signal ratio, average mutual information, and maximum mutual information with increased domain size. These findings support the idea that domain size does not inherently beget task complexity, which ultimately dictates algorithm performance.
Task | Noise-to-signal ratio (small) | Noise-to-signal ratio (large) | Mean mutual information (small) | Mean mutual information (large) | Max mutual information (small) | Max mutual information (large) |
---|---|---|---|---|---|---|
glotzer_pf | 14.4 | 18.3 | 0.285 | 0.286 | 0.285 | 0.286 |
water_lp | 15.7 | 27.8 | 0.179 | 0.192 | 0.210 | 0.203 |
qm9_gap | 65.5 | 65.1 | 0.044 | 0.056 | 0.112 | 0.110 |
qm9_r2 | 20.5 | 23.1 | 0.150 | 0.198 | 0.410 | 0.424 |
qm9_cv | 16.4 | 21.2 | 0.198 | 0.174 | 0.402 | 0.408 |
This study opens several avenues for future research. Key areas for further investigation include exploring algorithm design choices not covered here, such as feature and label transformations, batch-selection schemes, and batch sizes. Additionally, applying the current findings to materials design campaigns that involve simultaneous optimization and classification, as discussed by Hickman et al.,24 could be valuable. Beyond algorithm design, incorporating domain knowledge could enhance data efficiency. Utilizing pre-trained models, incorporating priors from foundation models, and applying physical constraints on model predictions may offer significant improvements in data efficiency compared to changes in sampler or model. Specifically, constructing pre-trained material representations optimized for metafeatures predictive of algorithm performance, like the noise-to-signal ratio, could boost data efficiency across materials domains. This approach could be beneficial for both classification and regression tasks.64 Finally, establishing a unified set of classification tasks for testing would strengthen the generalizability of the findings here and for future studies.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00298a