Vahid Attari* and
Raymundo Arroyave
Department of Materials Science & Engineering, Texas A&M University, College Station, 77840, TX, USA. E-mail: attari.v@tamu.edu
First published on 1st August 2025
Materials datasets, particularly those capturing high-temperature properties, pose significant challenges for learning tasks due to their skewed distributions, wide feature ranges, and multimodal behaviors. While tree-based models like XGBoost are inherently non-linear and often perform well on many tabular problems, their reliance on piecewise constant splits can limit effectiveness when modeling smooth, long-tailed, or higher-order relationships prevalent in advanced materials data. To address these challenges, we investigate the effectiveness of encoder–decoder models for data transformation built on regularized Fully Dense Networks (FDN-R), Disjunctive Normal Form Networks (DNF-Net), 1D Convolutional Neural Networks (CNNs), and Variational Autoencoders, along with TabNet, a hybrid attention-based model. Our results indicate that while XGBoost remains competitive on simpler tasks, encoder–decoder models, particularly those based on regularized FDN-R and DNF-Net, demonstrate better generalization on highly skewed targets such as creep resistance across small, medium, and large datasets. TabNet's attention mechanism offers moderate gains but underperforms on extreme values. These findings emphasize the importance of aligning model architecture with feature complexity and demonstrate the promise of hybrid encoder–decoder models for robust and generalizable materials property prediction from composition data.
Deep learning has shown remarkable promise across various domains, including materials science, where it has been successfully applied to predict mechanical properties of alloys,6,8,9 discover new thermoelectric materials,7 and identify phase transitions in complex multicomponent systems.10 However, while these successes highlight its potential, applying deep learning to tabular data, which is prevalent in materials science, presents unique challenges. Materials science data often spans multiple orders of magnitude, reflecting the diversity of material properties and phenomena. For example, mechanical properties like yield strength range from tens of MPa for polymers to thousands of MPa for metals and ceramics, while electrical conductivity varies from as low as 10⁻¹⁶ S m⁻¹ in insulators to 10⁷ S m⁻¹ for conductors like copper. Similarly, thermal conductivity can range from less than 0.1 W m⁻¹ K⁻¹ in insulators to over 1000 W m⁻¹ K⁻¹ in materials like diamond, and diffusion coefficients vary from 10⁻²⁵ m² s⁻¹ in solids at low temperatures to 10⁻⁸ m² s⁻¹ in liquids. Even creep behavior spans orders of magnitude under high-temperature conditions, influenced by factors such as stress, temperature, and microstructure.
Data scarcity, challenges in data preparation, predictive accuracy, and interpretability are all vital considerations in materials science applications to ensure that models provide actionable insights for experimental validation. To tackle data scarcity and confidentiality concerns, generative approaches like variational autoencoders (e.g. Tab-VAE11) or Tabular Generative Adversarial Networks12 can be employed to create synthetic datasets, enhancing model robustness. Tackling data preparation for largely skewed features can be achieved using robust transformations, such as quantile transformation or log-scaling, to ensure features are more suitable for model training.13,14 When paired with interpretable architectures15 and efficient inference strategies,16 these methods can advance the use of tabular materials data in scientific discovery and design. Recent innovations, such as transformers and hybrid models combining tree-based methods with neural networks, show promise in overcoming these challenges. Transformers leverage self-attention mechanisms to capture complex feature interactions, offering an advantage for the heterogeneous and multiscale nature of materials data. However, inconsistencies in benchmarking practices and unequal levels of optimization hinder direct comparisons with traditional methods like gradient-boosted decision trees (GBDT).17–21 To fully realize their potential, deep learning models in materials science must balance predictive accuracy with alignment to domain-specific physical principles and interpretability. Achieving this will require improved benchmarking frameworks, access to diverse and high-quality datasets, and the integration of domain knowledge into model development. These advancements could close the performance gap with traditional methods and enable deep learning to address the unique demands of materials science and unlock its potential for solving complex problems.
Several emerging neural architectures, inspired by concepts from the encoder–decoder and other hybrid methods, have shown promise in addressing the limitations of traditional models, particularly in handling tabular data workflows. For example, TabNet enhances interpretability and performance by dynamically selecting relevant features through attention mechanisms.22 Neural Oblivious Decision Ensembles (NODE) combine decision rules with neural networks to efficiently model feature interactions.17 FT-Transformers and TabTransformer employ self-attention to capture complex relationships between numerical and categorical features.23,24 Hybrid approaches like DeepGBM integrate gradient-boosting machines with neural networks to combine feature transformation and prediction.25 These advancements, along with models like Wide and Deep Networks26 and DNF-Net (Disjunctive Normal Form Networks),27 improve scalability, interpretability, and predictive performance, positioning them as robust alternatives to traditional methods.
In this work, we evaluate the potential of several deep learning architectures to outperform traditional GBDTs in accuracy and efficiency when applied to materials science tabular datasets. Our findings demonstrate that neural architectures tailored for tabular data can effectively handle the complexity and wide range of property scales often encountered in materials science. Beyond static datasets, these models exhibit promise in processing streams of data generated from high-throughput experiments or autonomous systems, making them particularly relevant for real-time analysis. Additionally, their ability to capture intricate dependencies and handle complex, multimodal distributions positions them as critical tools for addressing the inherent variability in materials data. These advancements also align with the principles of FAIR (Findable, Accessible, Interoperable, and Reusable) data, ensuring that the insights derived from these models can be leveraged across diverse research and industrial contexts. In Section 2, we detail the architectures and approaches used in this study, while Section 4 compares the performance of deep learning models and GBDTs, highlighting their respective strengths in handling complex materials datasets.
Z = fθ(X),  Ŷ = gϕ(Z)    (1)
θ*, ϕ* = arg minθ,ϕ (1/N) ∑i L(Yi, gϕ(fθ(Xi)))    (2)
Training is typically performed using gradient-based methods. After training, the model can be used to transform new data Xnew into predictions Ŷnew:
Ŷnew = gϕ(fθ(Xnew))    (3)
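For concreteness, the block below gives a minimal sketch of such a baseline encoder–decoder regressor for tabular inputs, written in PyTorch purely for illustration; the layer widths, activations, and training settings are placeholders rather than the tuned configurations reported later in this work.

```python
import torch
import torch.nn as nn

class EncoderDecoderRegressor(nn.Module):
    """Illustrative encoder-decoder for tabular regression: X -> Z -> Y_hat."""
    def __init__(self, n_features: int, latent_dim: int = 64, n_targets: int = 1):
        super().__init__()
        # Encoder f_theta: maps raw features to a latent representation Z
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, latent_dim), nn.ReLU(),
        )
        # Decoder g_phi: maps the latent representation to the predicted property
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, n_targets),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)      # Z = f_theta(X)
        return self.decoder(z)   # Y_hat = g_phi(Z)

# Gradient-based training on (X, Y) pairs, then prediction on new compositions
model = EncoderDecoderRegressor(n_features=5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
X, Y = torch.randn(256, 5), torch.randn(256, 1)   # stand-in data
for _ in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(X), Y)
    loss.backward()
    optimizer.step()
Y_new = model(torch.randn(8, 5))                  # Y_hat_new = g_phi(f_theta(X_new))
```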
This baseline encoder–decoder approach is not inherently interpretable, primarily due to the lack of transparency in its latent representations. However, with attention mechanisms, feature analysis tools, and visualizations, it can become explainable, providing some insights into its decision-making process. Next, we will discuss three neural network architectures: regularized fully connected dense networks, Disjunctive Normal Form Networks (DNF-Nets), and Convolutional Neural Networks (CNN). These architectures, when combined with encoder–decoder frameworks, can enhance prediction accuracy by leveraging latent space representations and reducing noise. Interpretability is further improved through feature selection in DNF-Nets and filter visualization in 1D-CNNs. Interpretability ensures trust and accountability by enabling the understanding of model predictions, feature importance, and decision-making processes, which is particularly crucial for debugging models, achieving regulatory compliance, and gaining user acceptance.28,29
yi = f(Wix + bi)    (4)
zi = μi + σi ⊙ εi,  εi ∼ N(0, I)    (5)
ŷi = fdecoder(Wizi + bi)    (6)
The objective of the variational encoder–decoder is to maximize the evidence lower bound (ELBO), which is equivalent to minimizing a loss with two terms: the reconstruction loss (e.g., the mean squared error between the input x and the reconstruction ŷ) and the Kullback–Leibler (KL) divergence between the learned latent distribution and a standard normal distribution. The full loss function can be written as:
L(θ, ϕ) = ‖x − ŷ‖² + DKL(N(μ, σ²) ‖ N(0, I))    (7)
The parameter set θ in a variational encoder–decoder block includes the weights and biases of the encoder and decoder, the mean and variance parameters μi and σi, and any regularization terms used to control overfitting.
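A minimal sketch of this loss and of the reparameterized latent sampling is given below, assuming a diagonal Gaussian encoder and an unweighted KL term; it is an illustration rather than the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var):
    """Reconstruction (MSE) plus KL divergence to a standard normal prior."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

def reparameterize(mu, log_var):
    """z = mu + sigma * eps, with eps ~ N(0, I), keeping sampling differentiable."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```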
We performed hundreds of optimization trials per dataset, tuning between 5 and 10 hyperparameters depending on the architecture. These parameters and their range are listed in Table S4. Each trial was evaluated over a fixed training duration of 50 epochs (or number of boosting rounds for XGBoost), rather than using early stopping, to ensure fair comparison of convergence behavior. To test the significance of performance differences across models, we applied Friedman's test at a 95% confidence level.
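The sketch below illustrates such a trial loop, using Optuna as one possible optimization backend (an assumption for illustration; the original tooling is not specified here). The search ranges are placeholders rather than the ones listed in Table S4, and build_model, train, and evaluate_smape are hypothetical helpers standing in for the model construction, fixed-budget training, and validation scoring steps.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Placeholder search space; the actual ranges are listed in Table S4
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True),
        "latent_dim": trial.suggest_int("latent_dim", 16, 256),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64, 96, 128]),
        "optimizer": trial.suggest_categorical("optimizer", ["adam", "rmsprop"]),
    }
    model = build_model(params)                # hypothetical model factory
    train(model, params, epochs=50)            # fixed budget, no early stopping
    return evaluate_smape(model)               # hypothetical validation SMAPE to minimize

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=300)        # "hundreds" of trials per dataset
print(study.best_params, study.best_value)
```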
Prior to optimization, we quantified first- and higher-order statistics of each feature, such as mean, standard deviation, skewness (asymmetry), and kurtosis (tailedness). These moments informed our expectations for model learning challenges, such as handling heavy-tailed or skewed distributions.
To evaluate generalization, we applied a standard random split strategy, typically reserving 10–15% of the data as a hold-out test set, with the remaining data used for training and internal validation during hyperparameter optimization. In cases where the dataset was small, the validation and test sets occasionally overlapped, serving as a pragmatic compromise to assess performance. K-fold cross-validation was subsequently applied using the best hyperparameters to rotate the test split, offering a more comprehensive assessment and reducing variance in performance estimates.
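A minimal sketch of this splitting and rotation strategy with scikit-learn is shown below; the random data and the Ridge estimator are stand-ins for the actual datasets and tuned networks.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, train_test_split

# Stand-in data; in practice X holds composition features and y a target property
X, y = np.random.rand(200, 8), np.random.rand(200)

# Hold-out split first (roughly 10-15% of the data reserved for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

# K-fold rotation of the validation split using the best hyperparameters found earlier
scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    model = Ridge(alpha=1.0)  # placeholder estimator standing in for the tuned network
    model.fit(X_train[train_idx], y_train[train_idx])
    scores.append(model.score(X_train[val_idx], y_train[val_idx]))
print(f"CV R2: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```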
We note that while SMAPE is valuable for normalizing errors relative to the combined magnitude of predictions and true values, it introduces slight asymmetry that can favor over-predictions. To complement accuracy-focused metrics, we further assessed optimization efficiency and stability through average convergence rates, calculated as the mean relative improvement in loss across successive trials, highlighting how quickly each model approached its best performance. Table 1 summarizes the formal definitions of these metrics.
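For reference, the block below sketches minimal implementations of these two quantities, assuming the SMAPE convention with the sum of absolute values in the denominator; the formal definitions used in this study are those given in Table 1.

```python
import numpy as np

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Symmetric mean absolute percentage error, in percent."""
    denom = np.abs(y_true) + np.abs(y_pred)
    # Guard against zero denominators when both target and prediction are zero
    ratio = np.where(denom == 0, 0.0, np.abs(y_pred - y_true) / denom)
    return 100.0 * np.mean(ratio)

def mean_convergence_rate(trial_losses) -> float:
    """Average relative change in loss between successive optimization trials."""
    L = np.asarray(trial_losses, dtype=float)
    return float(np.mean((L[1:] - L[:-1]) / L[:-1]))
```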
By comparing hyperparameter importance profiles across models and datasets, we could directly connect these insights back to our earlier statistical characterization (skewness, kurtosis, and combined complexity measure |skew| + |kurt − 3|), illustrating how data distribution properties impacted the relative tuning needs of different architectures.
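The snippet below sketches how these per-feature moments and the combined complexity measure can be computed with pandas; the column names are illustrative stand-ins for the dataset features.

```python
import pandas as pd

def feature_complexity(df: pd.DataFrame) -> pd.DataFrame:
    """Skewness, kurtosis, and the combined measure |skew| + |kurt - 3| per column.

    pandas reports excess kurtosis (normal = 0), so 3 is added back to obtain
    the raw kurtosis used in the tables before applying |kurt - 3|.
    """
    stats = pd.DataFrame({
        "skewness": df.skew(),
        "kurtosis": df.kurt() + 3.0,
    })
    stats["complexity"] = stats["skewness"].abs() + (stats["kurtosis"] - 3.0).abs()
    return stats.sort_values("complexity", ascending=False)

# Example with stand-in columns; in practice df holds the dataset features and targets
df = pd.DataFrame({"YS_1000C": [1100.0, 950.0, 1600.0, 700.0],
                   "creep_CB": [0.0, 0.0, 0.01, 4.0]})
print(feature_complexity(df))
```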
Table 2 summarizes key first-order statistics, highlighting skewness and kurtosis as important factors influencing model performance. For instance, the 1300 °C creep feature (1300 min creep CB) in the ATLAS-RHEA dataset exhibits extreme skewness (39) and kurtosis (1500), reflecting a highly non-normal distribution with substantial outliers. These properties underscore the necessity for targeted pre-processing techniques, such as log or quantile transformations. Additionally, robust models capable of managing extreme outliers may be required. In contrast, features with lower skewness and kurtosis, such as YS 1000 °C or the Pugh ratio, display distributions closer to normality, making them suitable for direct use in training without extensive preprocessing. These features are less likely to cause instability in the trained models and typically contribute to more reliable and consistent predictions. Features with moderate skewness and kurtosis, such as Scheil LT and the Kou criteria, suggest some degree of asymmetry and outliers, which could still impact model performance if not accounted for during pre-processing.
Feature | Mean | Std. dev. | Min | Max | Median | 25% | 75% | Skewness | Kurtosis | Complexity |
---|---|---|---|---|---|---|---|---|---|---|
Nb | 0.22 | 0.19 | 0.00 | 0.95 | 0.20 | 0.05 | 0.35 | 0.909 | 3.289 | 1.20 |
Cr | 0.21 | 0.18 | 0.00 | 0.95 | 0.15 | 0.05 | 0.30 | 0.955 | 3.458 | 1.41 |
V | 0.22 | 0.19 | 0.00 | 0.95 | 0.20 | 0.05 | 0.35 | 0.940 | 3.400 | 1.34 |
W | 0.14 | 0.11 | 0.00 | 0.90 | 0.10 | 0.05 | 0.20 | 0.699 | 3.143 | 0.84 |
Zr | 0.21 | 0.18 | 0.00 | 0.90 | 0.15 | 0.05 | 0.30 | 0.915 | 3.300 | 1.21 |
YS 1000 °C | 1170.58 | 637.06 | 0.00 | 3399.29 | 1111.17 | 673.36 | 1599.91 | 0.478 | 2.742 | 0.74 |
EQ 1273 K THCD (W m⁻¹ K⁻¹) | 14.11 | 12.95 | 0.02 | 62.92 | 9.27 | 3.43 | 22.70 | 0.974 | 3.009 | 0.98
EQ 1273 K density (g cm⁻³) | 9.59 | 1.77 | 6.05 | 16.94 | 9.33 | 8.29 | 10.65 | 0.773 | 3.563 | 1.34
1300 min creep CB (1 s⁻¹) | 0.00 | 0.10 | 0.00 | 4.01 | 0.00 | 0.00 | 0.00 | 39.012 | 1524.244 | 1560.26
PROP 1500 °C CTE (1 K⁻¹) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.124 | 3.353 | 1.48
YS 1500 °C PRIOR | 807.87 | 497.41 | 0.00 | 2689.29 | 738.53 | 415.22 | 1127.56 | 0.656 | 2.995 | 0.66 |
Pugh ratio PRIOR | 2.60 | 0.51 | 1.45 | 4.52 | 2.56 | 2.23 | 2.92 | 0.390 | 2.914 | 0.48 |
SCHEIL LT | 1620.39 | 163.87 | 1507.01 | 2353.37 | 1541.34 | 1529.62 | 1599.97 | 1.954 | 5.858 | 4.81 |
Kou criteria | 353.63 | 402.90 | 0.13 | 4656.12 | 240.34 | 62.72 | 463.10 | 2.184 | 9.764 | 8.95 |
Creep merit | ≈7.7 × 10⁴ | ≈4.3 × 10⁵ | −825.33 | ≈1 × 10⁷ | 507.71 | 40.83 | 6176.66 | 11.793 | 199.473 | 208.27
The BIRDSHOT dataset40 currently contains detailed information on 147 non-equimolar Cantor high-entropy alloys (HEAs), focusing on their composition, processing parameters, and mechanical properties. Each alloy is characterized by its elemental fractions (Al, Co, Cr, Cu, Fe, Mn, Ni, V) and evaluated for key properties such as YS, ultimate tensile strength (UTS), tension elongation, and hardness. The dataset also includes computationally derived parameters such as stacking fault energy, valence electron concentration, and the Pugh ratio, providing insights into the alloys' mechanical behavior and stability. For example, the yield strength of the alloys ranges from 310 MPa to 537 MPa, while tension elongation varies from 18.3% to 25.7%. First-order statistics for this dataset are summarized in Table 3.
Feature | Mean | Std. dev. | Min | Max | Median | 25% | 75% | Skewness | Kurtosis | Complexity |
---|---|---|---|---|---|---|---|---|---|---|
Al | 1.68 | 2.34 | 0.00 | 15.00 | 0.00 | 0.00 | 4.00 | 1.564 | 8.058 | 6.62 |
Co | 18.57 | 14.24 | 0.00 | 75.00 | 16.00 | 8.00 | 25.00 | 1.014 | 4.235 | 2.25 |
Cr | 7.84 | 6.32 | 0.00 | 25.00 | 8.00 | 4.00 | 10.00 | 0.875 | 3.225 | 1.10 |
Cu | 1.63 | 3.88 | 0.00 | 24.00 | 0.00 | 0.00 | 2.00 | 3.531 | 16.961 | 17.49 |
Fe | 15.07 | 11.89 | 0.00 | 75.00 | 15.00 | 5.00 | 20.00 | 1.686 | 7.613 | 6.30 |
Mn | 5.33 | 8.10 | 0.00 | 40.00 | 0.00 | 0.00 | 8.00 | 1.602 | 5.256 | 3.86 |
Ni | 39.28 | 14.23 | 0.00 | 75.00 | 40.00 | 31.00 | 50.00 | −0.521 | 3.051 | 0.57 |
V | 10.59 | 8.17 | 0.00 | 30.00 | 10.00 | 4.00 | 15.50 | 0.545 | 2.252 | 1.29 |
Yield strength (MPa) | 388.41 | 125.27 | 176.82 | 790.00 | 367.00 | 282.19 | 460.50 | 0.630 | 2.802 | 0.83 |
UTS true (MPa) | 952.40 | 239.12 | 333.00 | 1581.00 | 958.00 | 802.50 | 1096.62 | 0.012 | 2.867 | 0.15 |
Elongation (%) | 33.65 | 12.12 | 0.00 | 55.70 | 34.00 | 26.10 | 42.25 | −0.627 | 3.307 | 0.93 |
Hardness (GPa) SRJT | 2.52 | 0.67 | 1.63 | 6.04 | 2.39 | 2.09 | 2.77 | 2.516 | 12.689 | 12.21 |
Modulus (GPa) SRJT | 196.84 | 22.44 | 141.59 | 263.51 | 193.38 | 180.73 | 212.70 | 0.216 | 2.798 | 0.42 |
Avg HDYN/HQS | 1.15 | 0.04 | 1.06 | 1.25 | 1.15 | 1.13 | 1.17 | −0.107 | 3.008 | 0.11 |
Depth of penetration (mm) FE | 2.86 | 0.27 | 2.22 | 3.46 | 2.87 | 2.66 | 3.05 | −0.140 | 2.429 | 0.71 |
Despite its comprehensive nature, the dataset includes minor missing data across key properties. Measured properties like YS and UTS have 141 complete entries (∼96%), while computed parameters such as stacking fault energy, valence electron concentration, and hardness are available for 131 samples (∼89%). While the dataset remains highly usable, handling these missing values through imputation or focusing on well-reported features will be critical for accurate analysis and predictive modeling. The dataset contains 22 features, capturing both experimental and computational results, and is suitable for developing predictive models, investigating structure–property relationships, and advancing the discovery of novel high-performance alloys.
The Multi-Principal Element Alloy (MPEA) dataset41 contains information on 1545 alloys across 23 columns, focusing on chemical composition, microstructure, processing methods, and mechanical properties. Key attributes include alloy formula, microstructure classification (e.g., FCC, BCC, or mixed phases), grain size, processing methods (e.g., casting), and mechanical properties such as hardness (HV), YS, UTS, elongation, and Young's modulus. The dataset also includes test conditions (e.g., temperature, test type), computationally derived parameters (e.g., calculated density and Young's modulus), and bibliographic details such as DOI references and publication metadata. First-order statistics for this dataset are summarized in Table S3.
The dataset exhibits significant variability in data completeness, with properties like grain size (∼15%), hardness (∼34%), and elongation (∼40%) moderately reported, while others, such as experimental density (∼7%) and Young's modulus (∼9%), are highly sparse. Elemental contamination data (e.g., oxygen, nitrogen, and carbon content) are particularly limited, with only 57, 45, and 4 entries, respectively. These data gaps may introduce biases and limit usability for certain analyses. Well-reported features like yield strength (∼69%) and processing methods can be prioritized for modeling, while missing values in moderately sparse features may be addressed using imputation. Highly sparse features, such as carbon content, should be treated cautiously or excluded from predictive tasks to ensure result integrity.
Table 4 highlights the tailored hyperparameter configurations across models. To reduce the hyperparameter search space and streamline optimization, the number of layers in both the encoder and decoder is initially fixed. This strategy is effective when a baseline architecture demonstrates satisfactory performance. As model development advances, relaxing this constraint allows the architecture to adapt to data-specific complexity and potentially enhance generalization. This flexibility becomes particularly important when optimizing models for datasets with varying feature dimensions and complexity, such as BIRDSHOT and MPEA. As Table 4 shows, the FDN-R and DNF models employ smaller learning rates (∼10⁻⁴) than the higher rates used by TabNet and XGBoost (∼10⁻²). Batch sizes also vary, with DNF using smaller batches (32), while the VAE benefits from larger batches (128). Regularization terms (λ) are applied in FDN-R, DNF, and VAE to prevent overfitting. Adam is the preferred optimizer for most models, except for TabNet, which benefited from RMSprop. Additionally, architecture-specific parameters such as the number of conjunction units (112) are critical for DNF performance.
Study name | Layers | Drop-out rate | Latent dim | Learning rate | Optimizer | Batch size | Epochs | Additional hyperparameters
---|---|---|---|---|---|---|---|---
Regularized FDN-R | Fixed | 192 | 0.1 | 1.26 × 10⁻⁴ | adam | 96 | 50 | λ = 3.69 × 10⁻⁶, α = 0.0164
DNF | Fixed | 64 | 0.1 | 5.39 × 10⁻⁴ | adam | 32 | 50 | n_conj = 10, conj_units = 112
CNN | Fixed | 64 | 0.1 | 3.77 × 10⁻⁴ | adam | 32 | 50 | Kernel size = 4
TabNet | — | — | — | 4.24 × 10⁻³ | RMSprop | 72 | 50 | nd = 24, na = 32, λ = 7.98 × 10⁻⁴
VAE | Fixed | 16 | 0.2 | 3.58 × 10⁻⁴ | adam | 128 | 50 | λ = 2.87 × 10⁻⁵, α = 0.045
XGBoost | — | — | — | 9.58 × 10⁻² | — | — | — | n_estimators = 50, max depth = 10
Fig. 2b illustrates the target-wise hyperparameter optimization history for the BIRDSHOT HEA dataset using the FDN-R model. Rather than optimizing across multiple models, the focus here is on individual target properties, such as yield strength, ultimate tensile strength, elastic modulus, hardness, elongation, and the HDYN/HQS ratio, each treated as a distinct regression task. The objective value, SMAPE, exhibits different convergence behaviors depending on the target, highlighting the varying difficulty and noise levels associated with predicting each property. This target-based optimization approach is more appropriate in this context because it allows the FDN-R architecture and learning parameters to be fine-tuned to the specific statistical characteristics of each property, thereby maximizing predictive performance given the small size of this dataset and ensuring model generalization across diverse features.
Fig. 3 shows the hyperparameter importance for all investigated models. The hyperparameter importance quantifies the average influence of each hyperparameter on the optimization objective (loss), estimating how much the loss increases when that hyperparameter is varied across its sampled range while others are held near optimal. Each subplot shows the relative importance of various hyperparameters in determining model performance, providing insights into which parameters have the greatest impact on the objective value during optimization. For example, the DNF and FDN-R models (Fig. 3a and b) highlight distinct hyperparameters as the most influential, whereas the CNN and VAE models (Fig. 3c and d) demonstrate different sensitivity patterns. The variation in hyperparameter importance across models underscores the unique tuning needs of each architecture.
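If a framework such as Optuna is used for the search (an assumption here, not a statement about the original tooling), these importances can be extracted directly from a completed study, as sketched below for the study object from the earlier tuning sketch.

```python
import optuna

# `study` is assumed to be a completed optuna.Study, e.g. from the tuning loop above.
# get_param_importances returns a mapping from each hyperparameter to its estimated
# contribution to the variation of the objective value across trials.
importances = optuna.importance.get_param_importances(study)
for name, score in importances.items():
    print(f"{name:>20s}: {score:.3f}")
```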
Among the hyperparameters tuned, the optimizer choice often emerged as particularly influential, likely due to its direct role in controlling how model weights are updated during training. Optimizers like Adam and RMSprop can significantly influence convergence speed, training stability, and the quality of the final solution by adapting learning rates and managing gradients effectively. This adaptability often makes the optimizer more impactful than other hyperparameters like dropout rate or batch size, which primarily regulate model complexity or regularize training without fundamentally altering the learning process.
For the FDN-R architecture, the choice of optimizer (Adam, SGD, Adadelta) has the most significant impact on model performance, contributing 100% of the importance toward the objective value's reduction. If we fix the optimizer to Adam, the most important hyperparameters are the latent dimension (contributing 75%), the learning rate (11%), and the negative slope of the activation function (6%), with the remaining parameters showing minor importance. The FDN-R model shows consistent improvement across multiple runs with the best hyperparameters, with progressively lower validation losses. Friedman's test reveals a significant difference in performance between the runs, with a test statistic of 191.176 and a p-value of 1.11 × 10⁻¹⁸, indicating that the observed performance variations are statistically meaningful. This suggests that factors such as random initialization or data splits may influence the model's performance, warranting further investigation to enhance consistency. For the VAE, by contrast, Friedman's test yields a test statistic of 28.70 and a p-value of 0.052, indicating no statistically significant difference in performance across multiple training runs at the 95% confidence level. Although the p-value is close to the 0.05 threshold, it is slightly higher, so the null hypothesis of no significant difference cannot be rejected; the variations observed in validation loss across runs are likely due to random chance rather than meaningful differences in model performance.
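A minimal SciPy sketch of this test is shown below; the arrays of validation losses are stand-ins for the repeated runs compared in this study.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Stand-in validation losses: each array is one repeated run, entries are matched
# evaluation blocks (e.g. folds) across runs
run_a = np.array([0.021, 0.034, 0.028, 0.040, 0.025])
run_b = np.array([0.019, 0.031, 0.027, 0.038, 0.024])
run_c = np.array([0.023, 0.036, 0.030, 0.041, 0.027])

stat, p_value = friedmanchisquare(run_a, run_b, run_c)
print(f"Friedman statistic = {stat:.3f}, p-value = {p_value:.3g}")
# p < 0.05 -> reject the null hypothesis of identical performance across runs
```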
The enclosed table in Fig. 3 offers key insights into model performance, highlighting variations in optimization efficiency, test R2 values, and average convergence rates. We define the convergence rate as the average relative improvement in the objective value across successive trials, calculated as (1/(N − 1)) ∑i (Li+1 − Li)/Li, where Li denotes the loss at trial i and N the number of trials. A larger magnitude of this metric (appearing as more negative in our plots, reflecting sharper decreases) indicates faster reductions in loss, typically when the hyperparameter search discovers narrow optimal regions. While such behavior accelerates convergence, it may also imply increased sensitivity to specific hyperparameter settings, which could impact both the robustness of optimization and the reproducibility of scientific results. In many high-throughput or industrial settings, consistent and stable model behavior, without extensive hyperparameter retuning, is essential to maintain both operational efficiency and scientific precision. Models that depend heavily on finely tuned parameters, reflected by large convergence jumps, may therefore be less suitable for rapid deployment pipelines.
Among the models evaluated, XGBoost emerges as particularly well-suited for high-throughput or industrial applications. It achieves a competitive loss value (0.00327) and solid test set R2, while demonstrating the shortest trial duration (0.63 seconds) and a moderate convergence rate (−75.2%) that suggests reliable optimization without reliance on extremely narrow hyperparameter configurations. In contrast, although the DNF model attains the lowest loss (0.00316) and highest test R2 (0.962), its steep convergence trajectory (−232.7%) and greater sensitivity to hyperparameter selection could complicate use in workflows that demand reproducibility and minimal tuning. TabNet and VAE, while offering valuable features such as interpretability or probabilistic flexibility, are characterized by longer trial durations and less stable convergence behavior, which may limit their practicality in time-sensitive applications. These insights help guide model selection by emphasizing the trade-offs between predictive performance, optimization stability, computational demands, and the need for scientific accuracy across repeated experiments.
These trade-offs are reflected in the per-model metrics. TabNet, while competitive in loss value (0.003586), shows a substantially longer trial duration (152.17 seconds) and a convergence rate of −95.0%, suggesting that its attention mechanism, although valuable for interpretability, adds computational overhead. The VAE, while achieving reasonable test R2 scores, faces challenges with prolonged trial durations and less consistent convergence behavior, indicating inefficiencies in both computational cost and optimization robustness. Overall, DNF and XGBoost are particularly promising for applications requiring a balance of predictive accuracy and efficiency, whereas models such as VAE and TabNet, with their longer runtimes and sensitivity to optimization, may be better suited for exploratory or interpretability-focused studies.
The main observation in Fig. 4 is that as both model complexity and feature complexity increase, predictive performance varies significantly across models. While XGBoost (a) performs well for relatively simple features such as yield strength and density, its performance deteriorates for the most complex feature, 1300 Min Creep CB, indicating limited generalization to highly skewed or outlier-heavy data. In contrast, the encoder–decoder model based on regularized FDN-R architecture maintains strong performance across all features and demonstrates superior generalization on the most complex output, achieving lower error metrics and higher alignment with the parity line. The 1D CNN, despite being the most complex model, fails to consistently improve performance and struggles with both simple and complex features, suggesting that higher model complexity does not always translate to better generalization, especially for skewed or outlier-prone data. This highlights the importance of appropriately balancing model complexity with feature distribution characteristics.
For features like YS 1000 °C and density, both exhibiting low feature complexity, models such as XGBoost, regularized FDN-R, and DNF-net show strong predictive performance with low error metrics and high logR2 values. However, as feature complexity increases with Kou Criteria and 1300 Min Creep CB, model behavior diverges. Among all models, the regularized FDN-R shows the highest robustness and generalization to this severe skewness, achieving an R2 of 0.856 and SMAPE of 56.8% on the Creep feature, outperforming all other methods including XGBoost and VAE. This indicates that the regularized FDN-R's architectural simplicity combined with its ability to regularize through sparsity allows it to handle extreme data distributions effectively, even outperforming models that typically exhibit higher expressiveness.
While models like the 1D CNN begin to struggle with Kou Criteria and Creep due to the presence of extreme values and long tails in the distribution, the regularized FDN-R and DNF-net encoder–decoder models retain relatively stable performance across all feature complexities. The XGBoost and VAE models also show notable resilience across the board, particularly on skewed features, though they slightly underperform the regularized FDN-R for Creep. In contrast, the 1D CNN model exhibits sharp performance degradation on highly skewed features, highlighting the limitations of more complex or generative architectures in handling rare or extreme data cases without additional regularization or specialized training strategies.
Skewed distributions, especially those with long tails, pose significant challenges for regression models. They can violate assumptions such as residual normality, introduce leverage points, and cause heteroscedasticity, ultimately degrading model accuracy. These effects underscore the importance of data transformations and robust model selection tailored to feature distribution properties.
Table 5 summarizes the performance of various models in predicting the features analyzed in Fig. 4, with the addition of thermal conductivity (THCD) at 1273 K, a critical property often used as a target or constraint in materials optimization. Results are presented as Mean Squared Error (MSE) alongside R-squared (R2) values, reflecting each model's capacity to capture variance across features. XGBoost demonstrates strong overall performance, particularly for less complex features, but its accuracy declines for highly complex properties such as creep. Encoder–decoder models like FDN-R and DNF-net show comparable or superior performance on certain features, especially under higher complexity, while more expressive models such as 1D CNN exhibit inconsistent results, often struggling with extreme values. Notably, the VAE maintains stable predictive performance across increasing feature complexity, as evidenced in Fig. S4 by relatively aligned parity plots even for challenging targets like creep. This suggests that incorporating latent probabilistic representations can help the model capture distributions with heavy tails or multi-modality. Overall, while XGBoost is generally robust, the regularized FDN-R and the VAE show better adaptability to complex features, highlighting the importance of balancing model architecture with feature distribution characteristics.
Model/feature | YS 1000 °C | 1273 K density | 1273 K THCD | Kou criteria | 1300 °C creep CB
---|---|---|---|---|---
Skewness | 0.48 | 0.77 | 0.97 | 2.2 | 39
Complexity | 0.74 | 1.34 | 0.98 | 8.95 | 1560
XGBoost | 1613 (R2: 0.995) | 0.141 (R2: 0.920) | 17.70 (R2: 0.936) | 2.61 × 10⁴ (R2: 0.854) | 0.111 (R2: −3.06 × 10²¹)
Encoder–decoder (DNF-net) | 8949 (R2: 0.976) | 0.2755 (R2: 0.902) | 15.43 (R2: 0.913) | 2.59 × 10⁴ (R2: 0.855) | 1.19 × 10⁻²³ (R2: 0.671)
Encoder–decoder (FDN-R) | 5443 (R2: 0.985) | 0.2106 (R2: 0.925) | 14.56 (R2: 0.918) | 1.93 × 10⁴ (R2: 0.892) | 8.32 × 10⁻²⁴ (R2: 0.771)
TabNet | 1210 (R2: 0.969) | 0.2501 (R2: 0.919) | 13.73 (R2: 0.917) | 1.36 × 10⁵ (R2: 0.161) | 2.58 × 10⁻²³ (R2: 0.403)
Encoder–decoder (1D CNN) | 4348 (R2: 0.881) | 0.4662 (R2: 0.833) | 22.76 (R2: 0.872) | 4.48 × 10⁵ (R2: −1.5) | 2.91 × 10⁻²³ (R2: 0.201)
Encoder–decoder (VAE) | 1115 (R2: 0.969) | 0.2482 (R2: 0.911) | 19.53 (R2: 0.889) | 1.64 × 10⁴ (R2: 0.908) | 1.14 × 10⁻²³ (R2: 0.683)
Applying quantile transformations alone (Fig. 5b and d) improves training accuracy but fails to preserve this performance when outputs are inverse-transformed, resulting in a noticeable gap between predictions and the ground truth on the original scale, although the uniform quantile transformation still greatly improves prediction accuracy. In contrast, combining the normal quantile transformation with MinMax scaling (Fig. 5c) enhances model generalization across the output range, performing comparably to the uniform quantile transformation alone. This two-step normalization enables better distribution matching and numerical stability during training, which translates into more accurate predictions after inversion.
Among the methods tested, MinMax scaling alone (Fig. 5e) yields the highest R2 and lowest MSE on the original scale, indicating that for non-complex (moderately skewed or bounded) targets with narrow range, simpler transformations may suffice. The results suggest that normalization prior to training, especially when followed by an appropriate inverse transformation, plays an important role in ensuring the FDN-R model's robustness and generalization. MinMax and quantile-based strategies stand out as the most effective preprocessing methods, providing stable learning dynamics and superior performance in extrapolating complex material properties.
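A minimal scikit-learn sketch of the two-step target transformation (normal quantile transform followed by MinMax scaling) and its automatic inversion at prediction time is shown below; the data and the Ridge estimator are placeholders standing in for the skewed targets and the FDN-R model used in this work.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer
from sklearn.compose import TransformedTargetRegressor

# Stand-in data with a heavy right tail, loosely mimicking a skewed target such as creep
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = np.exp(5 * X[:, 0]) + rng.normal(0, 0.1, 500)

# Two-step target transformation: map to a normal distribution, then squash to [0, 1]
target_pipeline = Pipeline([
    ("quantile", QuantileTransformer(output_distribution="normal", n_quantiles=200)),
    ("minmax", MinMaxScaler()),
])

# TransformedTargetRegressor fits on the transformed target and inverts predictions
model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),   # placeholder for the FDN-R encoder-decoder
    transformer=target_pipeline,
)
model.fit(X, y)
y_pred = model.predict(X[:5])     # predictions are returned on the original scale
```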
In addition to advanced pre-processing pipelines, we also explored TabNet, an attention-based model for tabular data. As shown in the second row from the bottom in Fig. S4, TabNet achieves competitive performance for moderately complex features such as density and thermal conductivity but begins to show limitations with the Creep CB, the most skewed feature. While its use of attentive feature selection allows it to prioritize important signals and maintain relatively strong predictions on moderately non-Gaussian distributions, its performance deteriorates on severely skewed data, consistent with models that lack explicit handling of tail behavior. This suggests that while TabNet's inductive bias is useful for structured data, additional architectural or training modifications are necessary for generalizing to extreme outputs like creep. Overall, the results underscore the importance of aligning model choice with feature complexity.
In the BIRDSHOT dataset, which focuses on non-equimolar Cantor HEAs with moderately skewed mechanical properties (e.g., yield strength, elongation, hardness), both XGBoost and the regularized FDN-R maintained strong predictive performance. However, the DNF-Net demonstrated superior resilience to moderate outliers and class imbalance in features such as hardness (skewness = 2.5, kurtosis = 10), outperforming XGBoost in terms of both R2 and SMAPE. For the MPEA dataset, where feature completeness and measurement consistency varied across properties, encoder–decoder models, particularly the regularized FDN-R, showed strong generalization despite the absence of hyperparameter tuning. Without task-specific optimization, the FDN-R model produced high-quality predictions for key properties such as ultimate tensile strength and calculated density, with parity plots confirming alignment between predicted and observed values.
The generalization trends indicate that model architecture must be matched not only to the statistical characteristics of the target feature but also to the broader dataset conditions such as sample size, missingness, and feature granularity. The overcomplete encoder–decoder models with Regularized FDN-R and DNF-Net architectures offer a promising balance between expressiveness and robustness, while simpler models like XGBoost remain effective baselines in small or sparse datasets.
Beyond accelerating exploratory research, the GUI serves as an effective educational platform. It enables graduate students and interdisciplinary collaborators to experiment with data transformations, model architectures, and training regimes in a transparent, hands-on manner, reinforcing foundational machine learning concepts without requiring extensive programming expertise. This lowers the barrier for adoption of advanced encoder–decoder techniques in materials science, supporting a broader community of practitioners aiming to leverage data-driven strategies.
Importantly, the software is designed to be modular and extensible, providing a foundation for future integration of uncertainty quantification, multi-objective optimization, and autonomous decision-making workflows. By openly sharing this tool, we aim to promote collaborative exploration and accelerate the application of robust machine learning approaches to the discovery and design of advanced materials systems.
Our results indicate that these neural architectures can rival and, in some cases, surpass traditional models in accuracy, particularly for complex distributions, provided that appropriate data scaling and regularization techniques are applied. Regularized FDN-R models showed robust generalization across varying feature complexities, while DNF-Nets offered a favorable trade-off between interpretability and performance. Conversely, models with higher capacity, such as CNNs, were more sensitive to outliers and skewed data, highlighting the importance of matching architecture complexity to feature distribution.
Beyond predictive tasks, encoder–decoder frameworks offer exciting potential for downstream applications in materials discovery pipelines, including real-time optimization, synthetic data generation for privacy preservation, and closed-loop predictions acting as machine learning priors in Bayesian discovery of materials. Future progress will depend on addressing challenges around convergence stability, uncertainty quantification, and computational efficiency for small datasets.
In summary, while no single model universally outperforms across all scenarios, hybrid and interpretable deep learning approaches, when thoughtfully applied, can enhance the accuracy, adaptability, and utility of machine learning in materials science, particularly for complex applications.
The Supplementary Information includes additional details on model architectures (DNF-Net, TabNet, and VAE), training procedures, hyperparameter configurations, and evaluation metrics used in this study. It also provides extended results for additional datasets (BIRDSHOT and MPEA), and supplementary figures and tables supporting the main findings. Supplementary information is available. See DOI: https://doi.org/10.1039/d5dd00166h.