Krishnaraj
Kundavu
,
Suman
Mondal
and
Amrita
Bhattacharya
*
Ab initio Computational Materials Science Laboratory, Department of Metallurgical Engineering and Materials Science, Indian Institute of Technology, Bombay, Maharashtra 400076, India. E-mail: b_amrita@iitb.ac.in
First published on 9th August 2023
Scanning the potential energy surface of a given compositional space via Ehull analysis is not sufficient to comment on thermodynamic stability, since the contribution stemming from the vibrational free energy is typically ignored in high-throughput searches of compositional spaces for stable compounds. The calculation of the vibrational free energy through first principles can be computationally very expensive owing to the complexity of the structures, which is directly proportional to the number of symmetrically non-unique terms to be evaluated for the creation of the dynamical matrix. In this work, we use machine learning (ML) to predict the free energy of a given compositional space (ternary perovskite compounds belonging to different symmetric structures) using the elemental and structural descriptors as fingerprints. The temperature dependence of the free energy is modeled using a 3rd-order polynomial fit, where the coefficients are learned and predicted using ML. Thereby, a highly accurate model is built for the zero-point energy (with a root mean square error (RMSE) of 18.9 meV per atom), which is further improved by employing a symbolic regression technique, SISSO, giving a very low RMSE of 8 meV per atom. This model, while providing a computationally inexpensive means for predicting the harmonic vibrational free energy of compounds, also provides an aid to obtain the free energy and hence assess the thermodynamic stability of a given composition at any temperature. This work also provides important insights on how the elemental and compound properties are related to the vibrational free energy and hence, may aid in its prediction.
Several groups have attempted to explore this problem.24–29 However, it is not a trivial one, owing to the huge compositional space of perovskites. Ideally, a perovskite compound (chemical formula ABX3) will have a cubic unit cell with A and B cations occupying the center and corners of the cube, respectively, and X anions occupying the edge centers to form octahedra around the B cations. However, not all ABX3 compounds can be realized in an ideal perovskite structure. Distortions are created in the octahedra depending on the ionic radii of the cations and anions.30 The realization of ABX3 compounds in the perovskite structure has been related to the ionic radii of the constituent elements. In this regard, Goldschmidt31 gave an empirical parameter, t, for realization of the ideal perovskite structure:
![]() | (1) |
Data-driven methods can be used as an alternative aid to solve similar problems and gain further insight. These methods can be applied on a data set containing ab initio results of a given physical property under consideration (called the target property) and some relatively less complex and physically meaningful properties (called descriptors). The descriptors can be used as fingerprints to explain and analyze the target property. These methods, while aiding in the discovery of new materials with improved properties, allow prediction of new trends, hidden traits, anomalies, etc. Naturally, enormous computational effort has been spent to build online repositories containing different physical properties of compounds, which may be used to build meaningful machine-learning (ML) models.32–42 Many works have been carried out in this regard. ML models are being built to predict the inter-atomic potentials to circumvent the problems associated with the use of DFT in predicting various properties of materials, which may be computationally expensive.43,44 While these methods are yet to be tested for a wide variety of compound classes, researchers are using ML to predict the stability of compounds. For instance, tolerance factors, ionic radii, bond distances, etc., have been used to build an ML model to categorize perovskites from nonperovskites.45,46 ML models have also been built to predict the crystal structure based on the tolerance factor.24 Structure maps have been developed based on the bond lengths of A and B cations with X anions, and formability has been predicted using the tolerance factor.28 Bartel et al.47 defined a new tolerance factor, by considering multiple A- and B-site cations in double perovskites. They used SISSO,48 which is one of the compressed-sensing techniques, for formulating this new factor. Xu et al.49 built a machine-learning model using the data available in the Materials Project database32 to identify formable single and double B-cation perovskites. Similar to formability, machine-learning models have also been built to study the thermodynamic stability of perovskites.50 Li et al.51 predicted the thermodynamic phase stability of ABO3 compounds based on convex hull analysis. Balachandran et al.52 used the ionic radii of elements to build an ML model for predicting new cubic perovskite materials using an experimental database of 390 perovskite compounds. Talapatra et al.53 studied the overlap between the formable and thermodynamically stable perovskites predicted using machine-learning models, and predicted around 300 new compounds that are stable in their perovskite structures.
All these works provide some critical insights for the prediction and realization of new perovskite compounds. However, the vibrational (or thermal) free-energy content of a composition, at any given temperature, plays a very crucial role in determining its thermodynamic phase stability. The Ehull analysis reported in most literature focusing on high-throughput studies, however, does not take this into account. Hence, the energetically most stable phase, as concluded, may still not be the most stable phase thermodynamically. To completely define the thermodynamical stability, one needs to incorporate the vibrational free energy of the compound at constant volume (Helmholtz free energy, FH) and at constant pressure (Gibbs free energy, FG). The vibrational free energy can be estimated from the phonon frequency of the vibrational modes of a given composition. Using ab initio DFT-based methods, one can either employ finite displacement or perturbative approaches to calculate these phonon frequencies. However, the procedure is non-trivial, since it involves several force calculations for the displaced or the perturbed structures. The exact number of calculations that need to be performed depends on the symmetry of the considered unit cell. The lower the symmetry of the crystal, typically the larger the number of force calculations that are required for constructing the dynamical matrix.54,55 Hence, most of the materials databases do not contain the vibrational properties of the compounds. Few research groups have tried to predict free energy using ML models with their own databases. Legrain et al.56 built ML models for the vibrational free energy and entropy of 292 compounds from Inorganic Crystal Structure Database (ICSD) entries in aflow.org repositories. They used the elemental descriptors to predict various thermal properties. They achieved an RMSE of 18.76 meV per atom for the vibrational free energy. Bartel et al.57 used the SISSO (sure independence screening and sparsifying operator) approach to predict the experimental Gibbs free energy of inorganic compounds. Yoon et al.58 used adaptive learning techniques to build a generalized ML model for the Gibbs free energy of 40000 ICSD compounds. However, these works focus on the generalization of models to predict the Gibbs free energy, ignoring the temperature-dependent phase transition, which cannot be represented by just the chemical composition. Building ML models by incorporating inherent features of a compound class is required to solve this problem.
We, therefore, perform harmonic vibrational calculations on a set of perovskite compounds from the ICSD database with Ehull ≤ 80 meV. We first classify them as vibrationally stable (with all real modes) or unstable (with large phonon instabilities) by analyzing their harmonic phonon spectrum. A considerable number (∼32%) of compounds lying at the surface of the hull are found to be vibrationally unstable. We thus prepare a data set of 80 vibrationally stable compounds, along with their elemental and structural features. We employ one unique strategy to predict the variation of vibrational free energy, FH, as a function of temperature. We find that a third-order polynomial fit is adequate to represent the temperature variation of FH. Thus, we use ML to predict the coefficients of the polynomial fit. A flowchart illustrating the steps in our study is given in Fig. 1. Using our approach, a very accurate ML model is built to predict the variation of free energy with temperature. Thus, using only elemental and a few simple compound descriptors, the vibrational free energy of the compounds can be predicted, in a way that is fast, reasonably accurate and computationally inexpensive. Finally, we analyze the role of different descriptors in constituting the vibrational free energy.
![]() | ||
Fig. 1 Schematic diagram showing the criteria for screening of vibrationally stable perovskite compounds and construction of the data set containing the descriptors for machine learning. |
Since the compound descriptors may vary with different numerical settings used in the theoretical calculations, we extract them from the output of our DFT calculations (cf. Computational details). Thereby, 12 compound descriptors are collected (cf.Table 2), which are explained individually below.
1. The density, ρ, of the relaxed structures (in kg m−3).
2. Cohesive energy, Ecoh per f.u. (in eV), of the compound as calculated from its total energy (E[ABX3]) and that of the isolated atoms (taken as the reference chemical potential, viz. μA, μB and μX)
Ecoh[ABX3] = E[ABX3] − μA − μB − 3μX | (2) |
3. The tolerance factor (1) and octahedron factor (3), which are calculated using the Shannon ionic radii61 of the A, B, and X ions;
![]() | (3) |
4. The nearest-neighbor distances between various atoms in the unit cell of the relaxed structure are generalized and considered as a set of descriptors. To keep all descriptors identical for the cubic as well as non-cubic structures, the means (AB,
AX, and
BX) and standard deviations (σAB, σAX, and σBX) of the three non-unique bond distances (viz. A–B, B–X, and A–X) are extracted from the relaxed geometry.
FH = F(0)vib − TS | (4) |
FH = A × T3 + B × T2 + C × T + D | (5) |
![]() | (6) |
The RMSE indicates the variance of the residuals, which is given by:
![]() | (7) |
Category | Descriptors |
---|---|
Elemental | Atomic number (Z), atomic mass (M), period (P), and group (G) number of the constituent elements in the periodic table, first ionization energy (IEI), second ionization energy (IEII), electron affinity (EA), Pauling electronegativity (χP), Allen electronegativity (χA), van der Waals radius (rvdw), covalent radius (rcov), atomic radius (ratomic), melting point (MP), boiling point (BP), density (ρ), heat of fusion (ΔHfus), heat of vaporization (ΔHvap), thermal conductivity (κ) and specific heat (cv) |
Compound | Density (ρ), cohesive energy per f.u. (Ecoh), tolerance factor (t), octahedron factor (μ), and mean and standard deviation of neighbour distances (![]() ![]() ![]() |
The thermal properties of the vibrationally stable compounds are further calculated. The phonon spectrum (Fig. 2(a)) and thermal properties of BaLiF3 have been discussed as a representative case study. The Helmholtz free energy FH (kJ mol−1), entropy S (J K−1 mol−1) and specific heat at constant volume Cv (J K−1 mol−1) are plotted as a function of temperature T in Fig. 2(b). As already discussed in the methods section, the main motivation of our study is to learn the T dependence of FH. Fig. 2(c) shows the comparison of the 2nd- and 3rd-order polynomial fit for the Helmholtz free energy FH as a function of T. The 3rd-order fit (eqn (5)) ideally captures the variation. Hence, the four coefficients (A, B, C and D) of this fit are learned and predicted using ML.
Thus, the final data set of 80 perovskite compounds with 57 descriptors in total is obtained. As the first step, we analyze the feature correlation to eliminate the strongly correlated features. A high correlation between the descriptors may lead to overfitting of the model and hence, may decrease the accuracy of the model. The Pearson correlation coefficient heat maps for both elemental and compound descriptors are shown in Fig. 3. Pearson correlation coefficient between features is calculated as given below.
![]() | (8) |
![]() | ||
Fig. 3 Correlation heat map plotted for (a) descriptors of one of the elements, viz. the element B presented as a representative case study (the heat maps for elements A and X are provided as a part of the ESI†), and (b) compound descriptors of perovskites in our database. Descriptors with a correlation score of ±0.9 are concluded to be strongly correlated (symbols have their usual meaning as given in Table 1). |
Using the above correlation score, strongly correlated features (with correlation of >±0.9) are dropped. Thus, out of the correlated features, viz. atomic number, atomic mass, and period, only atomic mass is chosen based on it's highest feature importance. The higher importance of atomic mass may be related to the crucial role of mass in lattice vibration. Similarly, all three different radii are found to be correlated and thus, only the van der Waals radius of A and X elements is retained. Also, the Pauling electronegativity, boiling point, and heat of vaporization are removed, while retaining their significantly correlated counterparts i.e., Allen electronegativity, melting point, and heat of fusion, respectively. The correlation between compound descriptors is further calculated, which has been shown in Fig. 3(b). The compound features are found to be uncorrelated. After dropping all the correlated descriptors, a total of 52 descriptors are left with which the ML models are built.
ML is performed on the processed data set to build a regression model for the zero-point energy (ZPE), i.e. the coefficient D of eqn (5), of the compounds. A 10-fold cross-validation (CV) method is used for evaluating our models to make sure that there is no bias due to the small size of our data set. In the first step, the performance of models is compared to decide the scaling method that is to be used. Since the descriptors used in our models belong to different dimensions, their magnitude varies in different ranges and thus, they need to be scaled to improve the performance of our models. This is illustrated in Fig. 4(a), where the performance of the models for unscaled, standard scaled and min-max scaled datasets is compared. As can be observed in the figure, scaling improves the performance of the models considerably for GPR-1 and GPR-2. In particular, the performance of GPR-2 is found to be very poor for the unscaled data and hence it's R2 score is not shown in the figure. The standard scaler is chosen, which gives the best results (with GPR-2) among all the models. To understand the critical role of descriptors in the target property, the descriptors are classified into three groups, viz. (a) elemental, (b) compound, and (c) mixed. Subsequently, the R2 scores of the models built using the elemental, compound, and mixed descriptor sets are compared. The final set of 80 compounds and their compound descriptors used in our data set are listed in Tables S2 and S3 of the ESI.†Fig. 4(b) compares the performance of the chosen algorithms for the different descriptor sets. As can be seen from the bar chart, the combined set of descriptors yields the best performance, while the elemental descriptor set is a close second.
Once the descriptor set and scaling methods are decided, the number of descriptors required is optimized to build the best-performing model for the ZPE. The performance of various ML algorithms is compared for the set of top 50, 40, 30, 20, and 10 descriptors in our data set. The top descriptors are decided using the SelectKBest method provided by scikit-learn. The comparison of the R2 scores for different models is shown in Fig. 4(c). As can be observed, the GPR-2 algorithm gives the best results with a CV score of 0.95 with the top 30 features. Except for the LASSO algorithm, the performance of other models is found to be comparable.
The top 15 descriptors for describing the ZPE are shown with their importance score as per the SelectKBest method, which is shown in Fig. 4(d). The atomic number and electron affinity of the X atoms are found to be very important descriptors. This is mainly due to the wide difference in the range of ZPE that is observed depending upon the variation of X elements in the perovskite. Other important features are found to be the bond-distance averages and standard deviations, the radius of A and X elements, heat of fusion of the B element, the first ionization energy of the A atom, etc. These properties directly or indirectly contribute to the bond strength of the atoms in the crystal and hence to the vibrational frequencies of phonons.
A CV score of 0.95 is obtained, from which it can be concluded that our data set does not have a huge bias in the samples and hence the model is built by splitting the data set into test and train subsets. The train subset is used to build the ML model and the test set is used as the unseen data to validate the model’s performance. To decide the optimum size of the training subset, we compare the R2 score and RMSE as a function of the number of data samples in the train set. Both R2 score and RMSE saturate at a train-subset size of 64 samples. Hence, 64 out of 80 compounds are randomly selected for the train set using scikit-learn methods and the remaining 16 compounds are used for testing. With this data set, we obtain a best test R2 score of 0.97 and RMSE of 1.84 kJ mol−1 (18.9 meV per atom) for the zero-point energy, i.e. the coefficient D, which is comparable to the results of Legrain et al.56 The scatter plot of the testing set is shown in Fig. 5(d).
The above-mentioned approach is also used to build the ML models for coefficients A, B, and C of eqn (5) (as shown in Fig. 5(a)–(c)). Comparisons of cross-validation R2 scores of different algorithms for coefficients A, B, and C are shown in Fig. S3 of the ESI.† Fig. S4 in the ESI† depicts the important features in the models built for these coefficients. The GPR-2-based model worked best for coefficient A with an R2 score of 0.88 with 40 descriptors. For coefficient B, the maximum R2 score using GPR-2 is obtained as 0.93 with standard scaling and 40 top descriptors. For coefficient C, GPR-2 gave a best CV score of 0.93 with 30 top descriptors using standard scaling. The R2 scores obtained for these models using test–train splitting are found to be very high (i.e. 0.94).
To increase the accuracy of predictions and to gain further insight into the role of descriptors, SISSO48 is used. With this method, the descriptors can be reduced to highly important compressed dimensions obtained by combining the original descriptors. Details of the seven dimensions obtained from SISSO are listed in Table 2, and yield a very highly accurate model for the ZPE. By analyzing the top features, we observe that most descriptors found in these complex dimensions are also found to have good correlation with our target property. For the ZPE, SISSO gives a dimension with a correlation score as high as 0.96. This feature involves a square root relation to the atomic radius of B and average AX bond length. The bond distance and radii of the constituent elements dominate 6 out of the 7 important features given by the SISSO algorithm. Some exceptions are the electron affinities and the thermal features like the melting points of constituent elements. Using the non-linear dimensions from SISSO, a linear regression model is trained, which yields a maximum R2 score of 0.99 and RMSE of 0.74 kJ mol−1 (8 meV per atom). Fig. 6 shows the scatter plot of predicted vs. calculated ZPE of the perovskite compounds in our data set. The ML models built for the different coefficients are used to predict the FH of several unknown compounds that are not present in our original dataset. As a tentative case study, the Imma and Pmm phases of EuNbO3 are selected, which are known to show phase transition,63i.e., at room temperature, EuNbO3 exists in the orthorhombic (Imma) structure and undergoes phase transition to the cubic (Pm
m) structure at 460 K. To verify the capability of our model to predict the FH and the thermodynamic phase transition, the total energy, i.e., the sum of the DFT static energy (electronic and ionic) and free energy (predicted from the model), is plotted as a function of temperature for both the phases in Fig. 7. The phase transition can be seen at ∼450 K, which is highlighted by the crossover between the blue (Imma phase) and the green (Pm
m) curves. Similar verification is also performed for phase transitions within different phases of BaBiO3, KCaCl3, CsSrCl3 and LaAlO3, which are also found to be in agreement with the literature.64–67 Even in compounds (viz. ErMnO3 and EuZrO3) where phase transitions are not observed, the predicted stabilities of different phases are found to be in agreement with literature. The results for these compounds are given in the ESI,† and the inputs and codes used for these predictions are supplied in the github repository link included in the ESI.†
![]() | ||
Fig. 6 Calculated vs. predicted zero-point energy (ZPE) plotted using the 7 dimensions obtained from the SISSO algorithm. |
One of the main goals of our study is to understand the underlying science in the variation of the free energy and compound properties. By analyzing the correlation between the different descriptors and their importance in the ML models with good accuracy, further insights can be obtained. The analysis of the correlation between the descriptors and the target variables brings out some trends in the nature of the vibrational free energy. It is observed that for higher free-energy values, the 3rd-order variation of temperature, i.e., the contribution arising from coefficient A, is negligible. Also, an increase in the ZPE leads to a decrease in the curvature of the variation of the ZPE with temperature. This can also be observed in the bar charts in Fig. 8(a and b). The correlation of −0.99 between coefficients A and B can also be exploited to increase the prediction accuracy of models for coefficient A. Coefficient B (predicted from our model) can be incorporated as a descriptor to predict coefficient A.
A comparison of elemental descriptors shows that descriptors corresponding to the X-site element are more correlated to our target properties as compared to those corresponding to A- and B-site elements. Fig. 8(a) shows a clear trend in the variation of the ZPE (coefficient D) as the X-site element is varied. ZPE values decrease as we move down the periodic table within the same group (example: O → S → Se). Similarly, the ZPE decreases as we move to the right in the periodic table (example: N → O → F). This correlates with the variation in the atomic radius of these elements, which in turn plays a role in the bond formation in the perovskite structure. Elements at the A and B sites do not show such high correlation with our target properties (Fig. S6 and S7 in the ESI†). The atomic number, electron affinity and atomic radius of the X element have >0.7 correlation with all four coefficients in our study. Descriptors of element A also follow a similar trend, although with a lower correlation. Descriptors of element B show low correlation (<0.5) with the coefficients. Similarly, when comparing the correlation of elemental descriptors to the different coefficients, the element at site B becomes more prominent in determining coefficient D or the ZPE. These correlations are translated exactly to the feature importance for different ML models built in our study.
Comparing the correlation of compound descriptors to our target variables, it is more evident that compound descriptors have higher correlation to the coefficients of fit. In particular, mean bond lengths between the A and B elements and A and X elements are found to be very important. However the standard deviation between these bond lengths becomes less important to coefficients A, B and C as compared to D. Fig. 8(c) shows the scatter plot of ZPE as a function of mean A–X bond length, indicating that moderate correlation exists between these two. However, this correlation is enhanced by SISSO using non-linear operators to combine multiple descriptors, as shown in Fig. 8(d). The dependence of FH on the bond lengths is driven by its direct correlation with the force constant required to form the dynamical matrix, which depends on the strength of the bonds and hence, the bond lengths. Thus, this is directly reflected in the top descriptors obtained in our models.
![]() | (9) |
![]() | (10) |
Footnote |
† Electronic supplementary information (ESI) available: The data set and the codes and models used in this work are available in github repository and link is provided in the supplementary pdf file. See DOI: https://doi.org/10.1039/d3ma00216k |
This journal is © The Royal Society of Chemistry 2023 |