Suresh Bishnoi,a R. Ravinder,a Hargun Singh Grover,a Hariprasad Kodamana*b and N. M. Anoop Krishnan*ac
aDepartment of Civil Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India. E-mail: krishnan@iitd.ac.in
bDepartment of Chemical Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India. E-mail: kodamana@iitd.ac.in
cDepartment of Materials Science and Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
First published on 15th December 2020
Among machine learning approaches, Gaussian process regression (GPR) is an extremely useful technique for predicting composition–property relationships in glasses. The main advantage of GPR over other machine learning methods is its inherent ability to provide the standard deviation of its predictions. However, the method remains restricted to small datasets due to the substantial computational cost associated with it. Herein, using a scalable GPR algorithm, namely, kernel interpolation for scalable structured Gaussian processes (KISS-GP), along with massively scalable GP (MSGP), we develop composition–property models for inorganic glasses. The models are based on a large dataset with more than 100 000 glass compositions, 37 components, and nine crucial properties: density, Young's, shear, and bulk moduli, thermal expansion coefficient, Vickers’ hardness, refractive index, glass transition temperature, and liquidus temperature. We show that the models developed here are superior to state-of-the-art machine learning models. We also demonstrate that the GPR models can reasonably capture the underlying composition-dependent physics, even in regions where there are very few training data. Finally, to accelerate glass design, the models developed here are shared publicly as part of a package, namely, Python for Glass Genomics (PyGGi, see: http://pyggi.iitd.ac.in).
Since ML methods are primarily data-driven, a model's reliability depends strongly on the available data. To this end, Gaussian process regression (GPR),22 a nonparametric ML model, presents an excellent candidate. GPR employs a probabilistic approach that makes inferences on new data by learning the underlying distribution (mean and covariance) of the available data.22 Note that various problems in mechanics and materials science employ a probabilistic framework (including GPR and Bayesian inference) to estimate material parameters.9,23–27 It has been shown that, for small datasets, GPR models are more suitable than NN models for providing accurate composition–property predictions, along with their confidence intervals, in oxide glasses.9 However, for the large datasets available in materials science, training the conventional GPR, which has O(n³) time and O(n²) space complexity for a dataset of size n, is computationally prohibitive.
Herein, using scalable GPR algorithms, namely, kernel interpolation for scalable structured Gaussian processes (KISS-GP)28 and massively scalable Gaussian processes (MSGP),29 we address the challenge of developing reliable GPR models for predicting nine relevant properties of functional glasses: density, Young's, shear, and bulk moduli, thermal expansion coefficient (TEC), Vickers’ hardness, refractive index, glass transition temperature (Tg), and liquidus temperature. The models are developed based on a large dataset with more than 100 000 glasses and 37 components. These are the largest models developed to predict composition–property relationships in inorganic glasses. We show that KISS-GP, along with MSGP, provides rigorous models for large datasets that are superior to the state-of-the-art deep neural network (DNN) models.21 Further, the models provide estimates of the uncertainty associated with the predictions, making them more reliable and robust than DNN models. Overall, we show that the methodology presented here can be used to develop GPR models for problems with large training datasets. Finally, the models developed here are made available as part of a software package designed for accelerating glass discovery, namely, Python for Glass Genomics (PyGGi, see: http://pyggi.iitd.ac.in).
$y = f(\mathbf{x}) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$ and $f \sim \mathcal{GP}\big(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\big)$    (1)
Here, the covariance function is defined as
$k(\mathbf{x}, \mathbf{x}') = \mathbb{E}\big[(f(\mathbf{x}) - m(\mathbf{x}))\,(f(\mathbf{x}') - m(\mathbf{x}'))\big]$    (2)
In the GP literature, k(x, x′) is also termed the kernel function of the GP. A widely used rationale for selecting the kernel function is that the correlation between any two points should decrease as the distance between them increases. Some popular kernels in the GP literature are
(3)
(4)
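As an illustration of such a kernel (not necessarily one of the forms shown in eqns (3) and (4)), a minimal NumPy sketch of the widely used squared-exponential (RBF) kernel is given below; the length-scale and variance values are placeholder assumptions, not parameters from this work.

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel: k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2)).

    The correlation decays as the distance between the two inputs grows,
    following the kernel-selection rationale described in the text.
    """
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * (X1 @ X2.T))
    return variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

# Example: covariance matrix for five 37-component compositions (random placeholders)
X = np.random.rand(5, 37)
K = rbf_kernel(X, X)   # 5 x 5 kernel (covariance) matrix
```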
Suppose we have a set of test inputs X* for which we are interested in computing the output predictions. This warrants sampling a set of outputs f* ≜ [f(x1*),…,f(xn*)] such that f* ∼ N(0, K(X*, X*)), with the mean and covariance as
(5)
By the definition of GP, the new and the previous outputs follow a joint Gaussian distribution as
$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K(X,X) + \sigma_\varepsilon^2 I & K(X,X_*) \\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right)$    (6)
$\mathrm{Mean}(\mathbf{f}_*) = K(X_*,X)\,\big(K(X,X) + \sigma_\varepsilon^2 I\big)^{-1}\mathbf{y}$    (7)
$\mathrm{Covariance}(\mathbf{f}_*) = K(X_*,X_*) - K(X_*,X)\,\big(K(X,X) + \sigma_\varepsilon^2 I\big)^{-1} K(X,X_*)$    (8)
Eqns (7) and (8) are employed to make new predictions using the GPR. It should be noted that the experimental values themselves may contain errors, which may not be normally distributed. For example, the errors in the liquidus temperature may be predominantly in one direction due to the kinetics involved with crystallization at temperatures below the liquidus temperature. The present model does not take such variations into account, and the noise in the experimental values is modeled as a normal distribution.
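For reference, a minimal NumPy sketch of the exact (dense) GP predictions of eqns (7) and (8) is given below; this is the standard computation whose cost motivates the scalable approximations discussed next, and all variable names are illustrative.

```python
import numpy as np

def gp_posterior(K_train, K_cross, K_test, y, noise_var):
    """Exact GP predictive mean and covariance, following eqns (7) and (8).

    K_train  : K(X, X),   n x n kernel matrix over training inputs
    K_cross  : K(X*, X),  n* x n kernel matrix between test and training inputs
    K_test   : K(X*, X*), n* x n* kernel matrix over test inputs
    y        : training targets (length n)
    noise_var: observation noise variance, sigma_eps^2
    """
    n = K_train.shape[0]
    A = K_train + noise_var * np.eye(n)                       # K(X,X) + sigma^2 I
    alpha = np.linalg.solve(A, y)                             # O(n^3) solve dominates the cost
    mean = K_cross @ alpha                                    # eqn (7)
    cov = K_test - K_cross @ np.linalg.solve(A, K_cross.T)    # eqn (8)
    return mean, cov
```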
The kernel hyperparameters θ are learned by maximizing the log marginal likelihood of the observations,
$\log p(\mathbf{y}\,|\,\theta) \propto -\left[\mathbf{y}^{\mathsf{T}} \big(K_\theta + \sigma^2 I\big)^{-1}\mathbf{y} + \log\big|K_\theta + \sigma^2 I\big|\right]$    (9)
However, evaluating (Kθ + σ²I)⁻¹y and log|Kθ + σ²I| requires O(n³) computations and O(n²) storage, respectively.
The subset of regressors (SoR)30 and the fully independent training conditional (FITC)31 approximations are traditional approaches used to scale GP inference. Recently, Wilson et al.28 introduced the structured kernel interpolation (SKI) framework, which generalizes inducing-point methods such as FITC and SoR for scalable GP inference. For instance, the kernel in the SoR approach, kSoR, is computed as
$k_{\mathrm{SoR}}(\mathbf{x}, \mathbf{x}') = \mathbf{k}_{\mathbf{x}U}\, K_{UU}^{-1}\, \mathbf{k}_{U\mathbf{x}'}$    (10)
where k_xU (size 1 × m), K_UU⁻¹ (size m × m), and k_Ux′ (size m × 1) are covariance matrices generated from the exact kernel k(x, x′) for a set of m inducing points [u1,…,um]. Under the SKI framework, the exact kernel is replaced with an approximate kernel for fast computation by modifying k_SoR, considering K_XU ≈ WK_UU, where W is an n × m interpolation matrix that is extremely sparse. Therefore, eqn (10) can be rewritten as
$K_{\mathrm{SoR}} = K_{XU}\, K_{UU}^{-1}\, K_{UX} \approx W K_{UU} K_{UU}^{-1} K_{UU} W^{\mathsf{T}} = W K_{UU} W^{\mathsf{T}} = K_{\mathrm{SKI}}$    (11)
This general approach to approximating GP kernel functions is the basic framework of SKI,40 which in turn reduces the computational expense considerably, to O(n).
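The following rough NumPy sketch illustrates the SKI construction of eqn (11). For simplicity, the interpolation matrix W here assigns each input to its single nearest inducing point, whereas the actual SKI framework uses local (e.g., cubic) interpolation on a structured grid; the kernel, inducing points, and weights are illustrative assumptions.

```python
import numpy as np
from scipy import sparse

def ski_kernel_matrix(X, U, exact_kernel):
    """Approximate K(X, X) by K_SKI = W K_UU W^T, as in eqn (11).

    X : (n, d) inputs, U : (m, d) inducing points.
    W is an n x m sparse interpolation matrix; here each row simply selects
    the nearest inducing point (one non-zero entry per row).
    """
    n, m = X.shape[0], U.shape[0]
    dists = np.linalg.norm(X[:, None, :] - U[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    W = sparse.csr_matrix((np.ones(n), (np.arange(n), nearest)), shape=(n, m))
    K_UU = exact_kernel(U, U)          # small m x m exact kernel matrix
    A = W @ K_UU                       # (n, m) sparse-dense product, cheap in n
    return W @ A.T                     # W K_UU W^T (K_UU is symmetric), n x n
```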
The predictive mean is then approximated as
$\mathrm{Mean}(\mathbf{f}_*) \approx W_*\, K(U,U)\, W^{\mathsf{T}} \big(K(X,X) + \sigma_\varepsilon^2 I\big)^{-1}\mathbf{y}$    (12)
This is achieved by approximating K(X*, X) using SKI, as given by eqn (11).36 Note that the term K(U,U)Wᵀ(K(X,X) + σε²I)⁻¹y is pre-computed during training, reducing the cost of online computation to O(1). Along similar lines, the predictive covariance is computed as
$\mathrm{Covariance}(\mathbf{f}_*) \approx \mathrm{diag}\big(K(X_*,X_*)\big) - \mathrm{diag}\big(K(X_*,X)\,(K(X,X) + \sigma_\varepsilon^2 I)^{-1} K(X,X_*)\big)$    (13)
The diagonal operator in eqn (13) is a consequence of the fact that the kernel matrices are highly sparse in the off-diagonal directions. The covariance computation in eqn (13) can be further simplified using SKI as follows
$\mathrm{Covariance}(\mathbf{f}_*) \approx \mathrm{diag}\big(K(X_*,X_*)\big) - W_*\,\mathrm{diag}\big(K(U,X)\,(K(X,X) + \sigma_\varepsilon^2 I)^{-1} K(X,U)\big)$    (14)
Here, the term diag(K(U,X)(K(X,X) + σε²I)⁻¹K(X,U)) can also be pre-computed,36 reducing the overall computational cost of evaluating the predictions to O(1).
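A schematic sketch of the pre-computation behind eqn (12): everything that does not depend on the test input is cached once during training, so each new predictive mean reduces to a product of a sparse interpolation row with the cached vector. The dense solve and the function names below are illustrative simplifications; scalable implementations use iterative solvers that exploit the structure of K_UU and the sparsity of W.

```python
import numpy as np

def precompute_mean_cache(W, K_UU, y, noise_var):
    """Offline step of eqn (12): returns the m-vector K(U,U) W^T (K_SKI + sigma^2 I)^{-1} y.

    W        : n x m interpolation matrix (dense here for clarity; sparse in practice)
    K_UU     : m x m exact kernel on the inducing points
    y        : training targets (length n)
    noise_var: observation noise variance
    """
    n = W.shape[0]
    K_ski = W @ K_UU @ W.T                               # SKI approximation of K(X, X)
    alpha = np.linalg.solve(K_ski + noise_var * np.eye(n), y)
    return K_UU @ (W.T @ alpha)                          # cached m-vector

def predictive_mean(w_star, mean_cache):
    """Online step of eqn (12): one (sparse) interpolation row times the cached vector."""
    return w_star @ mean_cache
```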
We train models on this dataset employing KISS-GP (see Methods) with hyperparameter tuning to develop optimized models. Note that we use KISS-GP28 with Lanczos variance estimates (LOVE),33 which significantly reduces the computational time and storage complexity (see Methods). Further, the prediction for high-dimensional data is carried out using MSGP. Due to the O(1) prediction cost of MSGP,29 the computational resources associated with the prediction are independent of the size of the data, thus enabling faster predictions (see Methods for details). Fig. 2 shows the predicted values of density, Young's, shear, and bulk moduli, TEC, Vickers’ hardness, refractive index, Tg, and liquidus temperature in comparison to the measured experimental values for the trained GPR model with KISS-GP and MSGP. Since there are many overlapping points, a heat map is used in Fig. 2, wherein the coloring scheme represents the number of points per unit area for each property. Note that only the test set is plotted in the figure, although the R² values associated with the training and validation sets are also provided. The inset for each property shows the probability density of the prediction error with a 90% confidence interval. We observe that the R² values for all the properties are equal to or greater than 0.8, suggesting well-trained models. Further, the R² values of the training, validation, and test sets are comparable, thereby confirming the goodness-of-fit of the models.
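The text reproduced here does not name the software stack used; as a minimal sketch only, the workflow described above (an SKI/KISS-GP covariance, training by maximizing the marginal likelihood of eqn (9), and LOVE-accelerated predictive variances) can be assembled with the GPyTorch library, which implements SKI (GridInterpolationKernel) and LOVE (fast_pred_var). The grid size, kernel choice, optimizer settings, and two-dimensional toy data are assumptions for illustration; in the actual models the 37-component input is first reduced in dimensionality by a deep kernel layer.

```python
import torch
import gpytorch

class KissGPRegressor(gpytorch.models.ExactGP):
    """GP whose covariance uses SKI: a GridInterpolationKernel wrapping an RBF kernel."""

    def __init__(self, train_x, train_y, likelihood, grid_size=100):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.GridInterpolationKernel(
            gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel()),
            grid_size=grid_size,
            num_dims=train_x.size(-1),
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

# Toy 2-D inputs standing in for (dimensionality-reduced) glass compositions
train_x = torch.rand(500, 2)
train_y = torch.sin(3.0 * train_x.sum(dim=-1)) + 0.05 * torch.randn(500)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = KissGPRegressor(train_x, train_y, likelihood)

# Train by maximizing the marginal log likelihood (eqn (9))
model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for _ in range(50):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

# Predict with LOVE-accelerated predictive variances
model.eval()
likelihood.eval()
test_x = torch.rand(100, 2)
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred = likelihood(model(test_x))
    mean, std = pred.mean, pred.variance.sqrt()   # predictive mean and standard deviation
```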
Now, we demonstrate a key attractive feature of the proposed GPR-based approach: its ability to provide the uncertainty associated with each prediction. First, we analyze the standard deviation predicted by KISS-GP for the test dataset, that is, the dataset unseen by the model. Fig. 3 shows the histogram of the absolute values of the standard deviations |σ| corresponding to the predictions for compositions in the test dataset. Note that the |σ| mentioned here is the standard deviation predicted by KISS-GP and not the error in the predictions. This procedure is repeated for all nine properties considered. We observe that the distribution is unimodal, with a peak close to zero in most cases, which suggests that the models are reasonable, with the predictions for most of the data points in the test set exhibiting high confidence. We also observe that the distributions for some properties, for example, shear modulus, Young's modulus, liquidus temperature, and hardness, are notably broader than the others. Specifically, hardness exhibits the largest standard deviation among all the properties, suggesting lower reliability of its predictions. Such decreased reliability for hardness could be attributed to the spread in the dataset and also to the nature of the property itself. Unlike other properties such as Young's modulus, hardness is not an intrinsic material property and depends strongly on the measurement techniques and conditions used.34–36 Thus, the final hardness model, developed on a dataset with higher noise, exhibits a larger standard deviation.
Second, we analyze the contribution of each input component in the glass composition toward the standard deviation of the prediction. To this end, we select the glass compositions in the test dataset that have a non-zero value of a given component. For these selected compositions, the standard deviations associated with the KISS-GP predictions are computed, and their mean value gives the mean standard deviation for that component. For example, to compute the mean standard deviation associated with SiO2 for Young's modulus, all glass compositions with non-zero SiO2 content are selected from the Young's modulus dataset, the standard deviation associated with the KISS-GP prediction for each of these compositions is computed, and the mean of these standard deviations provides the mean standard deviation associated with SiO2 for Young's modulus. The procedure is repeated for all 37 components and nine properties.
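A sketch of this per-component analysis is shown below, assuming the test-set compositions are held in a pandas DataFrame with one column per component; the column layout and the array of predicted standard deviations are illustrative inputs, standing in for the actual dataset and the KISS-GP predictions.

```python
import numpy as np
import pandas as pd

def mean_std_per_component(compositions: pd.DataFrame, pred_std: np.ndarray) -> pd.Series:
    """Mean predicted standard deviation for every component, as described in the text.

    compositions: test-set compositions, one column per component (e.g. 'SiO2', 'Na2O', ...)
    pred_std    : KISS-GP predictive standard deviation for each test composition
    """
    result = {}
    for component in compositions.columns:
        mask = compositions[component].to_numpy() > 0   # glasses containing this component
        if mask.any():
            result[component] = pred_std[mask].mean()
    return pd.Series(result).sort_values(ascending=False)

# Illustrative usage with random placeholders for compositions and predicted sigmas
comps = pd.DataFrame(np.random.rand(200, 3), columns=["SiO2", "Na2O", "B2O3"])
sigmas = np.random.rand(200)
print(mean_std_per_component(comps, sigmas))
```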
Fig. 4 shows the mean standard deviation of the predictions of Young's modulus corresponding to each of the input components considered. We observe that there is a direct correlation between the mean standard deviation and the frequency of the components in the dataset. Specifically, the components exhibiting high standard deviations are the ones with very few data points (see Fig. S1 in the ESI†). On the contrary, components that are present in many glasses, such as SiO2, Na2O, Al2O3, and CaO, to name a few, exhibit low standard deviation and spread in the predictions (see Fig. S1 in the ESI†). Similar behavior is observed for the other properties as well (see Fig. S4 in the ESI†). This observation agrees with the fact that the training for a particular input component improves when there is enough data associated with that component. Overall, the results suggest that the predictions for components present in a larger number of glasses are more reliable, and vice versa. As such, the performance of the model could be improved by increasing the number of glasses corresponding to those components for which the data are at present sparse. Note that the model does not exhibit any trend in prediction accuracy with respect to the number of components in the glass (see ESI†). That is, whether a glass consists of two components or ten, the predictive accuracy of the model is comparable.
Fig. 4 Mean standard deviation in the predictions of Young's modulus for each of the components present in the glass compositions.
Finally, to check whether the model can capture the underlying physics, we focus on a binary sodium borate (NB) glass series, that is, (Na2O)x·(B2O3)1−x. This glass series exhibits the well-known boron anomaly,34,37–39 wherein the properties vary in a highly non-monotonic fashion owing to the variable coordination of boron (three to four) with increasing sodium content. Specifically, most of the glass properties exhibit an inflection for sodium content between 30% and 50%. Fig. 5 shows the nine properties predicted for the binary NB glass with the 2σ and 3σ confidence intervals, along with the experimental values. First, we observe that the predictions closely match the experimental values for all the properties. Second, we observe that the model exhibits a significantly lower standard deviation in the domain where experimental data exist and a high standard deviation in the regions where data are not available. This suggests that the model exhibits increased reliability for interpolation, while confidence decreases for extrapolation. Third, and most interestingly, we observe that the model can capture the boron anomaly for all the properties. Notably, even for properties such as shear modulus, Young's modulus, refractive index, and hardness, for which only a few data points are available, the model captures the non-monotonic behavior associated with the boron anomaly, in agreement with the theoretical models.
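A sketch of how such a composition sweep with confidence bands can be generated from a trained model is given below; the component indices and the predict_with_uncertainty wrapper are hypothetical placeholders standing in for the trained GPR's prediction routine.

```python
import numpy as np

NA2O_INDEX, B2O3_INDEX = 0, 1   # hypothetical positions of Na2O and B2O3 in the input vector

def predict_with_uncertainty(comps):
    """Placeholder for the trained GPR prediction routine (returns mean and sigma)."""
    mean = np.full(len(comps), 60.0)    # dummy values, for illustration only
    std = np.full(len(comps), 5.0)
    return mean, std

x = np.linspace(0.0, 1.0, 101)          # Na2O fraction; the remainder is B2O3
comps = np.zeros((len(x), 37))          # 37-component input vectors
comps[:, NA2O_INDEX] = x
comps[:, B2O3_INDEX] = 1.0 - x

mean, std = predict_with_uncertainty(comps)
band_2sigma = (mean - 2 * std, mean + 2 * std)   # 2-sigma confidence band
band_3sigma = (mean - 3 * std, mean + 3 * std)   # 3-sigma confidence band
```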
To test the generality of the model, we choose a ternary sodium borosilicate composition, x(Na2O)·y(B2O3)·(1 − x − y)(SiO2). Fig. 6 shows the standard deviation of the predicted values over the entire range of the ternary using the trained KISS-GP model. The mean values of the properties, representing the best estimate of the model for this ternary, are provided in the ESI.† The compositions corresponding to the measured values in the original data (which may belong to the training, validation, or test set) are marked using black squares. We observe that the standard deviation in the predictions for compositions close to the original dataset is significantly lower. As the compositions move farther away from the original dataset, the standard deviation of the predicted values increases. This behavior is consistently observed for all the properties (see Fig. 6). This is because, in KISS-GP, the training is carried out by identifying the distribution that reduces the variance at the known data points (that is, the training data) to zero, or at least very close to it. As such, when the model is extrapolated to domains without any training data, the inference becomes poorer, as represented by larger standard deviation values. Nevertheless, we observe that the standard deviation for most of the properties remains relatively low, confirming high confidence in the values predicted by the model. Overall, we observe that KISS-GP allows the development of reliable composition–property models that quantify the uncertainty in the predictions when extrapolated over the entire compositional space.
Now, we compare the performance of the GPR models with some of the state-of-the-art ML models.21 Table 1 shows the R² values of KISS-GP in comparison to linear regression, XGBoost, and DNN (see ref. 21 for details) on the test dataset. Note that the dataset used for training all the models is the same.21 Further, only the test-set R² values are shown, to ensure a fair comparison of the models on unseen data. In terms of the R² values on the overall dataset, GP performs better than all the other methods. We observe that for seven out of the nine properties, KISS-GP outperforms all the other methods, including DNN, in terms of the test-set R² values. For some of these properties, for example, liquidus temperature, Young's modulus, and shear modulus, the increase in R² is in the range of 3–5%, which is a notable improvement in accuracy. These results confirm that the predictions obtained from KISS-GP and MSGP are reliable and superior to those of the other state-of-the-art methods presented in Table 1. Besides, the uncertainty quantification in the KISS-GP- and MSGP-based property predictions is quite useful for interpreting the model validity in experimentally unexplored regimes of the compositional space, a feature severely lacking in the other, deterministic models. At this point, it should be noted that although KISS-GP is able to capture the mean values of the function correctly, the prediction of the standard deviation is highly sensitive to the training process. This could be attributed to the deep kernel layer (DKL) present in the KISS-GP framework, which reduces the dimensionality of the input features while training the model. This is a disadvantage of KISS-GP in comparison to the classical GP, which can be addressed to a reasonable extent by hyperparameter optimization of the DKL and early stopping.
Table 1 R² values of different ML models on the test dataset

Property | Linear regression | XGBoost | DNN | KISS-GP
---|---|---|---|---
Bulk modulus | 0.75 | 0.87 | 0.89 | 0.90
Shear modulus | 0.77 | 0.86 | 0.88 | 0.91
Young's modulus | 0.78 | 0.84 | 0.86 | 0.89
Density | 0.92 | 0.95 | 0.95 | 0.96
Liquidus temperature | 0.60 | 0.79 | 0.80 | 0.85
Refractive index | 0.92 | 0.94 | 0.94 | 0.96
Thermal expansion coefficient | 0.67 | 0.78 | 0.80 | 0.79
Glass transition temperature | 0.79 | 0.88 | 0.90 | 0.92
Hardness | 0.62 | 0.77 | 0.80 | 0.74
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d0ma00764a
This journal is © The Royal Society of Chemistry 2021 |