Yuri Haraguchi,a Yasuhiko Igarashi,bc Hiroaki Imai,a and Yuya Oaki*ac
aDepartment of Applied Chemistry, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan. E-mail: oakiyuya@applc.keio.ac.jp
bFaculty of Engineering, Information and Systems, University of Tsukuba, 1-1-1 Tennodai, Tsukuba 305-8573, Japan
cJST, PRESTO, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012, Japan
First published on 18th December 2021
Data-scientific approaches have permeated chemistry and materials science. In general, these approaches are not easily applied to small data, such as experimental data in laboratories. Our group has focused on sparse modeling (SpM) for small data in materials science and chemistry. The controlled synthesis of 2D materials, involving improvement of the yield and control of the size, was achieved by SpM coupled with our chemical perspectives for small data (SpM-S). In the present work, the conceptual and methodological advantages of SpM-S were studied on real experimental datasets to enable comparison with other machine learning (ML) methods, such as neural networks. The training datasets consisted of ca. 40 explanatory variables (xn) and 50 objective variables (y) regarding the yield, size, and size distribution of exfoliated nanosheets. SpM-S provided more straightforward, generalizable, and interpretable prediction models and better prediction accuracy for new experiments treated as an unknown test dataset. The results indicate that machine learning coupled with our experience, intuition, and perspective can be applied to small data in a variety of fields.
SpM is a general concept for explaining whole data using a limited number of significant descriptors.31,32 The method has been widely used for data compression in image analyses,33–35 such as diagnosis using magnetic resonance imaging. In SpM, the dimension of data is reduced by ML. However, purely automatic selection of descriptors can reject significant descriptors and/or adopt insignificant ones, particularly for small data. In addition, noise, errors, and outliers in small data have negative effects on the extraction of the descriptors and on the prediction accuracy of the constructed models. Our group has studied coupling our experience, perspective, and intuition with SpM in all the processes, including preparation of the dataset and selection of the descriptors, toward the development of small-data-driven materials science and chemistry.24–30 In the initial stage, a small yet balanced dataset is prepared using experimental, computational, and literature data. The descriptors are extracted by ML, such as exhaustive search with linear regression (ES-LiR) and the minimax concave penalty and penalized linear unbiased selection algorithm (MC+).36–38 Then, the significant descriptors are selected from the results of ML on the basis of our chemical perspectives as prior knowledge. This process facilitates finding significant descriptors with chemically reasonable correlations to the targets and avoiding overtraining caused by the adoption of insignificant descriptors. In this way, chemical perspectives and implicit knowledge can be included in the predictors. The linear-regression models are constructed using a limited number of the selected descriptors. These straightforward linear-regression models are then applied to the exploration of optimized conditions and new materials. In the present work, SpM-S was compared with other ML methods on the same datasets to elucidate its advantages.
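To make the ES-LiR step concrete, the sketch below fits ordinary least squares on every descriptor subset up to a size cap and ranks the subsets. This is a minimal illustration on synthetic data, not the implementation used in this work: it ranks subsets by training RMSE, whereas a practical ES-LiR run would rank by cross-validation error, and the function name `es_lir` is hypothetical.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def es_lir(X, y, max_terms=3):
    """Exhaustively fit OLS on every descriptor subset up to max_terms
    and return the subsets sorted by training RMSE (best first).
    A real ES-LiR run would rank by cross-validation error instead."""
    results = []
    for k in range(1, max_terms + 1):
        for idx in combinations(range(X.shape[1]), k):
            model = LinearRegression().fit(X[:, idx], y)
            rmse = mean_squared_error(y, model.predict(X[:, idx])) ** 0.5
            results.append((rmse, idx))
    results.sort(key=lambda r: r[0])
    return results


# Synthetic small data: y depends only on descriptors 0 and 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=30)
best_rmse, best_subset = es_lir(X, y)[0]
print(best_subset)  # subsets containing the true pair {0, 2} rank at the top
```

Because training error only decreases when descriptors are added, the top-ranked subset may contain spurious extras alongside the true pair; this is exactly where the chemical-perspective filtering of SpM-S comes in.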
Three real experimental datasets on the yield, size, and size distribution of exfoliated transition-metal-oxide nanosheets were used for the training and validation (Fig. 1a–c). Linear regression was performed by the least absolute shrinkage and selection operator (LASSO) with variable selection and by multiple linear regression (ML-R) without variable selection (Fig. 1e). Nonlinear-regression models were constructed by support vector regression (SV-R), random forest regression (RF-R), and neural network regression (NN-R) (Fig. 1f). The results indicate that SpM-S provided more accurate, generalizable, and interpretable prediction models. Moreover, the combination of ML and researchers' insight enables the construction of better predictors even on small data.
Nanosheets, such as monolayers and few-layer structures, have attracted much interest as 2D materials with characteristic properties.39–41 Although a typical route to 2D materials is exfoliation of layered materials, these processes still face challenges. In general, exfoliation of precursor layered materials is not easily controlled by experimental parameters because of the unpredictable down-sizing processes in the liquid phase. Time- and effort-consuming microscopy analyses are required for the characterization of the resultant nanosheets. Therefore, new approaches are needed to achieve efficient control of the exfoliation behavior toward tailored synthesis of nanosheets. Application of ML to 2D materials has been studied in recent years. ML has been applied to the optimization of processes for bottom-up synthesis,42 exploration of precursor layered compounds,43 estimation of the thickness combined with image analyses of microscopy images,44,45 and improvement of the properties.46,47 In these previous studies, however, ML-assisted approaches were not applied to controlling the exfoliation processes that provide 2D materials. Our group has applied ML, namely SpM-S, to achieve high-yield and size-controlled synthesis of nanosheets through controlled exfoliation.24–29 However, the data-scientific validity of the construction of the predictors was not studied in our previous reports. In the present work, the validity and advantages of SpM-S were studied using the small datasets on the yield, lateral size, and lateral-size distribution of nanosheets in comparison with other ML methods.
On the basis of our chemical perspectives, the explanatory variables (xn: n = 1–41) were prepared from calculated values and literature data of physicochemical parameters related to the objective variables (Table 1). The original training datasets contained the following data: 30 y1 and 35 xn (n = 1–26, 28–32, 34, 36, 39, 40) for the yield, 48 y2 and 33 xn (n = 1, 3–24, 26, 28–32, 34, 36, 39, 40) for the size, and 54 y3 and 37 xn (n = 1, 3–24, 26, 27, 30–41) for the size distribution (Fig. 2a and Tables S1–S4 in the ESI†).24–27 The mean, standard deviation (SD), and sample number (n) of the training and test data are summarized in Table S1 in the ESI.† The predictors were constructed by SpM-S and the other ML methods using these original datasets on the yield, size, and size distribution of the nanosheets.
n | Parameters | xn for
---|---|---
Dispersion media | |
1 | Molecular weight | y1, y2, y3
2 | Molecular length^b | y1
3 | Melting point^a | y1, y2, y3
4 | Boiling point^a | y1, y2, y3
5 | Density^a | y1, y2, y3
6 | Relative permittivity^a | y1, y2, y3
7 | Vapor pressure^a | y1, y2, y3
8 | Viscosity^a | y1, y2, y3
9 | Refractive index^a | y1, y2, y3
10 | Surface tension^a | y1, y2, y3
11 | Heat capacity^b | y1, y2, y3
12 | Entropy^b | y1, y2, y3
13 | Enthalpy^b | y1, y2, y3
14 | Dipole moment^b | y1, y2, y3
15 | Polarizability^b | y1, y2, y3
16 | HSP-dispersion term^b | y1, y2, y3
17 | HSP-polarity term^b | y1, y2, y3
18 | HSP-hydrogen bonding term^b | y1, y2, y3
Guest molecules | |
19 | Molecular weight | y1, y2, y3
20 | Polarizability^b | y1, y2, y3
21 | Dipole moment^b | y1, y2, y3
22 | Heat capacity^b | y1, y2, y3
23 | Entropy^b | y1, y2, y3
24 | Enthalpy^b | y1, y2, y3
25 | Molecular length^b | y1
26 | Layer distance^c | y1, y2, y3
27 | Layer distance expansion^c | y3
28 | Composition (x)^c | y1, y2
29 | Interlayer density^c | y1, y2
30 | HSP-dispersion term^b | y1, y2, y3
31 | HSP-polarity term^b | y1, y2, y3
32 | HSP-hydrogen bonding term^b | y1, y2, y3
Guest–medium combinations | |
33 | Δ Polarizability (= x15 − x20)^b | y3
34 | Δ Polarizability (= |x33|)^b | y1, y2, y3
35 | Δ Dipole moment (= x14 − x21)^b | y3
36 | Δ Dipole moment (= |x35|)^b | y1, y2, y3
37 | Product of dipole moments (= x14 × x21)^b | y3
38 | Δ Heat capacity (= x11 − x22)^b | y3
39 | Δ Heat capacity (= |x38|)^b | y1, y2, y3
40 | HSP distance^b | y1, y2, y3
Host layers | |
41 | Bulk size^c | y3

^a Literature data. ^b Calculation data. ^c Experimental data.
[Eqn (1)–(3): the linear-regression models constructed by SpM-S for y1 (yield), y2 (size), and y3 (size distribution), respectively; the equation images were not recovered.]
As each xn was standardized so that its mean is 0 and its standard deviation is 1, the coefficients indicate the weights of the descriptors. The correlation between the estimated (y′) and measured (y) values was represented by a root-mean-square error (RMSE) of 17.9% for y1, 0.091 for y2, and 0.116 for y3 (the black circles in Fig. 3a).
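The role of this standardization can be illustrated with a short sketch on synthetic data (the descriptor scales below are hypothetical stand-ins): after z-scoring, the fitted coefficients of descriptors on very different physical scales become directly comparable weights.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a descriptor matrix whose columns sit on very
# different physical scales (e.g. boiling point vs. dipole moment).
rng = np.random.default_rng(1)
X_raw = np.column_stack([
    rng.normal(100.0, 40.0, size=50),  # large-scale descriptor
    rng.normal(2.0, 0.5, size=50),     # small-scale descriptor
])
y = 0.01 * X_raw[:, 0] + 1.0 * X_raw[:, 1] + rng.normal(scale=0.05, size=50)

# Z-scoring (mean 0, SD 1) puts all descriptors on one scale, so the
# fitted coefficients become directly comparable importance weights.
X_std = StandardScaler().fit_transform(X_raw)
coef = LinearRegression().fit(X_std, y).coef_
print(coef)  # both standardized weights are now of comparable magnitude
```

On the raw data the two coefficients would differ by two orders of magnitude purely because of the units; on the standardized data both land near 0.4–0.5, reflecting their comparable contributions.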
Fig. 3 Relationship between the estimated y′ and measured y for the training datasets (black circles) and relationship between the predicted y′ and measured y′′ for the test datasets (colored diamonds) for the yield (top), lateral size (middle), and size distribution (bottom) using the different predictors constructed by SpM-S (a), LASSO (b), ML-R (c), SV-R (d), RF-R (e), and NN-R (f). The red lines indicate the threshold to evaluate the prediction accuracy based on the rate of successful experiments in Table 2. |
The validation of these prediction models was performed experimentally by synthesizing the 2D materials under the predicted new conditions (Fig. 2e). The prediction models recommended host–guest–medium combinations for achieving high-yield and size-controlled synthesis of the exfoliated nanosheets. Prior to the experiments, the predicted y1′, y2′, and y3′ values were calculated virtually for 200–2500 unknown host–guest–medium combinations (Fig. 2e). The exfoliation experiments were performed only under the conditions providing the highest and lowest y1′, y2′, and y3′, namely 80, 80, and 46 conditions, respectively. Conditions under which the precursor layered materials could not be synthesized for experimental reasons were excluded from the list. The experimentally measured values (y1′′, y2′′, and y3′′) are summarized in the test data (Fig. 2f). The test dataset included 79 y1′′ and 2 xn (n = 18, 40) for the yield, 64 y2′′ and 8 xn (n = 4, 5, 16, 17, 19, 30, 31, 40) for the size, and 43 y3′′ and 5 xn (n = 8, 10, 21, 32, 41) for the size distribution (Tables S5–S7 in the ESI†). The relationship between the predicted (y′) and measured (y′′) values was represented by an RMSE of 28.4% for the yield, 0.451 for the size, and 0.240 for the size distribution (the colored diamonds in Fig. 3a). Although the prediction models were not highly precise, the averages of the higher and lower y1′′, y2′′, and y3′′ showed significant differences. Therefore, higher and lower y1′′, y2′′, and y3′′ values were achieved selectively in a limited number of experiments using the predictors.
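The RMSE used throughout to compare predicted and measured values is straightforward to compute; below is a minimal sketch with hypothetical predicted and measured yields (the numbers are illustrative, not the experimental data).

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical predicted vs. measured yields (%) for a few new conditions.
y_pred = [55.0, 10.0, 72.0, 25.0]
y_meas = [48.0, 18.0, 80.0, 20.0]
print(rmse(y_meas, y_pred))  # about 7.1% for these toy values
```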
In general, the exfoliation of layered materials is not easily controlled by experimental parameters, even with the chemical insights of senior researchers, because of the unpredictable down-sizing processes in the liquid phase. The prediction models facilitated the high-yield and size-controlled synthesis of the nanosheets with a limited number of time-consuming exfoliation experiments.25–27 For example, the number of experiments was reduced by 89% for the high-yield synthesis,25 98% for the lateral-size control,26 and 98% for the control of the size distribution.27 Moreover, the elucidation of the structural and chemical factors facilitates further understanding and control of the exfoliation processes.
LASSO, a typical algorithm for sparse modeling, extracts the descriptors using an L1-regularization term.49 The number of descriptors was reduced to 16 for y1, 16 for y2, and 13 for y3. ML-R uses all the descriptors, i.e. 35 for y1, 33 for y2, and 37 for y3, without extraction of the descriptors. The linear-regression models were constructed using these descriptors and then validated on the training and test datasets. Among the three linear-regression models, the relationship between the measured (y) and estimated (y′) values on the training datasets was not significantly different (the black circles in Fig. 3a–c), even though the number of descriptors and their coefficients were different. On the test datasets, SpM-S showed a better correlation between y′ and y′′ than the other two linear-regression models (the colored diamonds in Fig. 3a–c).
Support vector machine (SVM) solves binary classification problems by formulating them as convex optimization problems.50 The optimization problem entails finding the hyperplane with the maximum margin while correctly classifying as many training points as possible. SVM represents this optimal hyperplane with support vectors. The sparse solution and good generalization of the SVM lend themselves to adaptation to regression problems.51 SV-R constructs a nonlinear regression model from the data, including y and y′, with maximized margins between support vectors.52 RF-R provides a nonlinear model based on multiple decision trees and their ensemble to prevent overfitting.53 NN-R constructs a nonlinear model using graph architectures consisting of input, hidden, and output layers. Although these nonlinear prediction models generally gave smaller RMSE values for the training data (the black circles in Fig. 3d–f), they gave larger RMSE values for the test datasets (the colored diamonds in Fig. 3d–f).
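The train–test gap described here can be reproduced qualitatively on synthetic data by fitting a sparse linear model and the three nonlinear regressors side by side. The snippet below is a hedged sketch: the scikit-learn estimators stand in for the models in Fig. 3, and all hyperparameters are illustrative rather than those used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Synthetic data with a simple (linear) ground truth plus noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

models = {
    "LASSO": Lasso(alpha=0.05),
    "SV-R": SVR(kernel="rbf"),
    "RF-R": RandomForestRegressor(n_estimators=100, random_state=0),
    "NN-R": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}
gaps = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr)) ** 0.5
    te = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    gaps[name] = te - tr  # a large train-test gap signals overtraining
    print(f"{name}: train RMSE {tr:.3f}, test RMSE {te:.3f}")
```

On such small data the flexible nonlinear models typically fit the training set very closely while losing accuracy on the held-out set, mirroring the behavior of the black circles versus colored diamonds in Fig. 3d–f.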
SpM-S shows appropriate prediction accuracy for both the training and test datasets compared with the other ML methods, even though its simple linear-regression models contain a limited number of descriptors, i.e. two for y1, eight for y2, and five for y3. In these experiments, the exfoliation proceeds with intercalation of the dispersion media into the interlayer space of the layered composites. Then, the swelling induces exfoliation into nanosheets.54–57 The descriptors selected in the models (eqn (1)–(3)) are interpretable and reasonable in the context of the real exfoliation behavior.25–27 The selection of the descriptors assisted by the experience and perspective of researchers contributes to both the suppression of overtraining and the construction of a more generalizable model. The descriptors and coefficients of the linear-regression models indicate the important factors and their weights, respectively. Such an interpretable model is helpful for designing the next experiments. Although the descriptors and coefficients are explicit in the models constructed by LASSO and ML-R, the number of descriptors is too large for identification of the significant factors. In the nonlinear models, it is not easy to extract the positive and negative correlations of each descriptor for discussion based on our chemical insights. Recently, Hatakeyama-Sato and Oyaizu reported that a generative model was used to construct a better predictor by imputing the missing data.23 Although the prediction accuracy is improved by that method, the interpretability is lowered by the combination with more complex prediction models. In the present work, straightforward and interpretable prediction models were constructed by SpM-S with the assistance of our chemical perspectives using a limited number of descriptors. Therefore, SpM-S is regarded as a preferred method for small data.
The RMSE values of the models based on SpM-S were not always the smallest compared with those based on RF-R and SV-R (Fig. 4). The purpose of our work is the selective synthesis with larger and smaller y1, y2, and y3 using the predictors in a limited number of experiments. Although the performance of machine learning is generally discussed only in terms of prediction accuracy,58 the RMSE values are not the only important metric for evaluating the predictors; the rate of successful experiments also matters. This rate was calculated as another metric to evaluate the selective synthesis using the predictors in the real experiments (Table 2). If both the predicted (y1′, y2′, and y3′) and measured (y1′′, y2′′, and y3′′) values are larger (or smaller) than a certain threshold, the experimental trial can be regarded as successful. After setting the thresholds, the rate of successful experiments (Rs) is defined as the ratio of the number of successful experiments to the number of predicted conditions. The thresholds were set at 30% for the yield, 0.267 for the lateral size (size-reduction rate, L/L0), and 0.266 for the size distribution (LCV). The threshold of 30% for the yield was used in our previous work.25 The thresholds for the lateral size and size distribution were the average values of the training datasets (Table S1 in the ESI†). These thresholds are displayed as the red lines in Fig. 3. For example, the numbers of conditions with the predicted yield lower and higher than 30% by SpM-S were 11 and 19 (the numbers of colored plots to the left and right of the red threshold line in Fig. 3), respectively (Table 2). The numbers of successful conditions, in which the measured yield was also lower and higher than 30%, were 9 and 12 (the numbers of colored plots in the lower-left and upper-right regions relative to the red threshold lines in Fig. 3), respectively. Therefore, the Rs values were 0.818 and 0.632 for the low- and high-yield conditions, respectively. The total Rs value including both the low- and high-yield conditions was 21/30 = 0.700.
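The Rs metric can be sketched in a few lines: a condition counts as successful when the predicted and measured values fall on the same side of the threshold. The function name and the toy numbers below are hypothetical, not the experimental data.

```python
import numpy as np


def success_rate(y_pred, y_meas, threshold):
    """Rate of successful experiments Rs: a condition is a success when
    the predicted and measured values fall on the same side of the
    threshold. Returns (Rs_low, Rs_high, Rs_total)."""
    y_pred, y_meas = np.asarray(y_pred, float), np.asarray(y_meas, float)
    low = y_pred < threshold                       # predicted-low conditions
    rs_low = float(np.mean(y_meas[low] < threshold))
    rs_high = float(np.mean(y_meas[~low] >= threshold))
    rs_total = float(np.mean((y_meas < threshold) == low))
    return rs_low, rs_high, rs_total


# Toy example with a 30% yield threshold (values hypothetical).
y_pred = [10, 20, 25, 40, 50, 60]
y_meas = [15, 35, 22, 45, 28, 70]
print(success_rate(y_pred, y_meas, 30.0))  # (2/3, 2/3, 2/3) for these values
```

Applied to the SpM-S yield predictions quoted above (9 of 11 predicted-low and 12 of 19 predicted-high conditions confirmed), this definition reproduces Rs = 0.818, 0.632, and 21/30 = 0.700.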
In addition to the RMSE values, the accuracy of the prediction models was compared in terms of the Rs values. On the training datasets, the total Rs of SpM-S was lower than those of SV-R and RF-R (Table 2). In contrast, on the test datasets, the total Rs of SpM-S was higher than those of SV-R and RF-R (Table 2). Thus, SV-R and RF-R show an overtraining trend compared with SpM-S. Together with the RMSE values, Rs indicates that the models based on SpM-S are more robust and perform better. Moreover, the selective syntheses were successfully achieved by the models based on SpM-S in the real experiments. According to these results, not only the RMSE values but also other metrics, such as the Rs values, are important for the evaluation of prediction accuracy on small data.
Yield | Condition | SpM-S | SV-R | RF-R
---|---|---|---|---
Training | y1′ and y1 < 30% | 0.818 (9/11) | 0.929 (13/14) | 1.000 (16/16)
Training | y1′ and y1 > 30% | 0.632 (12/19) | 0.813 (13/16) | 1.000 (14/14)
Training | Total | 0.700 (21/30) | 0.867 (26/30) | 1.000 (30/30)
Test | y1′ and y1′′ < 30% | 0.974 (38/39) | 0.875 (14/16) | 0.750 (6/8)
Test | y1′ and y1′′ > 30% | 0.500 (20/40) | 0.302 (19/63) | 0.268 (19/71)
Test | Total | 0.734 (58/79) | 0.418 (33/79) | 0.316 (25/79)

^a The numbers in parentheses (NS/Npred) indicate the number of successful experiments (NS) and the number of predicted conditions (Npred).
The advantage of the models constructed by SpM-S was further studied with changes in the training and test data (Fig. 5). The training and test data were mixed and then divided into ten datasets (Fig. 5a). Nine of the datasets were used for training and the remaining one for validation. This ten-fold segmentalized validation of the training and test data was performed ten times with changes in the assignment of the test data. The average RMSE and its standard deviation over the ten trials were calculated for each method (Fig. 5b–d). The training and validation were performed using the same descriptors as in Fig. 4. Although the RMSE values decreased with the increase in the quantity of training data compared with those in Fig. 4, the overall trends for accuracy and generalizability did not change much in this ten-fold segmentalized validation (Fig. 5). Among the linear-regression models, while the average RMSE values of SpM-S, LASSO, and ML-R on the training datasets decreased in this order, those on the test datasets increased in the same order (arrows in Fig. 5). These results support the conclusion that a limited number of significant descriptors are selected by SpM-S from small data to avoid overfitting. In contrast, the nonlinear models showed overtraining, with large differences in the RMSE values between the training and test datasets. Therefore, the advantage of SpM-S is not specific to the original datasets, i.e. the training and test datasets in Fig. 2a and f. Moreover, SpM-S can construct generalizable and interpretable prediction models using a limited number of significant descriptors. The methodology can be applied to other small data to accelerate research activities without wasting time, money, and effort.
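The ten-fold procedure above (merge, split into ten parts, train on nine, validate on the held-out one, rotate) can be sketched with scikit-learn's `KFold`. The data here are a synthetic stand-in; the actual descriptors and regressors differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in for the merged training + test dataset.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=100)

# Each fold serves once as the held-out set while the other nine
# folds form the training set.
test_rmses = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    test_rmses.append(mean_squared_error(y[test_idx], pred) ** 0.5)

# Average RMSE and its standard deviation over the ten trials,
# as reported in Fig. 5b-d.
print(np.mean(test_rmses), np.std(test_rmses))
```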
The hyperparameter for LASSO, lambda, was determined by five-fold cross validation (CV). In the ten-fold segmentalized validation, the descriptors of the ten models were fixed as those used for the training and test data, while the coefficients and lambda were allowed to differ. In other words, the multiple linear regressions were performed ten times using the same descriptors.
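Assuming a scikit-learn workflow (where LASSO's lambda is called `alpha`), the five-fold selection of the regularization strength might look like the sketch below, on synthetic stand-in data.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 samples, 20 standardized descriptors,
# only descriptors 0 and 3 carry signal.
rng = np.random.default_rng(4)
X = StandardScaler().fit_transform(rng.normal(size=(50, 20)))
y = 1.5 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.2, size=50)

# LassoCV chooses the regularization strength (lambda, named alpha in
# scikit-learn) by internal five-fold cross validation.
model = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(model.alpha_, selected)  # chosen lambda and surviving descriptors
```

Note that the surviving set may still include a few weakly correlated extras, which is why SpM-S adds a chemistry-based screening step on top of the purely automatic selection.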
The radial basis function (RBF) kernel was used to construct the SV-R model. The hyperparameters, namely gamma, C, and epsilon, were determined by five-fold grid-search CV. Grid-search CV selects, within a range of pre-given values, the hyperparameter combination for which the model is most accurate. The hyperparameters were tuned for each entry of the ten-fold segmentalized validation.
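A minimal sketch of this five-fold grid search for an RBF-kernel SV-R follows; the grid values are illustrative assumptions, not those used in this work, and the data are synthetic.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in data.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=60)

# Five-fold grid search over the RBF-kernel hyperparameters gamma,
# C, and epsilon (illustrative grid).
param_grid = {
    "C": [0.1, 1.0, 10.0],
    "gamma": ["scale", 0.1, 1.0],
    "epsilon": [0.01, 0.1],
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_)  # the most accurate combination on the grid
```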
In RF-R, the parameters to be tuned were the maximum tree depth (max_depth) and the number of trees (n_estimators). These parameters were determined by five-fold grid-search CV in the ranges of 1–10 for max_depth and 1–500 for n_estimators. The hyperparameters were tuned for each entry of the ten-fold segmentalized validation.
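The corresponding sketch for RF-R is below; a coarse grid stands in for the full 1–10 / 1–500 ranges, and the data are again a synthetic stand-in.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with a nonlinear (interaction) ground truth.
rng = np.random.default_rng(6)
X = rng.normal(size=(60, 4))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=60)

# Five-fold grid search over tree depth and ensemble size; this coarse
# grid is an illustrative subset of the 1-10 / 1-500 ranges in the text.
param_grid = {"max_depth": [2, 4, 6, 8, 10],
              "n_estimators": [10, 100, 500]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)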
A multilayer perceptron (MLP) neural-network model was used for NN-R. The number of nodes in the hidden layer and the schedule for updating the learning rate of the weights were tuned by five-fold grid-search CV. In the present work, the number of hidden layers was set to one. The number of nodes was set in the range of 2^2–2^8. The learning-rate schedule was selected from 'invscaling', 'adaptive', and 'constant'. The hyperparameters were tuned for each entry of the ten-fold segmentalized validation.
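A sketch of this tuning with scikit-learn's `MLPRegressor`, which exposes exactly these 'constant'/'invscaling'/'adaptive' schedules, is shown below. The choice of `solver="sgd"` is an assumption on our part (the schedule only takes effect with that solver), the grid is a small subset of the 2^2–2^8 node range, and the data are synthetic.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data.
rng = np.random.default_rng(7)
X = rng.normal(size=(60, 4))
y = X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=60)

# One hidden layer; tune its width (powers of two, a subset of the
# 2^2-2^8 range) and the learning-rate schedule by five-fold CV.
param_grid = {
    "hidden_layer_sizes": [(4,), (16,), (64,)],
    "learning_rate": ["constant", "invscaling", "adaptive"],
}
search = GridSearchCV(
    MLPRegressor(solver="sgd", max_iter=2000, random_state=0),
    param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```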
Footnote
† Electronic supplementary information (ESI) available: Experimental methods, list of descriptors and all the datasets. See DOI: 10.1039/d1dd00010a |
This journal is © The Royal Society of Chemistry 2022 |