Weiran Song,a Zongyu Hou,a Muhammad Sher Afgan,a Weilun Gu,a Hui Wang,b Jiacheng Cui,a Zhe Wang*a and Yun Wang*c
aState Key Lab of Power Systems, International Joint Laboratory on Low Carbon Clean Energy Innovation, Department of Energy and Power Engineering, Tsinghua University, Beijing, 100084, China. E-mail: zhewang@tsinghua.edu.cn
bSchool of Computing and Engineering, University of West London, London, W5 5RF, UK
cRenewable Energy Resources Laboratory (RERL), Department of Mechanical and Aerospace Engineering, University of California, Irvine, CA 92697-3975, USA. E-mail: yunw@uci.edu
First published on 22nd October 2020
Laser-induced breakdown spectroscopy (LIBS), an emerging elemental analysis technique, provides a fast and low-cost solution for coal characterization without complex sample preparation. However, LIBS spectra contain a large number of uninformative variables, which reduces the predictive ability and learning speed of a multivariate model. Variable selection based on a single criterion usually leads to a lack of diversity in the selected variables; coupled with the spectral uncertainty of LIBS measurements, this can degrade the reliability and robustness of the multivariate model when analysing spectra obtained at different times and under different conditions. This work proposes a validated ensemble method for variable selection which uses six base algorithms and combines the returned variable subsets according to their cross-validation results. The proposed method is tested on two sets of LIBS spectra obtained within one month under variable experimental conditions to quantify coal properties including fixed carbon, volatile matter, ash, calorific value and sulphur. The results show that the multivariate model based on the proposed method outperforms those using benchmark variable selection algorithms in six out of the seven tasks, by 0.3%–2% in the coefficient of determination for prediction. This study suggests that variable selection based on ensemble learning improves the predictive ability and computational efficiency of the multivariate model in coal property analysis. Moreover, it can serve as a reliable choice when the user is unsure which variables to select in a LIBS application.
Traditional methods of measuring coal properties mainly rely on chemical processes such as liquefaction and carbonization. These methods are highly accurate but time-consuming and expensive, and therefore cannot meet the growing demand for real-time analysis in coal-fired power plants. Current analytical technologies for rapid analysis of coal properties are X-ray fluorescence (XRF) and prompt gamma neutron activation analysis (PGNAA).3 However, XRF is not suitable for analysing low atomic number elements such as C and H, while the neutron source of PGNAA poses potential health hazards. Moreover, both technologies are costly to install and maintain and require strictly regulated operation. Thus, low-cost technologies offering fast, simple and safe operation are urgently needed for multi-element analysis of coal.
Laser-induced breakdown spectroscopy (LIBS) is an emerging atomic emission spectroscopic technology that determines the elemental composition and concentration based on the spectrum of laser-generated plasma. LIBS has been widely used to study coal properties due to its advantages of rapidness, minimal sample preparation and simultaneous determination of multiple elements.3,4 To determine the relationship between LIBS spectra and the investigated coal property, the simplest method is univariate analysis which manually selects one spectral line corresponding to a specific element based on theoretical or empirical knowledge.4,5 However, this method often yields unsatisfactory performance because uncontrollable fluctuations of experimental parameters and matrix effects can distort the relationship between the intensity and concentration.6 Another method is multivariate analysis which utilises the entire spectrum or a subset of variables (also called features) to extract quantitative information.7 When spectral signatures overlap due to inter-element interference,8 multivariate methods such as partial least squares (PLS) and its variants can easily outperform conventional univariate models in coal property analysis.9
LIBS data usually contain several tens of thousands of variables due to the wide wavelength range, high spectral resolution and complex sample components.10,11 A large proportion of these variables are redundant and irrelevant: (i) a variable may be highly correlated with one or more of the other variables, and (ii) a variable may not be correlated with the predictive response.7 Although the use of chemometrics and machine learning models has become an analytical routine for processing LIBS data,12,13 the large numbers of redundant and irrelevant variables can significantly reduce the predictive ability and computational efficiency of the models.14 Therefore, variable selection is generally performed prior to multivariate analysis of LIBS data. For the analysis of coal properties, recent studies have attempted using several model-based variable selection methods that identify important variables based on the result of specific algorithms. Li et al. employed a genetic algorithm (GA) and synergy interval partial least squares (siPLS) to improve the coefficient of determination (R2) and the root mean square errors (RMSE) of calorific value measurement from 0.9851 to 0.9936 and from 0.7310 MJ kg−1 to 0.4580 MJ kg−1 respectively, in prediction.15 Yan et al. applied particle swarm optimization (PSO) to select 1000–4000 variables from 27620-dimensional LIBS data, which assisted a kernel extreme learning machine (K-ELM) to achieve better results than using characteristic lines to determine the volatile matter and calorific value.16 Besides, variable selection methods such as competitive adaptive reweighted sampling (CARS) and the successive projection algorithm (SPA) were combined with PLS to quantify nitrogen and sulphur contents.17
Variable selection provides an effective way to improve the modelling performance and learning speed without prior knowledge of the domain. Nevertheless, selecting an appropriate subset of variables remains difficult from the user perspective. Variable selection is traditionally based on a single criterion which may not deal with cases with different variances and other statistical properties equally well.18,19 When the analytical task or the composition of input data changes, using a single variable selector can yield less satisfactory results due to the lack of diversity in the selected variables.20 For example, K-ELM with PSO has better prediction performance than K-ELM based on characteristic lines for volatile matter and calorific value characterization, while the latter is more accurate than the former for quantifying the ash content.16 Furthermore, matrix effects and fluctuating experimental parameters often lead to a high degree of spectral uncertainty in long-term LIBS measurements.3,21 The intensity of spectral lines at the same wavelength can significantly change over time even for the same sample. Under such circumstances, some useful variables may be weakly related to the response according to a specific variable selection criterion. By discarding these spectral lines, the reliability and robustness of the multivariate model will be reduced when analysing new data obtained at different times and conditions.
In this work, we aim to improve the predictive capability and computational efficiency of the multivariate model for LIBS data analysis through a combination of variable selection and ensemble learning. We propose a validation-based ensemble method for variable selection which adopts six variable selection algorithms as base selectors to ensure the diversity of the selected variables. This method selects variables by aggregating the variables selected by the base selectors based on the cross-validation results of PLS regression. This method is evaluated by experiments using LIBS spectra collected within a month under fluctuating experimental parameters. A PLS regression model is constructed to quantify coal properties, including fixed carbon, volatile matter, ash, calorific value and sulphur content. The experimental result shows that the proposed method outperforms the benchmark methods in six out of the seven tasks. Thus, variable selection based on the ensemble strategy can improve the predictive ability and computational efficiency of the multivariate model in coal property analysis. Moreover, it can be used as a reliable method when the user is not sure which variable selection algorithm to use for a given task.
y = Xβ + e    (1)
• Regression coefficients (RC): the regression vector β provides a straightforward measure of the importance of each variable. Variables whose RC values are close to zero can be eliminated, while variables with large absolute RC values are considered important. The main disadvantage of RC is that it does not consider the combined effect of variables.22
• Least absolute shrinkage and selection operator (LASSO):23 the LASSO method is a popular penalized regression approach, which minimizes the residual sum of squares and a penalty term. The LASSO estimator is defined as
β̂_LASSO = argmin_β {‖y − Xβ‖₂² + λ‖β‖₁}    (2)
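A sketch of LASSO-based selection, using scikit-learn's `LassoCV` to choose the penalty weight by cross-validation (an assumed implementation, not the paper's):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def select_by_lasso(X, y, cv=5):
    """LASSO selection sketch: cross-validate the penalty weight, then keep
    the variables whose coefficients remain non-zero after shrinkage."""
    model = LassoCV(cv=cv, max_iter=50000, random_state=0).fit(X, y)
    return np.flatnonzero(model.coef_)
```

Unlike RC, the subset size is not fixed in advance: it is set implicitly by the penalty strength.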
• Competitive adaptive reweighted sampling (CARS):26 the CARS method is an efficient variable selection method that follows the principle of “survival of the fittest”. Each set of wavelength variables will compete with each other through their own weight, the sets of large weights will remain and the rest will be removed. CARS first calculates the absolute values of PLS regression coefficients to evaluate the importance of each variable. Then N subsets from N Monte Carlo sampling runs are sequentially obtained based on the importance of each variable. A two-step variable reduction procedure (exponentially decreasing function and adaptive reweighted sampling) is employed to select important variables. Finally, by comparing the root mean square errors of cross-validation (RMSECV), the subset with the minimum RMSECV is selected. It is noted that CARS often selects a small number of variables and is unstable, which limits its use on specific types of data.27
• Recursive weighted partial least squares (rPLS):28 the rPLS method iteratively reweights the variables to magnify important variables based on the process of repeated PLS modelling. The new input data XR are updated as
X_R = X_{R−1} × diag(β)    (3)
• Significance multivariate correlation (sMC):29 the sMC method statistically determines the importance of a variable by combining regression variance and residual variance from the PLS regression model. To prevent the influence of irrelevant information contained in X, the calculation is performed based on the target projection (TP) without orthogonal variance decomposition,30 which can be represented as
X = X_sMC + E_sMC = t_TP w_TP^T + E_sMC    (4)
sMC_i = MS_i,regression/MS_i,residual = ‖x̂_sMC,i‖²/(‖e_sMC,i‖²/(n − 2))    (5)
The main drawback of sMC is that it does not highlight the most important variables for interpreting models.31
• Minimum redundancy maximum relevance (mRMR):32 the mRMR uses mutual information to select a variable subset that has the highest relevance with the response y, subject to the constraint that selected variables are maximally dissimilar to each other.
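A greedy mRMR sketch using mutual information estimates from scikit-learn (the greedy difference criterion is the standard mRMR form; the function and defaults are illustrative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mrmr_select(X, y, n_select=10, random_state=0):
    """Greedy mRMR sketch: at each step pick the variable maximising
    relevance to y minus mean redundancy with already-selected variables,
    both measured by mutual information."""
    relevance = mutual_info_regression(X, y, random_state=random_state)
    selected = [int(np.argmax(relevance))]        # start from the most relevant
    candidates = set(range(X.shape[1])) - set(selected)
    while len(selected) < n_select and candidates:
        best, best_score = None, -np.inf
        for j in candidates:
            redundancy = mutual_info_regression(
                X[:, selected], X[:, j], random_state=random_state).mean()
            score = relevance[j] - redundancy     # relevance minus redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        candidates.remove(best)
    return np.array(selected)
```

The greedy search is O(p²) in mutual-information evaluations, so for real LIBS dimensionality a pre-filtering step is usually applied first.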
The combination of results (subsets of variables or rankings of all variables) to obtain a final output is a crucial point in ensemble variable selection.37 The most typical way of combining subsets of variables is to calculate their intersection or union while combining the rankings of all variables includes minimum, maximum and mean calculations.20 The general evaluation of ensemble variable selection is based on the prediction performance of the final learner.
(1) Base selections: the proposed method is based on a heterogeneous ensemble, which individually adopts base selectors including RC, LASSO, CARS, rPLS, sMC and mRMR on training data.
(2) Validation: the subsets of variables returned by the six base selectors are fed to PLS regression to determine the rankings of the selectors according to the corresponding RMSECV values.
(3) Partial aggregation: based on the above rankings, we merge the subsets obtained by the first three selectors into the final variable subset.
A heterogeneous ensemble is a suitable option when the sample size of data is small or the user is not sure which available selector to use.20 Generally, the number of coal samples in many LIBS studies is small due to the high complexity and expense of conventional analysis.15,16 This, coupled with different data splitting, may lead to ranking fluctuations of different selectors. Instead of computing the intersection and the union of subsets, the proposed method partially combines subsets based on the results of 5-fold cross-validation. On one hand, it avoids an extremely small number of variables in the intersection set, which may lead to performance degradation in practice.37 On the other hand, the partial combination can significantly reduce the number of variables compared to the union of subsets. This work adopts the first three selectors to balance the prediction performance and the number of selected variables. The subset of variables determined by the proposed method is then used to construct a PLS regression model.
Input: predictor variables X, response y, the number of latent variables k
Output: T (X-score matrix), P (X-loading matrix), q (y-loading vector), regression coefficients β

1. Compute the loading weights: w = Xᵀy
2. Normalize the weights: w = w/‖w‖
3. Compute the score of X: t = Xw
4. Compute the loading vector of X: p = Xᵀt/(tᵀt)
5. Compute the loading vector of y: q = yᵀt/(tᵀt)
6. Deflate X and y: X = X − tpᵀ, y = y − tq
7. Store w, t, p and q in W, T, P and q, respectively
8. Return to step 1 until k latent variables have been extracted
9. Compute the regression coefficients: β = W(PᵀW)⁻¹q
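The NIPALS steps above translate directly into code. The sketch below mean-centres X and y first and uses the standard PLS1 closed form β = W(PᵀW)⁻¹q for the final coefficients; it is a didactic implementation, not the one used in the paper:

```python
import numpy as np

def pls1_nipals(X, y, k):
    """PLS1 via the NIPALS steps listed above; X and y are mean-centred first.
    Returns the score matrix T, loadings P, y-loadings q and coefficients beta."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n, p = Xc.shape
    W = np.zeros((p, k)); T = np.zeros((n, k)); P = np.zeros((p, k)); q = np.zeros(k)
    for a in range(k):
        w = Xc.T @ yc                    # step 1: loading weights
        w = w / np.linalg.norm(w)        # step 2: normalise
        t = Xc @ w                       # step 3: X-scores
        pa = Xc.T @ t / (t @ t)          # step 4: X-loadings
        qa = yc @ t / (t @ t)            # step 5: y-loading
        Xc = Xc - np.outer(t, pa)        # step 6: deflate X
        yc = yc - t * qa                 #          and y
        W[:, a], T[:, a], P[:, a], q[a] = w, t, pa, qa   # step 7
    beta = W @ np.linalg.inv(P.T @ W) @ q                # step 9
    return T, P, q, beta
```

Predictions for new spectra are then (X_new − x̄)β + ȳ, with x̄ and ȳ the training means.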
The optimal number of LVs in PLS regression, RC, CARS and rPLS was searched from 1 to 10 to avoid overfitting. Since the number of Monte Carlo sampling runs in CARS does not significantly influence the prediction performance, it was set to 100 for simplicity. The significance level of sMC was set to the default value of 0.01. The number of variables selected by RC and mRMR was fixed at 500 to achieve a good trade-off between prediction performance and the number of variables. Additionally, the base and proposed selectors were compared to a conventional method of processing LIBS spectra, which identifies spectral peaks, subtracts the baseline and calculates peak areas.
The coefficient of determination (R2) and the RMSE of calibration, cross-validation and prediction were used to evaluate the performance of regression models:
R² = 1 − Σ_i(ŷ_i − y_i)²/Σ_i(y_i − ȳ)²    (6)

RMSE = [Σ_i(ŷ_i − y_i)²/n]^(1/2)    (7)

where ŷ_i and y_i are the predicted and measured values of the i-th sample, ȳ is the mean of the measured values and n is the number of samples.
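Eqns (6) and (7) are one-liners in code; a small sketch for completeness (the trailing underscore avoids shadowing scikit-learn's `r2_score`):

```python
import numpy as np

def r2_score_(y_true, y_pred):
    """Coefficient of determination, eqn (6)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error, eqn (7)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```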
|  |  | Min | Max | Mean | STD |
|---|---|---|---|---|---|
| Group-1 | Carbon-1 (%) | 55.42 | 85.56 | 66.36 | 6.26 |
|  | Volatile matter-1 (%) | 2.63 | 32.97 | 22.21 | 6.75 |
| Group-2 | Carbon-2 (%) | 23.97 | 53.07 | 40.01 | 4.92 |
|  | Volatile matter-2 (%) | 16.72 | 33.15 | 24.02 | 3.40 |
|  | Ash (%) | 8.50 | 57.76 | 34.34 | 8.86 |
|  | Calorific value (MJ kg−1) | 10.84 | 27.74 | 20.12 | 2.73 |
|  | Sulphur (%) | 0.30 | 2.63 | 1.09 | 0.72 |
The results of PLS regression on the two LIBS data sets, using all variables and the subsets selected by different methods, are shown in Table 3. The proposed ensemble method improves the predictive ability of PLS regression in all tasks and outperforms the baseline methods except for determining volatile matter in the second group. The average RMSEP of the carbon-1, volatile matter-1, carbon-2, ash, calorific value and sulphur measurements is 1.17%, 0.95%, 1.44%, 1.76%, 0.84 MJ kg−1 and 0.12%, respectively. Among all methods, LASSO ranks second in four out of the seven tasks, followed by RC, which achieves the best result in the volatile matter-2 task (RMSEP = 0.85%, R²P = 0.93). The PLS regression model combined with the peak area yields unsatisfactory prediction performance. The variables selected based on the peak area are intuitive for chemical interpretation; however, some important variables with low intensities may be ignored, and some of the selected variables may not be related to the response. The calibration and 5-fold cross-validation results of the different methods in the seven tasks are given in Table S2.†
|  | Carbon-1 (%) |  | Volatile matter-1 (%) |  | Carbon-2 (%) |  | Volatile matter-2 (%) |  |
|---|---|---|---|---|---|---|---|---|
|  | RMSEP | R²P | RMSEP | R²P | RMSEP | R²P | RMSEP | R²P |
| All variables | 1.2332 | 0.9597 | 1.0230 | 0.9766 | 1.4793 | 0.9052 | 0.8710 | 0.9272 |
| Peak area | 1.7602 | 0.9179 | 1.6000 | 0.9433 | 1.5659 | 0.8935 | 0.9436 | 0.9150 |
| RC | 1.2181 | 0.9606 | 1.0029 | 0.9774 | 1.4945 | 0.9032 | 0.8534 | 0.9303 |
| LASSO | 1.1870 | 0.9628 | 0.9698 | 0.9789 | 1.5082 | 0.9018 | 0.8742 | 0.9266 |
| CARS | 1.2301 | 0.9598 | 1.0522 | 0.9752 | 1.5943 | 0.8896 | 0.8838 | 0.9250 |
| rPLS | 1.2109 | 0.9612 | 1.0047 | 0.9774 | 1.5032 | 0.9024 | 0.8871 | 0.9249 |
| sMC | 1.2072 | 0.9613 | 0.9909 | 0.9781 | 1.7488 | 0.8685 | 0.8802 | 0.9265 |
| mRMR | 1.2954 | 0.9556 | 1.0474 | 0.9756 | 1.6012 | 0.8885 | 0.9016 | 0.9218 |
| Ensemble | 1.1711 | 0.9636 | 0.9474 | 0.9799 | 1.4434 | 0.9095 | 0.8584 | 0.9295 |
|  | Ash (%) |  | Calorific value (MJ kg−1) |  | Sulphur (%) |  |
|---|---|---|---|---|---|---|
|  | RMSEP | R²P | RMSEP | R²P | RMSEP | R²P |
| All variables | 1.8771 | 0.9547 | 0.8523 | 0.9019 | 0.1447 | 0.9572 |
| Peak area | 2.0592 | 0.9454 | 0.8682 | 0.8983 | 0.1377 | 0.9610 |
| RC | 1.8513 | 0.9560 | 0.8518 | 0.9011 | 0.1305 | 0.9654 |
| LASSO | 1.8240 | 0.9571 | 0.8583 | 0.9003 | 0.1246 | 0.9685 |
| CARS | 1.9359 | 0.9518 | 0.9197 | 0.8852 | 0.1308 | 0.9652 |
| rPLS | 1.8488 | 0.9559 | 0.8565 | 0.9002 | 0.1322 | 0.9643 |
| sMC | 1.9176 | 0.9525 | 0.9291 | 0.8838 | 0.1372 | 0.9615 |
| mRMR | 2.0286 | 0.9471 | 0.8924 | 0.8918 | 0.1339 | 0.9634 |
| Ensemble | 1.7618 | 0.9600 | 0.8436 | 0.9034 | 0.1231 | 0.9692 |
The RMSEP values of the different methods over the 50 runs are depicted as boxplots in Fig. 3. The median RMSEP of the proposed method is the lowest among all methods in six out of the seven tasks. Furthermore, the standard deviations of the proposed method for the volatile matter-1 (0.05), ash (0.17) and sulphur (0.01) measurements are smaller than those of the other methods (see Table S3†), which demonstrates the robustness of ensemble variable selection in specific tasks.
|  | Peak area | RC | LASSO | CARS | rPLS | sMC | mRMR | Ensemble |
|---|---|---|---|---|---|---|---|---|
| Carbon-1 | 500 | 500 | 215.88 | 217 | 594.62 | 748.74 | 500 | 699.46 |
| Volatile matter-1 | 500 | 500 | 215.32 | 147.62 | 476.14 | 1543.32 | 500 | 479.56 |
| Carbon-2 | 500 | 500 | 123.66 | 115.3 | 541.32 | 86.26 | 500 | 485.64 |
| Volatile matter-2 | 500 | 500 | 123.28 | 74.74 | 684.3 | 147.86 | 500 | 250.3 |
| Ash | 500 | 500 | 123.92 | 111.56 | 631 | 201.86 | 500 | 365.14 |
| Calorific value | 500 | 500 | 123.5 | 120.56 | 220.94 | 100.24 | 500 | 408.34 |
| Sulphur | 500 | 500 | 123.52 | 95.72 | 934.22 | 138.16 | 500 | 399.58 |
Fig. 4 displays the frequency with which each wavelength was selected by the different methods in the two carbon measurement tasks. The proposed method selects the wavelength of 247.87 nm, related to C I (247.877 nm), in 42 and 17 of the 50 runs for the two sets, respectively. The wavelengths of 473.67 nm and 473.6 nm have the second highest frequencies (32 and 25, respectively) for the first set, corresponding to the C2 line (473.622 nm). Moreover, the wavelength of 386.06 nm, which may be related to CN (386.143 nm), is selected 24 times for the second set. The frequencies of the wavelengths corresponding to the other characteristic carbon lines do not exceed 10 in most cases. The frequencies of all wavelengths for the volatile matter, ash, calorific value and sulphur measurements are presented in Fig. S1.†
The wavelengths with the top frequencies determined by the proposed method are then compared with the specific wavelengths of the peak area method, as shown in Table S4.† Some of these wavelengths correspond to strong spectral peaks, while others have low intensities and are easily overlooked by the peak area method. According to the NIST atomic spectra database, possible wavelengths corresponding to coal properties include 247.87 nm (C I for carbon measurement), 404.79 nm (Fe I for ash measurement) and 589.00 nm (the strong Na I line overlapping with the weak S I line for the sulphur content). Besides, it is noted that many high-frequency wavelengths identified by the proposed method and the base methods are not directly related to the investigated coal properties. For example, the wavelengths of 742.38 nm (N I), 324.8 nm (Cu I) and 404.79 nm (Fe I) are also selected with high frequency for carbon measurement, in addition to C I 247.87 nm, due to matrix effects.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d0ja00386g
This journal is © The Royal Society of Chemistry 2021 |