On feature selection for supervised learning problems involving high-dimensional analytical information
Abstract
Several computational methods were applied to feature selection for supervised learning problems that can be encountered in the field of analytical chemistry, namely the Genetic Algorithm (GA), the Firefly Algorithm (FA), Particle Swarm Optimization (PSO), the Least Absolute Shrinkage and Selection Operator (LASSO), Least Angle Regression (LARS), interval Partial Least Squares (iPLS), sparse PLS (sPLS), and Uninformative Variable Elimination-PLS (UVE-PLS). The methods were compared in two case studies covering both supervised learning cases: (i) regression, the multivariate calibration of soil carbonate content from Fourier transform mid-infrared (FT-MIR) spectral information, and (ii) classification, the diagnosis of prostate cancer patients from gene expression information. Besides the quantitative performance measures commonly used in feature selection studies, error and accuracy, a qualitative measure, the selection index (SI), was introduced to evaluate the methods in terms of the quality of the selected features. Robustness was evaluated by introducing artificially generated noise variables into both datasets. The first case study showed that, in order of decreasing predictive ability and robustness, GA > FA ≈ PSO > LASSO > LARS (errors of 1.775, 4.504, 4.055, 10.085, and 10.510 mg g−1) are recommended for regression involving spectral information. In the second case study, the trend GA > PSO > FA ≈ LASSO > LARS (accuracies of 100, 95.12, and 90.24%) was observed. Strong robustness was observed in the regression case: SI did not decrease for GA, while it decreased from 28.85 to 10.26% for FA and from 36.11 to 21.05% for PSO. In the classification case, only LARS exhibited a considerable decrease in accuracy upon the introduction of noise features. The major sources of error were identified and mostly originated from the analytical methods themselves, confirming the strong applicability of the evaluated feature selection methods.
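The robustness protocol described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual pipeline: it assumes SI is the percentage of selected variables that are truly informative (the paper's exact SI definition may differ), and it substitutes a simple correlation-based filter for the evaluated selection methods. Artificial noise variables are appended to a synthetic dataset, selection is rerun, and SI is recomputed.

```python
import random

def selection_index(selected, informative):
    """Assumed SI for illustration: percentage of the selected
    variables that belong to the known informative set."""
    if not selected:
        return 0.0
    return 100.0 * len(set(selected) & set(informative)) / len(selected)

def top_k_by_correlation(X, y, k):
    """Stand-in selector: rank variables by |Pearson r| with y, keep top k."""
    n = len(y)
    my = sum(y) / n
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mx = sum(col) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
        sx = sum((a - mx) ** 2 for a in col) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        r = cov / (sx * sy) if sx and sy else 0.0
        scores.append((abs(r), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

random.seed(0)
n, p_informative, p_noise = 200, 5, 20
informative = list(range(p_informative))
X = [[random.gauss(0, 1) for _ in range(p_informative)] for _ in range(n)]
y = [sum(row) + random.gauss(0, 0.1) for row in X]

# Robustness check: append artificially generated noise variables,
# as done for both datasets in the study, then reselect and score.
X_noisy = [row + [random.gauss(0, 1) for _ in range(p_noise)] for row in X]
selected = top_k_by_correlation(X_noisy, y, k=p_informative)
print(selection_index(selected, informative))
```

A robust selector keeps a high SI after the noise variables are added; a fragile one starts admitting noise columns into the selected set, which is the drop the abstract reports for FA and PSO.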