Hanqiu Wang†a,
Aybek Rehmetulla†b,
Shanshan Guoa,
Xin Konga,
Zhiwei Lüa,
Yu Guana,
Cong Xua,
Kaiser Sulaimanb,
Gongxiang Wei*a and
Huiqiang Liu*a
aSchool of Physics and Optoelectronic Engineering, Shandong University of Technology, Zibo 255000, China. E-mail: liuhq@sdut.edu.cn; weigx@sdut.edu.cn
bInstitute of Traditional Chinese Medicine and Ethnomedicine, Xinjiang Uyghur Autonomous Region, Urumqi 830002, China
First published on 12th April 2022
A single feature set is often unable to effectively classify complex biological samples due to their similar morphology and sizes. This paper proposes a protocol for the fast identification of seed medicinal materials based on micro-structural and infrared spectroscopic characteristics. Three feature datasets (micro-CT, FTIR, and mixed) were established via principal component analysis (PCA) and competitive adaptive reweighted sampling (CARS) and then used to train a back-propagation (BP) neural network. The mixed dataset combines 34-dimensional micro-CT eigenvalues with 13-dimensional FTIR eigenvalues. The results showed that the classification accuracy reached 89.5% for the micro-CT dataset and 93.3% for the FTIR dataset, while the mixed dataset achieved 99.2%, much higher than either of the traditional single-feature datasets. This study provides a new protocol for multi-dimensional characteristic architecture with excellent performance for the classification and identification of Chinese medicinal materials.
Current medical botanical identification is mainly based on imaging, spectroscopic, and biochemical techniques as well as manual classification. Examples include the identification of six species of Litsea Lam by scanning electron microscopy,1 the establishment of a spectral fingerprint of Swertia mileensis for its quality control,2 the identification of ginseng from China and Korea by isotope tracing,3 and the identification of fake Semiaquilegia adoxoides by biomolecular methods.4 However, these methods are costly and require complex sample preparation and handling procedures. Infrared spectroscopic techniques such as visible/near-infrared spectroscopy (VIS/NIR), near-infrared spectroscopy (NIR), mid-infrared spectroscopy (MIR), and hyperspectral imaging (HSI) can be used to analyse the species and places of origin of plants. For instance, VIS/NIR is used to distinguish crop growing regions and detect nutritional indicators.5,6 Since the spectra contain chemical information about the plant, they can reflect differences between plants of different varieties and geographical origins, as in food quality testing using spectroscopic methods.7,8 X-ray computed microtomography (micro-CT) is a fast, non-destructive, multi-scale testing technique widely used in botany, biomedicine, food testing and security. Examples include the identification of New Zealand plant leaf material from artefacts using micro-CT,9 imaging of the mouse heart using micro-CT and eXIA-160 contrast agents,10 the study of micro-deformations of breadcrumbs under severe compression,11 testing of cellular food products,12 and the evaluation of food microstructure.13 Owing to the combination of the high brightness and adjustable energy of synchrotron radiation with micro-CT, high-resolution 3-dimensional X-ray phase contrast imaging techniques are being increasingly employed in plant physiology and in situ structural identification. 
For example, cavitation and water recharge processes in rice and bamboo leaves were investigated successfully,14 feature tissues and calcium oxalates of wild ginseng were revealed,15,16 and a phase-attenuation-duality-based phase retrieval algorithm effectively improved the density resolution of weakly absorbing materials and successfully visualized the morphological characteristics of eaglewood specimens at different stages.17,18
Machine learning can solve real-world problems by extracting various features based on imaging, such as gray level co-occurrence matrix (GLCM), gray gradient co-occurrence matrix (GGCM) and Tamura texture values,19,20 mass spectrometry, and chromatography.21 Recently, it has been introduced to medical botanical identification and classification, for example in the identification of bitter almond and peach kernel herbs of different origins using neural networks,22 the identification of licorice using support vector machines (SVMs) and random forests,23 and the accuracy comparison of different classifiers of plant specimens.24 Principal component analysis (PCA) is commonly used as a dimension reduction tool for machine learning and can significantly reduce training time.
In this study, we propose a new protocol to develop a multi-dimensional intelligent method for identifying highly similar seed herbs. A mixed feature dataset was established by combining micro-CT structural eigenvalues (GLCM, GGCM and Tamura texture values) with FTIR spectroscopic eigenvalues obtained by the vibration absorption feature peak (VAFP) technique, optimized by PCA and competitive adaptive reweighted sampling (CARS) methods.25 Then, a classifier based on a BP neural network was introduced and trained. The classification accuracy of the mixed dataset is much higher than that of the traditional single-feature datasets. This protocol has great potential for other classification problems in medical botany.
The post-processing sample data were collected via feature extraction, combination and classification as shown in the flow diagram in Fig. 2. Both modality datasets were compared quantitatively. For the SR-XPCT dataset, the structural features were extracted via the GLCM, GGCM and Tamura texture methods. For the SR-FTIR dataset, the spectral features of the VAFPs were obtained through the CARS method, which is applied to the selection of the optimal subset of wavelengths. Its workflow usually includes four successive steps: Monte Carlo model sampling, forced wavelength reduction by an exponential decrement function (EDF), adaptive re-weighted sampling (ARS), and calculation of RMSECV for each subset. The mixed feature dataset was built based on both modality features and optimized by PCA.
Fig. 2 Schematic diagram of the multi-dimensional sample data acquisition and analysis system based on SR-XPCT and SR-FTIR in SSRF and the experimental flow chart.
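The forced wavelength reduction step of CARS mentioned above can be illustrated with a short sketch. The source does not state the exponential decrement function explicitly; the form below, r_i = a·exp(−k·i) with constants chosen so that all variables are kept in the first run and two in the last, is the one commonly given for CARS and is an assumption here. The function names and the variable count of 1000 are hypothetical; only the 50 sampling runs come from the text.

```python
import math

def cars_retention_ratios(n_vars, n_runs):
    """Fraction of variables retained at sampling runs i = 1..n_runs.

    Assumed EDF form: r_i = a * exp(-k * i), with a and k chosen so that
    r_1 = 1 (all variables kept) and r_N = 2 / n_vars (two variables kept).
    """
    a = (n_vars / 2) ** (1 / (n_runs - 1))
    k = math.log(n_vars / 2) / (n_runs - 1)
    return [a * math.exp(-k * i) for i in range(1, n_runs + 1)]

def cars_kept_counts(n_vars, n_runs):
    """Number of variables the EDF forces each run to keep (at least 2)."""
    ratios = cars_retention_ratios(n_vars, n_runs)
    return [max(2, round(r * n_vars)) for r in ratios]

# Hypothetical spectrum with 1000 wavelength variables, 50 sampling runs.
ratios = cars_retention_ratios(1000, 50)
counts = cars_kept_counts(1000, 50)
```

The EDF only sets how many variables survive each run; which ones survive is decided by the adaptive re-weighted sampling step, and the final subset is the one with the lowest RMSECV.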
A 3-layer BPNN (input, hidden, and output layers) was used in this study; its compact structure avoids long training times and reduces the tendency to overfit. The number of nodes in the hidden layer is essential to the classification accuracy and can be determined by the empirical formula $h = \sqrt{m + n} + a$, where h is the number of nodes in the hidden layer, m and n are the number of nodes in the input and output layers, respectively, and a is an adjusting integer constant between 1 and 10. In this study, the number of nodes in the hidden layer was 11, and three feature datasets (SR-XPCT, SR-FTIR, and mixed feature datasets) were used to train our BPNN using the Weka software (version 3.6.15). Here, 70%, 15%, and 15% of the data points were randomly assigned to the training set, test set and validation set, respectively. The performance of the BPNN was evaluated with four assessment criteria, as shown in Table 1. TP, TN, FP and FN represent true positive, true negative, false positive and false negative, respectively. Accuracy is the most intuitive performance measurement: it is the ratio of the number of correctly predicted samples to the total number of samples. Precision is the ratio of correctly predicted positive samples to all predicted positive samples. Recall is the ratio of correctly predicted positive samples to all actual positive samples. F1 is the harmonic mean of precision and recall. The closer this value is to 1, the more accurate the model's predictions.
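The four assessment criteria can be written out directly from the TP, TN, FP and FN counts. The following is a minimal sketch using the standard definitions, with F1 computed as the harmonic mean of precision and recall; the function name and the example counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion counts.

    Zero denominators yield 0.0 rather than raising an error.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for one class treated as "positive" (one-vs-rest).
metrics = binary_metrics(tp=8, tn=80, fp=2, fn=10)
```

For a multi-class problem such as the seven seed classes, these quantities are typically computed per class in a one-vs-rest fashion and then averaged.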
The CARS method was employed to obtain the optimal variable number (M) of VAFPs using an iterative algorithm, as shown in Fig. 4(a). The number of sampling runs (N) can be determined by locating the lowest RMSECV value. In this study, the initialization parameters of CARS were A = 10 (maximum number of principal components), K = 10 (k-fold cross-validation), and N = 50 (maximum number of Monte Carlo sampling runs). The values of RMSECV fluctuated and approached zero at a given N, as shown in Fig. 4(b). The regression coefficient path reflected how the selected eigenvalues evolved with increasing number of sampling runs, as shown in Fig. 4(c). The CARS analysis showed that the RMSECV was minimal at the 49th sampling run, at which point 13 points with high specificity remained. CARS thus optimized the data model, in particular improving the identification of similarly structured samples. Fig. 4(d) presents the FTIR spectroscopic distribution and selected eigenvalues of the 7-species VAFPs.
The mixed feature dataset was built by combining the optimal SR-XPCT and SR-FTIR eigenvalues. PCA was applied to reduce the SR-XPCT and mixed feature sets to 5-dimensional and 8-dimensional feature sets, respectively, as shown in Fig. 5(a). Fig. 5(b) and (c) show the selected principal components and the data concentration, respectively, for the SR-XPCT and mixed feature datasets. After PCA processing, the data in the mixed dataset were richer than those in the SR-XPCT dataset. The data in the SR-XPCT dataset were more scattered, with large overlap between classes, which made correct classification difficult. The data in the mixed dataset were more easily distinguishable owing to their greater concentration. The sample numbers of the three datasets are listed in Table 2. The total number of samples was 6642; across all species, 4650 samples were randomly selected as the training set, 996 as the validation set, and 996 as the test set.
Class name | SR-XPCT A | SR-XPCT B | SR-XPCT C | SR-FTIR A | SR-FTIR B | SR-FTIR C | Mixed A | Mixed B | Mixed C |
---|---|---|---|---|---|---|---|---|---|
Class_1 | 750 | 125 | 125 | 740 | 130 | 130 | 730 | 135 | 135 |
Class_2 | 609 | 136 | 136 | 601 | 140 | 140 | 631 | 125 | 125 |
Class_3 | 664 | 168 | 168 | 670 | 165 | 165 | 676 | 162 | 162 |
Class_4 | 494 | 120 | 120 | 502 | 116 | 116 | 498 | 118 | 118 |
Class_5 | 685 | 161 | 161 | 685 | 161 | 161 | 719 | 144 | 144 |
Class_6 | 754 | 133 | 133 | 748 | 136 | 136 | 706 | 157 | 157 |
Class_7 | 694 | 153 | 153 | 704 | 148 | 148 | 690 | 155 | 155 |
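The partition sizes reported above (4650, 996 and 996 out of 6642 samples) can be reproduced with a simple random split. The study used Weka and does not describe how the random selection was implemented, so the sketch below is only an illustration; the function name and the seed are hypothetical, and a plain shuffle like this does not reproduce the per-class counts in Table 2.

```python
import random

def split_indices(n_total, n_train, n_val, seed=0):
    """Shuffle sample indices and cut them into train / validation / test.

    The test set receives whatever remains after the first two cuts.
    """
    if n_train + n_val > n_total:
        raise ValueError("train + validation sizes exceed the total")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

# Sizes taken from the text: 6642 samples split 4650 / 996 / 996.
train, val, test = split_indices(6642, 4650, 996)
```

Matching the class-wise counts in the table would require splitting within each class separately (a stratified split with per-class sizes).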
To further assess the classification performance of the three datasets, regression analysis was employed to express the relationship between the dependent and independent variables mathematically. As shown in Fig. 8(a), the regression values of the SR-XPCT, SR-FTIR and mixed datasets were 0.849, 0.956 and 0.969, respectively. The regression value of the SR-FTIR dataset was significantly higher than that of the SR-XPCT dataset, indicating that the classification performance of the SR-FTIR dataset was better. The regression line for the mixed dataset was closest to the ideal, indicating that the model trained on the mixed dataset had the clearest mapping to the true species. Receiver operating characteristic (ROC) curves are generally used to evaluate the classification results of machine learning. The ROC curves for the test results of the three datasets are shown in Fig. 8(b). Species are misclassified in both the SR-XPCT and SR-FTIR datasets because features of a single attribute discriminate insufficiently, as for class_1 and class_5 in the SR-XPCT dataset, whose structural features are so similar that their classification accuracies are relatively low. The SR-FTIR dataset also suffers from the similarity problem, although its average accuracy is higher than that of the SR-XPCT dataset. This suggests that features of more different attributes need to be combined into the optimal feature dataset to further discriminate similar and complex samples. Therefore, the mixed feature dataset was proposed for the automatic classification of seed samples. In comparison with the above datasets, the mixed dataset exhibited better classification accuracy (up to 99.2%). This demonstrates that a multi-attribute feature dataset can effectively improve the classification accuracy for complex specimens, especially biological samples. 
Indeed, multi-dimensional imaging methods have great potential in providing sufficient raw feature data for the identification of complex systems, combining the methods of feature extraction, PCA, and classifiers.
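As a reminder of how an ROC curve is obtained from classifier outputs, the sketch below computes the (false positive rate, true positive rate) points and the area under the curve for one class treated as positive against all others. It is a generic illustration, not the procedure used in the study; the function names and the toy scores are hypothetical.

```python
def roc_points(scores, labels):
    """ROC points for binary labels (1 = positive, 0 = negative).

    The threshold sweeps from the highest score downwards; tied scores
    are processed together. Requires at least one positive and one negative.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy one-vs-rest scores for a single class.
toy_points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
toy_auc = auc(toy_points)
```

For a seven-class problem, one such curve is computed per class, which is how class-specific weaknesses such as those of class_1 and class_5 become visible.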
Since confusion matrices give the specific number of misclassifications, the confusion matrices of the three types of datasets are shown in Fig. 9. In the SR-XPCT dataset, class_1 and class_5 had a much higher number of misclassifications than the other species, consistent with Fig. 8(b), which demonstrates that the similarity of microstructural features affects the classification accuracy. For the SR-FTIR dataset, class_2, class_3, class_4 and class_6 were easily confused; this result was complementary to that of the SR-XPCT dataset and suggested that an approach combining microscopic and spectroscopic features was feasible. Our mixed dataset combining the two feature types demonstrates this approach, and all species can be distinguished with high accuracy. By taking advantage of the complementary properties of different features, the mixed dataset is superior in identifying samples of similar species.
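Reading misclassifications off a confusion matrix, as done above for class_1 and class_5, can be sketched as follows. The labels are toy values for three hypothetical classes, not data from the study, and the function names are illustrative.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(cm)]

def most_confused_pair(cm):
    """(true class, predicted class, count) of the largest off-diagonal cell."""
    n = len(cm)
    count, t, p = max((cm[i][j], i, j)
                      for i in range(n) for j in range(n) if i != j)
    return t, p, count

# Toy labels for three hypothetical classes (0, 1, 2).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, 3)
recalls = per_class_recall(cm)
worst = most_confused_pair(cm)
```

Comparing the largest off-diagonal cells across feature sets is one way to check whether two modalities confuse different class pairs, which is the complementarity argument made in the text.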
Footnote
† Equal contributors.
This journal is © The Royal Society of Chemistry 2022