Yiying Zhao,
Susu Zhu,
Chu Zhang,
Xuping Feng,
Lei Feng* and
Yong He
College of Biosystems Engineering and Food Science, Zhejiang University, 866 Yuhangtang Road, Hangzhou 310058, China. E-mail: lfeng@zju.edu.cn
First published on 3rd January 2018
Seed variety classification is important for assessing variety purity and increasing crop yield. A hyperspectral imaging system covering the spectral range of 874–1734 nm was applied for variety classification of maize seeds. A total of 12900 maize seeds of 3 different varieties were evaluated. Spectral data in the range of 975.01–1645.82 nm were extracted and preprocessed. Discriminant models were developed using a radial basis function neural network (RBFNN). The influence of calibration sample size on classification accuracy was studied. Results showed that as the calibration sample size increased, calibration accuracy varied only slightly, whereas prediction accuracy first increased and then levelled off. Accordingly, the optimal size of the calibration set was determined. Optimal wavelengths were selected using the loadings of principal components (PCs). The RBFNN model developed on the optimal wavelengths with the optimal calibration set size obtained satisfactory results, with a calibration accuracy of 93.85% and a prediction accuracy of 91.00%. A classification map of seed varieties was produced by applying this RBFNN model to the average spectrum of each sample. In addition, the procedure proposed in this study for determining the optimal sample quantity was verified with a support vector machine (SVM). The overall results indicated that hyperspectral imaging is a promising technique for variety classification of maize seeds and could help to develop a real-time detection system for maize seeds as well as other crop seeds.
Several approaches have been developed and applied for variety classification of seeds, such as morphology identification, DNA molecular marker technology, and protein electrophoresis.5–7 However, most of these traditional methods are time-consuming and require specialized instruments and skilled operators, which restricts their application to on-line and large-scale detection in the modern seed industry. To overcome these shortcomings, considerable effort has been devoted to developing fast, non-destructive and reliable methods for seed identification and classification. In this study, rapid variety discrimination based on hyperspectral imaging and chemometrics was investigated.
Hyperspectral imaging, an emerging technique that integrates spectroscopy and imaging in one system,8 has the advantage of providing external (surface and spatial) and internal quality information simultaneously. Each pixel within the image contains a spectrum covering the spectral range of the hyperspectral imaging system. By combining these spectra with the spatial position of each pixel, sample features (physical, chemical, and category) can be visualized.9 Previous studies have explored the possibility of applying hyperspectral imaging to variety classification of maize seeds. Zhang et al.10 employed hyperspectral imaging to differentiate 330 maize seeds of 6 varieties, and an optimal recognition accuracy of 98.89% was achieved by a least squares-support vector machine (LS-SVM) model based on data fusion. Wang et al.11 combined spectral data with textural features obtained from hyperspectral images to classify 400 maize seeds of 3 varieties, achieving an accuracy of 88.89%. Huang et al.4 employed hyperspectral imaging to classify 2000 maize seeds of 4 varieties harvested in different years, and the prediction accuracy of the LS-SVM model coupled with model updating reached 94.4%. Moreover, Huang et al.12 applied hyperspectral imaging to classify 1632 maize seeds of 17 varieties, and the test accuracy of the LS-SVM model based on the combination of spectral and image features coupled with feature transformation was over 90%.
The successful application of hyperspectral imaging depends greatly on establishing calibration models with high accuracy and robustness, which is particularly important for real-time and on-line detection. In general, the studies above focused mainly on effective variable extraction, data fusion, and model updating to optimize the discriminant models for maize seeds. However, the number of samples in the calibration set has also been reported to be an essential factor affecting the accuracy and robustness of calibration models.13 Kuang14 evaluated the effect of the number of fresh soil samples in the calibration set on the prediction error, expressed as the root mean square error of prediction (RMSEP), for farm-scale modelling of total nitrogen, organic carbon and moisture content. The results showed that calibration models built on a large calibration set gave lower RMSEP than those built on a small calibration set for all three soil properties investigated. To our knowledge, however, there is no specific research on how the number of maize seeds in the calibration set affects the classification ability of discriminant models. In general, a relatively large calibration set may describe sample features better and may exhibit better prediction capability than a small one, because it covers a wider range of variation in the calibration samples. However, the complexity of the discriminant models and the cost of analysis will be considerably higher. The balance between the expected classification accuracy and modelling complexity is therefore worth considering. Thus, the main purpose of this study was to evaluate the influence of calibration sample size on classification accuracy. To carry out this work properly, a large number of maize seed samples was needed. Therefore, a total of 12900 maize seed samples of 3 different varieties were collected in this research.
In all, this study was performed to achieve these objectives: (1) to assess the potential of applying hyperspectral imaging and chemometric methods for differentiating maize seeds, (2) to evaluate the influence of calibration sample size on classification accuracy, (3) to identify optimal wavelengths related to category information, (4) to develop radial basis function neural network (RBFNN) models on optimal wavelengths with the optimal size of the calibration set, and (5) to visualize the classification map of maize seeds for better purity assessment and quality monitoring.
The raw hyperspectral images of the samples were corrected using two reference standards: the white reference image and the dark reference image, both obtained under the same conditions as the sample images. The white reference image was obtained using a white Teflon bar of nearly 100% reflectance, and the dark reference image was acquired by turning off the light source and completely covering the lens with its opaque cap. The corrected image was then calculated by the following equation:
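$$I_c = \frac{I_{raw} - I_{dark}}{I_{white} - I_{dark}} \tag{1}$$

where $I_c$ is the corrected image, $I_{raw}$ is the raw hyperspectral image, $I_{white}$ is the white reference image and $I_{dark}$ is the dark reference image.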
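The white and dark reference correction described above can be sketched in a few lines. This is a minimal illustration that assumes the usual normalisation (raw minus dark, divided by white minus dark); the function name `correct_reflectance`, the `eps` guard for dead pixels and the toy intensity values are assumptions made for the example and are not taken from this study.

```python
import numpy as np

def correct_reflectance(raw, white, dark, eps=1e-12):
    """Normalise a raw image by white and dark references: (raw - dark) / (white - dark)."""
    raw = np.asarray(raw, dtype=float)
    white = np.asarray(white, dtype=float)
    dark = np.asarray(dark, dtype=float)
    denom = white - dark
    # Assumption: pixels where white == dark carry no signal and are set to 0.
    safe = np.where(np.abs(denom) < eps, np.nan, denom)
    corrected = (raw - dark) / safe
    return np.nan_to_num(corrected, nan=0.0)

# Toy cube: 2 x 2 pixels, 3 bands (synthetic intensity counts).
dark = np.full((2, 2, 3), 100.0)
white = np.full((2, 2, 3), 4100.0)
raw = np.full((2, 2, 3), 2100.0)
corrected = correct_reflectance(raw, white, dark)
```

The result has the same shape as the raw cube, so the correction can be applied before any ROI extraction.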
(1) Samples were isolated from the background and the sample region of each maize seed was defined as an ROI.
(2) Spectral data were extracted from the predefined ROI, and the average spectra of each ROI were calculated and preprocessed.
(3) Calibration models were developed on optimal wavelengths with the optimal number of samples in the calibration set.
(4) The category value of each ROI was predicted by using the corresponding average spectra and the established calibration models.
(5) By describing different varieties of maize seeds with specific colours, classification maps were formed.
The classification map presented the category information of the samples visually, which allows convenient and intuitive discrimination of seed variety in the crop seed industry, especially when a large quantity of samples has to be classified.
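Steps (1) to (5) can be sketched as follows, assuming that segmentation has already produced an integer label image (0 for background, k for the k-th seed). The colour table, the helper names and the threshold classifier `toy_predict` are hypothetical stand-ins for illustration; in particular, `toy_predict` is not the RBFNN model used in this study.

```python
import numpy as np

# Hypothetical RGB colours for varieties 1-3; background stays black.
VARIETY_COLOURS = {1: (255, 0, 0), 2: (0, 255, 0), 3: (0, 0, 255)}

def roi_mean_spectra(cube, labels):
    """Return {roi_id: mean spectrum} for every non-zero label."""
    return {int(k): cube[labels == k].mean(axis=0)
            for k in np.unique(labels) if k != 0}

def classification_map(cube, labels, predict):
    """Paint every ROI with the colour of the variety predicted from its mean spectrum."""
    out = np.zeros(labels.shape + (3,), dtype=np.uint8)
    for roi_id, spectrum in roi_mean_spectra(cube, labels).items():
        out[labels == roi_id] = VARIETY_COLOURS[predict(spectrum)]
    return out

# Synthetic example: 4 x 4 image, 2 bands, two "seeds".
labels = np.zeros((4, 4), dtype=int)
labels[0:2, 0:2] = 1
labels[2:4, 2:4] = 2
cube = np.zeros((4, 4, 2))
cube[labels == 1] = [0.2, 0.8]
cube[labels == 2] = [0.9, 0.1]

def toy_predict(spectrum):
    # Stand-in classifier: thresholds the first band only.
    return 1 if spectrum[0] < 0.5 else 3

rgb = classification_map(cube, labels, toy_predict)
```

Predicting once per ROI rather than once per pixel keeps the cost proportional to the number of seeds, which matches the use of average spectra described in the steps above.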
The average reflectance spectra of the three varieties were calculated and are presented in Fig. 1. The average spectra of the three varieties exhibited quite similar trends, but differences in reflectance values could also be seen, owing to the different composition and physicochemical characteristics of the three maize seed varieties. The peak at 1116.09 nm may correspond to the C–H groups of lipids.22 The valley at 1203.55 nm may be associated with the 2nd overtone of C–H.23 The peak at 1304.60 nm may be assigned to the combination of the 1st overtone of N–H stretching with the fundamental N–H in-plane bending and C–N stretching with N–H in-plane bending vibrations.24 The valley at 1469.95 nm may arise from the 1st overtone of N–H.23 However, variety classification could not be achieved from the differences in the average spectra alone. Therefore, latent features of the spectra needed to be extracted for variety discrimination.
Fig. 1 Average reflectance spectra of maize seeds of three varieties in the range of 975.01–1645.82 nm.
Calibration accuracy of these RBFNN models varied slightly with the size of the calibration set, but it remained over 97%. In contrast, prediction accuracy of the RBFNN models showed an increasing trend from the small calibration sets to the large ones. Prediction accuracy rose rapidly as the number of calibration samples of each variety increased from 100 to 600, whereas only a slow increase was observed between 600 and 1100 calibration samples of each variety. Surprisingly, prediction accuracy based on calibration sets containing more than 1100 samples of each variety remained stable. This indicated that the 1100 samples of each variety might already contain the information important for discrimination. Detailed classification results of the RBFNN models based on 3000 and 1100 samples of each variety in the calibration set are shown in Table 1.
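The kind of experiment described here, training on calibration sets of increasing size and scoring a fixed prediction set, can be sketched with a small RBF network. Everything below is an assumption made for illustration: the data are synthetic, the hidden-layer centres are a random subset of the calibration samples, the kernel parameter `GAMMA` is fixed by hand, and the output weights come from linear least squares. It is not the RBFNN configuration used in this study.

```python
import numpy as np

def rbf_features(X, centres, gamma):
    # Gaussian activation exp(-gamma * ||x - c||^2) for every sample/centre pair.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def fit_rbfnn(X, y, n_centres, gamma, rng, n_classes=3):
    # Assumption: centres are a random subset of the calibration samples.
    centres = X[rng.choice(len(X), size=n_centres, replace=False)]
    hidden = rbf_features(X, centres, gamma)
    targets = np.eye(n_classes)[y]  # one-hot class labels
    # Output-layer weights by linear least squares.
    weights, *_ = np.linalg.lstsq(hidden, targets, rcond=None)
    return centres, weights

def predict_rbfnn(model, X, gamma):
    centres, weights = model
    return (rbf_features(X, centres, gamma) @ weights).argmax(axis=1)

def make_toy_spectra(n_per_class, rng, n_bands=10):
    # Synthetic data only: each class is a constant offset plus Gaussian noise.
    X = np.vstack([0.5 * c + rng.normal(0.0, 0.3, (n_per_class, n_bands))
                   for c in range(3)])
    y = np.repeat(np.arange(3), n_per_class)
    return X, y

rng = np.random.default_rng(0)
X_cal, y_cal = make_toy_spectra(300, rng)
X_pred, y_pred = make_toy_spectra(100, rng)

GAMMA = 0.5  # hypothetical kernel parameter
curve = {}
for size in (10, 50, 100, 300):
    # First `size` calibration samples of each class; the prediction set stays fixed.
    idx = np.concatenate([np.flatnonzero(y_cal == c)[:size] for c in range(3)])
    model = fit_rbfnn(X_cal[idx], y_cal[idx], n_centres=20, gamma=GAMMA, rng=rng)
    curve[size] = float((predict_rbfnn(model, X_pred, GAMMA) == y_pred).mean())
```

Keeping the prediction set fixed while only the calibration set grows is what makes the accuracies in `curve` comparable across sizes.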
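Table 1 Classification results of RBFNN models on the full spectra with 3000 and 1100 samples of each variety in the calibration set (rows: actual variety; columns: assigned variety)

Samples of each variety in the calibration set | Actual variety | Calibration: assigned to 1 | Calibration: assigned to 2 | Calibration: assigned to 3 | Calibration accuracy | Prediction: assigned to 1 | Prediction: assigned to 2 | Prediction: assigned to 3 | Prediction accuracy
---|---|---|---|---|---|---|---|---|---
3000 | 1 | 2931 | 12 | 57 | 97.70% | 1180 | 36 | 84 | 90.77%
3000 | 2 | 11 | 2961 | 28 | 98.70% | 33 | 1210 | 57 | 93.08%
3000 | 3 | 45 | 24 | 2931 | 97.70% | 38 | 15 | 1247 | 95.92%
3000 | Total | | | | 98.03% | | | | 93.26%
1100 | 1 | 1070 | 5 | 25 | 97.27% | 1192 | 25 | 83 | 91.69%
1100 | 2 | 7 | 1084 | 9 | 98.55% | 42 | 1209 | 49 | 93.00%
1100 | 3 | 22 | 6 | 1072 | 97.45% | 51 | 21 | 1228 | 94.46%
1100 | Total | | | | 97.76% | | | | 93.05%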
For the large-size calibration model (3000 samples of each variety in the calibration set), classification accuracy was 98.03% for the calibration set and 93.26% for the prediction set. This could be explained by the abundant information related to internal quality contained in the large number of maize seeds in the calibration set. Nevertheless, a small minority of the samples were identified as the wrong variety. In both calibration and prediction sets, samples of variety I and variety II were more likely to be misclassified as variety III, and samples of variety III were more likely to be misclassified as variety I. The small-size calibration model (1100 samples of each variety in the calibration set) obtained similar results, with an accuracy of 97.76% for the calibration set and 93.05% for the prediction set, which was relatively satisfactory. It is worth noting that the misclassification pattern matched that of the large-size calibration model. A possible explanation is that variety III is more similar in composition and physicochemical characteristics to the other two varieties than they are to each other.
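The accuracies and the misclassification pattern discussed above follow directly from the confusion matrix. As a worked check, the sketch below uses the calibration counts of the 3000-sample model from Table 1 (rows are the actual variety, columns the assigned variety); the function names are illustrative only.

```python
# Calibration confusion matrix of the 3000-sample RBFNN model (Table 1).
cal_3000 = [
    [2931, 12, 57],
    [11, 2961, 28],
    [45, 24, 2931],
]

def per_class_accuracy(cm):
    """Percentage of each actual variety that was assigned correctly."""
    return [100.0 * cm[i][i] / sum(cm[i]) for i in range(len(cm))]

def overall_accuracy(cm):
    """Percentage of all samples on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return 100.0 * correct / total

def dominant_confusion(cm):
    """For each actual variety (1-based), the wrong variety it is most often assigned to."""
    result = {}
    for i, row in enumerate(cm):
        errors = {j + 1: n for j, n in enumerate(row) if j != i}
        result[i + 1] = max(errors, key=errors.get)
    return result

class_acc = [round(a, 2) for a in per_class_accuracy(cal_3000)]
total_acc = round(overall_accuracy(cal_3000), 2)
confusions = dominant_confusion(cal_3000)
```

Applied to this matrix, `dominant_confusion` reproduces the pattern stated in the text: varieties 1 and 2 are most often assigned to variety 3, and variety 3 to variety 1.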
The overall results indicated that it was feasible to use hyperspectral imaging combined with RBFNN models for seed variety discrimination. Additionally, the classification results of the RBFNN models (calibration accuracy of 98.03% and prediction accuracy of 93.26% based on 3000 samples of each variety in the calibration set, and calibration accuracy of 97.76% and prediction accuracy of 93.05% based on 1100 samples of each variety in the calibration set) demonstrated that a calibration model established on a relatively small number of samples could achieve prediction capability comparable to that of a model based on a large number of samples. The reason might be that the 1100 calibration samples of each variety were sufficient to explain the seed variability relevant for discrimination. In addition, the significant reduction in the size of the calibration set helped to build a much simpler discriminant model with lower computational complexity and higher efficiency. Therefore, the optimal size of the calibration set was determined as 1100 samples of each variety.
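One way to turn the "increasing, then stable" trend into a selection rule is to take the smallest calibration size from which prediction accuracy stays within a tolerance of the best value observed. The sketch below is only one possible formalisation: the function name, the tolerance `tol` and the accuracy values are illustrative assumptions and are not results from this study.

```python
def smallest_stable_size(sizes, accuracies, tol=0.5):
    """Smallest size from which accuracy stays within `tol` percentage
    points of the best accuracy observed over all sizes."""
    pairs = sorted(zip(sizes, accuracies))
    best = max(acc for _, acc in pairs)
    for i, (size, _) in enumerate(pairs):
        if all(best - acc <= tol for _, acc in pairs[i:]):
            return size
    # No size satisfies the rule (accuracy dropped at the largest size).
    return pairs[-1][0]

# Illustrative accuracy curve (made-up values, in percent).
toy_sizes = [50, 100, 200, 400, 800]
toy_accuracy = [70.0, 80.0, 88.0, 89.8, 90.0]
chosen = smallest_stable_size(toy_sizes, toy_accuracy, tol=0.5)
```

The tolerance expresses how much prediction accuracy one is willing to trade for a smaller calibration set, so it should be set from the accuracy required in practice.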
The first three PCs explained most of the total variance (PC1 = 92.65%, PC2 = 7.04%, PC3 = 0.19%). Therefore, the PCA loadings of the first three PCs were employed to identify the wavelengths contributing most to discrimination. Fig. 4(a–c) shows the PCA loading plots of the first three PCs over the entire spectral range. Wavelengths located at peaks and valleys were chosen as the informative wavelengths of each PC. In total, 15 wavelengths (998.51, 1005.22, 1116.09, 1196.82, 1203.55, 1220.39, 1227.12, 1264.17, 1307.97, 1372.05, 1382.17, 1432.80, 1453.06, 1466.57, 1469.95 nm) were selected to represent the full spectra in further tests, so that maize seeds could be identified and classified with reduced data processing.
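Selecting wavelengths at the peaks and valleys of PCA loadings can be sketched as below. The example computes loadings with a singular value decomposition and picks interior local extrema; the `min_abs` magnitude filter, the function names and the synthetic single-band spectra are assumptions added for illustration and do not reproduce the preprocessing or selection criteria of this study.

```python
import numpy as np

def pca_loadings(X, n_components=3):
    """Return loadings (n_components x n_bands) and explained-variance ratios."""
    Xc = X - X.mean(axis=0)  # mean-centre each band
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = s ** 2 / np.sum(s ** 2)
    return vt[:n_components], ratio[:n_components]

def loading_extrema(loading, min_abs=0.0):
    """Indices of interior peaks and valleys whose |loading| is at least `min_abs`."""
    v = np.asarray(loading, dtype=float)
    left, mid, right = v[:-2], v[1:-1], v[2:]
    is_extremum = ((mid > left) & (mid > right)) | ((mid < left) & (mid < right))
    idx = np.flatnonzero(is_extremum) + 1
    return idx[np.abs(v[idx]) >= min_abs]

# Synthetic spectra: one Gaussian band centred at index 60 with varying amplitude.
rng = np.random.default_rng(0)
bands = np.arange(200)
bump = np.exp(-((bands - 60) ** 2) / (2 * 5.0 ** 2))
X = rng.uniform(0.5, 1.5, size=(50, 1)) * bump

loadings, ratio = pca_loadings(X)
selected = loading_extrema(loadings[0], min_abs=0.05)
```

Because the sign of a principal component is arbitrary, the sketch treats peaks and valleys alike, which mirrors the choice of both in the text.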
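Table 2 Classification results of RBFNN models on the optimal wavelengths with 3000 and 1100 samples of each variety in the calibration set (rows: actual variety; columns: assigned variety)

Samples of each variety in the calibration set | Actual variety | Calibration: assigned to 1 | Calibration: assigned to 2 | Calibration: assigned to 3 | Calibration accuracy | Prediction: assigned to 1 | Prediction: assigned to 2 | Prediction: assigned to 3 | Prediction accuracy
---|---|---|---|---|---|---|---|---|---
3000 | 1 | 2799 | 48 | 153 | 93.30% | 1176 | 49 | 75 | 90.46%
3000 | 2 | 44 | 2907 | 49 | 96.90% | 40 | 1196 | 64 | 92.00%
3000 | 3 | 166 | 65 | 2769 | 92.30% | 83 | 37 | 1180 | 90.77%
3000 | Total | | | | 94.17% | | | | 91.08%
1100 | 1 | 1010 | 20 | 70 | 91.82% | 1184 | 49 | 67 | 91.08%
1100 | 2 | 19 | 1063 | 18 | 96.64% | 46 | 1195 | 59 | 91.92%
1100 | 3 | 66 | 10 | 1024 | 93.09% | 87 | 43 | 1170 | 90.00%
1100 | Total | | | | 93.85% | | | | 91.00%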
For the large-size calibration model (3000 samples of each variety in the calibration set), the model developed on the optimal wavelengths performed slightly worse than the RBFNN model based on the full spectra, but the classification accuracy of 94.17% for the calibration set and 91.08% for the prediction set was still acceptable. Surprisingly, when both were based on the optimal wavelengths, the small-size calibration model (1100 samples of each variety in the calibration set) achieved results very close to those of the large-size calibration model. Classification accuracies of 93.85% for the calibration set and 91.00% for the prediction set were acceptable, which further validated the representativeness of the chosen optimal size of the calibration set. It is worth noting that the number of wavelengths was reduced from 200 to 15, that is, to only 7.50% of the total wavelengths. From this point of view, the simplified discriminant model was preferable to the RBFNN model developed on the full spectra. Meanwhile, the same pattern was observed in Table 2 as in Table 1: in both calibration and prediction sets, samples of variety I and variety II were more likely to be misclassified as variety III, and samples of variety III were more likely to be misclassified as variety I.
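Reducing the input from 200 bands to the 15 selected wavelengths amounts to indexing the spectral matrix. The sketch below assumes, purely for illustration, that the 200 band centres are evenly spaced between 975.01 and 1645.82 nm (the actual band centres are not listed here) and uses random numbers in place of real spectra; `nearest_band_indices` is a hypothetical helper.

```python
import numpy as np

# The 15 optimal wavelengths (nm) reported in the text.
SELECTED_NM = [998.51, 1005.22, 1116.09, 1196.82, 1203.55, 1220.39, 1227.12,
               1264.17, 1307.97, 1372.05, 1382.17, 1432.80, 1453.06, 1466.57,
               1469.95]

def nearest_band_indices(band_centres, targets):
    """Index of the band centre closest to each target wavelength."""
    band_centres = np.asarray(band_centres, dtype=float)
    return np.array([int(np.abs(band_centres - t).argmin()) for t in targets])

# Assumption: evenly spaced band centres over the analysed range.
band_centres = np.linspace(975.01, 1645.82, 200)
idx = nearest_band_indices(band_centres, SELECTED_NM)

# Placeholder spectra (5 samples x 200 bands) reduced to the selected bands.
spectra = np.random.default_rng(0).random((5, 200))
reduced = spectra[:, idx]

# Share of the original variables that remains after selection, in percent.
fraction = 100.0 * len(SELECTED_NM) / band_centres.size
```

The reduced matrix can be passed to any classifier in place of the full spectra, which is where the saving in computation comes from.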
In all, simpler discriminant models were obtained with a significant reduction in computational load by using the selected optimal wavelengths. In addition, the small-size RBFNN model based on these optimal wavelengths with 1100 samples of each variety in the calibration set was much simpler than all the other corresponding RBFNN models, and its classification results were relatively satisfactory. The overall results indicated that selecting optimal wavelengths by PCA loadings was an effective way to build discriminant models, with a great reduction in computational cost and acceptable model performance.
Calibration accuracy of the SVM models on the full spectra remained over 90%, except for the model with 200 samples of each variety in the calibration set. In particular, the calibration accuracy of SVM models developed on more than 1100 samples of each variety in the calibration set was higher than 95.50%. Prediction accuracy of the SVM models showed a trend similar to that of the RBFNN models: it improved as the number of samples in the calibration set increased, until a stable point was reached. For the discriminant model developed on 1100 samples of each variety, the classification results of the SVM model were slightly worse than those of the corresponding RBFNN model, with a calibration accuracy of 96.61% and a prediction accuracy of 87.31%.
SVM models with 3000 and 1100 samples of each variety in the calibration set were also established on the selected optimal wavelengths. Detailed results are shown in Table 3. The simplified models performed slightly worse than the SVM models developed on the full spectra, but the results were acceptable given that the number of input variables was greatly reduced.
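Table 3 Classification results of SVM models on the optimal wavelengths with 3000 and 1100 samples of each variety in the calibration set (rows: actual variety; columns: assigned variety)

Samples of each variety in the calibration set | Actual variety | Calibration: assigned to 1 | Calibration: assigned to 2 | Calibration: assigned to 3 | Calibration accuracy | Prediction: assigned to 1 | Prediction: assigned to 2 | Prediction: assigned to 3 | Prediction accuracy
---|---|---|---|---|---|---|---|---|---
3000 | 1 | 2743 | 39 | 218 | 91.43% | 1099 | 58 | 143 | 84.54%
3000 | 2 | 39 | 2906 | 55 | 96.87% | 56 | 1162 | 82 | 89.38%
3000 | 3 | 181 | 52 | 2767 | 92.23% | 117 | 36 | 1147 | 88.23%
3000 | Total | | | | 93.51% | | | | 87.38%
1100 | 1 | 995 | 17 | 88 | 90.45% | 1079 | 58 | 163 | 83.00%
1100 | 2 | 11 | 1072 | 17 | 97.45% | 69 | 1156 | 75 | 88.92%
1100 | 3 | 83 | 22 | 995 | 90.45% | 160 | 50 | 1090 | 83.85%
1100 | Total | | | | 92.79% | | | | 85.26%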
Firstly, measurement by near-infrared spectroscopy focuses only on a relatively small part of the sample being analysed to obtain average values of quality information.25 An uneven distribution of chemical constituents within the sample can therefore cause considerable error. In contrast, hyperspectral imaging obtains the spectrum of each pixel within the entire sample. Furthermore, by using the spatial information acquired by the hyperspectral imaging system, the position of each sample could be fixed and the corresponding variety could be presented in classification maps, which is a great advantage over near-infrared spectroscopy. In a previous study, the variety of every single pixel within the sample was predicted and visualized.11 However, considering that there was no need to know the exact category information of each pixel, the average spectra of the samples were more suitable for prediction and visualization. In addition, the ability to identify and visualize a large number of samples simultaneously further confirmed the efficacy of using hyperspectral imaging for real-time detection.
Secondly, the visualization quality was closely related to the performance of the developed calibration models, which depended on the spectral features of a certain number of calibration samples. The influence of calibration sample size on classification accuracy was investigated, and the trends of classification accuracy were used to identify the optimal size of the calibration set. By developing calibration models on the optimal number of calibration samples, satisfactory prediction results were obtained while a heavy computational task was avoided. In addition, to explore whether the method for determining the optimal sample quantity could be applied to other calibration methods, SVM was used for verification. Note that although 1100 samples of each variety were considered appropriate for building calibration models in this study, the number of calibration samples in practical tests should be chosen according to the required accuracy.
Thirdly, a major problem was that the spectral data extracted from hyperspectral images were quite large and suffered from collinearity and redundancy. Dealing with such data incurred a heavy computational cost and placed high demands on the analysis hardware. Optimal wavelength selection was a useful tool to reduce the amount and redundancy of data, and helped to build a robust and simple model. Moreover, the computational cost was reduced, which lowered the hardware requirements for analysis.
Visualization of seed variety was based on the robust and representative calibration model developed on the optimal size of the calibration set and the optimal wavelengths. The satisfactory visualization performance verified the feasibility of using hyperspectral imaging to differentiate and visualize varieties of maize seeds, providing an efficient way for seed purity assessment and quality monitoring in the maize seed industry. In addition, by combining assessment of chemical composition, viability, germination ability, insect damage and diseases, it may be possible in the future to develop a real-time system for comprehensive quality monitoring of maize seeds, as well as of other crop seeds.
This journal is © The Royal Society of Chemistry 2018