Meng Xia (a, b),
Ruifang Yang (a),
Gaofang Yin (a),
Xiaowei Chen (a),
Jingsong Chen (a, b) and
Nanjing Zhao* (a, c)
(a) Key Laboratory of Environmental Optics and Technology, Anhui Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, 350 Shushanhu Road, Hefei 230031, China. E-mail: njzhao@aiofm.ac.cn; mxia@aiofm.ac.cn; gfyin@aiofm.ac.cn; xwchen@aiofm.ac.cn
(b) University of Science and Technology of China, Hefei 230026, China. E-mail: cjs1998@mail.ustc.edu.cn
(c) Institutes of Physical Science and Information Technology, Anhui University, Hefei 230601, China
First published on 22nd December 2022
This paper proposes a novel spectrometric quantification method for nitrate and COD concentrations in water, using a double-channel 1-D convolutional neural network (CNN) designed for relatively long UV-vis absorption spectra (2600 points). To improve the model's resistance to turbidity disturbance, a new dataset augmentation method was applied, and the absorption spectra of nitrate and COD under different turbidity disturbances were successfully simulated. Compared to the PLSR model, the CNN model reduced the RRMSEP from 6.1% to 1.4% in nitrate solution and from 4.5% to 1.3% in COD solution, and increased the regression accuracy from 56% to 93% in nitrate solution and from 68% to 91% in COD solution. Tests on actual solutions under different turbidity disturbances show that the 1D-CNN model had a bias rate of less than 2% in both nitrate and COD solutions, while the worst bias rate of the PLSR method was 15%.
The main water quality parameters include chemical oxygen demand (COD), heavy metal content, nitrate nitrogen (NO3–N), dissolved organic carbon (DOC), and turbidity.2,3 Water quality parameters can be determined by chemical, biological, and physical methods, among which spectroscopy is frequently used to identify substances and quantify them through their absorption spectra.4,5 Because standard laboratory methods often require a longer processing time for steps such as sample pre-treatment or adding reagents,6 absorption spectroscopy combined with advanced optical sensors stands out against them: it allows real-time water quality measurement owing to its quick response, high detection efficiency, high precision and in situ operation.7
Once the spectral data is obtained, it becomes vital to establish an accurate link between the absorption spectra and WQI. Some researchers have already proposed well-established water quality analysis methods using ultraviolet-visible (UV-vis) absorption spectra. Currently, the main methods for modelling water quality parameters are partial least squares regression (PLSR) and principal component regression (PCR). For example, Langergraber et al.8 first used PLSR to develop a quantitative model between absorption spectral peaks and WQIs. Tiecher, T. et al.9,10 applied improved PLSR and SVM methods to the quantification of sediment source contributions based on the UV-vis spectrum. Li et al.11 applied principal component analysis (PCA) to the UV-vis spectrum to detect water quality contamination. However, these methods are often inaccurate in actual surface water applications due to turbidity disturbance, so pre-treatment is often needed to remove the effect of turbidity on the absorption spectra. Hu et al.12 proposed a novel method of turbidity compensation based on the law of Mie scattering. However, Mie scattering theory applies mainly to particles whose diameter ranges from submicron to micron. For particles with diameters from below a micron down to a nanometer, Rayleigh scattering theory is usually used to evaluate light scattering. 
Fraunhofer diffraction theory is often applied to particles with diameters from above a micron up to a millimeter.13,14 In actual waters, the effect of turbidity on absorption spectra is quite complex due to the complex distribution of particle diameters,15,16 and the absorbance caused by turbidity varies at different wavelengths.17,18 In addition, the contribution of turbidity to absorbance varies with solute concentration, because changes in the total absorbance of the solution cause non-linear deviations in absorbance.19,20 As a result, there will always be some deviations in the spectral turbidity compensation calculated by theoretical methods.
With the development of machine learning techniques, some simple machine learning algorithms are often applied to the spectrometric quantification of solutes to calibrate the non-linear deviations in the absorption spectra. For example, Feng et al.21 proposed a new approach for detecting aqueous phenolic contaminants by combining wavelet analysis and Support Vector Machine (SVM). Lu, Y. et al.22,23 successfully detected chlorpyrifos and carbendazim residues in cabbage from visible-near-infrared spectra using both SVM and PLSR methods. SVM regression methods are simple and effective in detecting patterns in complex and non-linear data.24,25 However, when the dataset is large and the data dimensions are high, research shows that neural networks can outperform SVM.26 Spectra-characteristic data under random turbidity disturbance scenarios can be very large in both scale and dimension. Our survey shows that CNN models consistently outperform PLSR models when the dataset is large. Ng, W., et al. applied the CNN model using a total of 14594 samples of visible/near-infrared (vis-NIR), mid-infrared (MIR), and their combined spectra to characterize all soil properties.27 The results showed that, compared to the PLSR model, the CNN model provides an average prediction improvement of 33–42% using vis-NIR and 30–43% using MIR spectral data input. Another CNN model for NIR spectrum calibration was investigated by Cui, C. H., et al. using datasets containing 6998, 1000, and 415 training and 618, 597, and 108 validation samples, respectively, from different sources.28 The results indicated that, compared to the PLSR model, the root-mean-square error of prediction (RMSEP) of the CNN model was reduced from 0.094 to 0.085, and the noise level was reduced from 0.165 to 0.036. However, when the training dataset is small, the PLSR method may outperform the CNN method. For example, Wu, X. J., et al. 
established a 1D-CNN quantitative identification model based on Raman spectra for olive oil.29 The results showed that the RMSEP increased from 0.4594 for the PLS model to 0.7183 for the CNN model, which demonstrates the lower prediction accuracy of the CNN model in that case. In this paper's case, the scale of the training and test dataset is over 200000 samples. Therefore, the neural network method is chosen to complete the tasks of spectrum feature extraction and solute concentration regression. The advantages and disadvantages of the modelling algorithms for spectral data are shown in Table 1. In summary, SVM is only suited to small-sample problems, while PLSR is a supervised learning method that offers an alternative to PCR and handles nonlinear data better than PCR. Considering the large scale of the dataset and the nonlinear characteristics of the spectral data, the PLSR method is chosen as the baseline for comparison with the CNN method.
Algorithm | Algorithm principle | Advantages | Disadvantages |
---|---|---|---|
Partial least squares regression (PLSR) | Based on the maximum information supervised by the response matrix, reflecting data variation, the regression equation between variables is established | It is simple to calculate. It has high precision and a small overall deviation | It has a large local deviation and less independent variable deviation information |
Principal component regression (PCR) | Based on the construction of a regression model using the principal components filtered by the PCA method as features, the original variables are replaced with the new model based on the score coefficient matrix | It is simple to solve multi-collinearity problems. It has a fast running speed | It has difficulty handling nonlinear data |
Support vector machine regression (SVR) | Realized by constructing a linear decision function in a high-dimensional space after dimension increasing | It can handle high-dimensional feature data and works well on nonlinear data | It is not suitable for large sample sizes and requires a large amount of computation |
The best way for a neural network to obtain better generalization capability is to train it with a more extensive and comprehensive dataset; the quality of the training dataset therefore determines, to a large extent, the quality of the final training results. Dataset augmentation is a particularly effective way to improve model performance for specific categorical regression problems.30 It can easily simulate spectra under different turbidity disturbances based on experimental measurement results rather than theoretical calculation results. The problem of water quality analysis under turbidity disturbance can be seen as a classification and regression problem under random spectral noise. However, neural networks have been shown not to be very robust to noise.31 One way to improve the robustness of a neural network to turbidity noise is therefore to add random turbidity noise to the network input before training. Thus, to obtain better modelling performance under random turbidity disturbance, a purpose-designed dataset augmentation method is deployed in this paper.
In this paper, we aim to optimize the neural network structure for relatively long UV-vis absorption spectra. Meanwhile, to improve the model's ability to resist turbidity disturbance, a new dataset augmentation method for absorption spectra of nitrate and COD under different turbidity disturbances was applied. The main difference between the present work and prior studies is that the turbidity interference problem was solved by combining a data augmentation method with a convolutional neural network, without turbidity removal pre-treatment. Finally, the solution concentration regression results of the designed neural network were evaluated and compared to those of the PLSR method.
(1) |
Fig. 1 The implemented spectrophotometer platform incorporates a linear CMOS image sensor chip and FPGA microcontroller (a) and a schematic view of the fused double-beam structure (b). |
Modern convolutional neural networks (CNN) can involve up to billions of parameters, which leads to a tremendous space complexity of the network model. Thus, it is crucial to use parallel computing technology to accelerate the CNN model and keep its deployment lightweight. A method based on hardware improvement has been applied in this research. Moreover, the convolution procedure itself can be accelerated by choosing a proper convolution algorithm.
When a d-dimensional convolution kernel can be expressed as the outer product of d vectors (one vector per dimension), the kernel is called separable. Applying the naïve convolution method to a separable kernel is inefficient, because the result is equivalent to a sequence of d one-dimensional convolutions, each using one of the d vectors. This sequence of 1-D convolutions is faster than convolving with the full outer-product kernel, and representing the kernel by its vectors takes fewer parameters. For example, if the kernel is w elements wide in every dimension, the space and time complexity of the naïve convolution method is O(w^d), while that of the separable convolution method is O(w × d).
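The equivalence described above can be checked numerically. The following Python sketch is an illustration only and is not taken from the original work: the helper `separable_convolve2d`, the input array and the kernel factors `u` and `v` are hypothetical, and the case d = 2, w = 3 is chosen purely for brevity.

```python
import numpy as np
from scipy.signal import convolve2d


def separable_convolve2d(x, u, v):
    """Convolve x with the separable kernel np.outer(u, v) using two 1-D passes."""
    # Pass 1: convolve every row (axis 1) with v.
    rows = np.apply_along_axis(np.convolve, 1, x, v)
    # Pass 2: convolve every column (axis 0) of the intermediate result with u.
    return np.apply_along_axis(np.convolve, 0, rows, u)


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32))      # hypothetical 2-D input
u = np.array([1.0, 2.0, 1.0])      # hypothetical kernel factors (w = 3, d = 2)
v = np.array([-1.0, 0.0, 1.0])

# Naive method: one pass with the full outer-product kernel (w**d = 9 weights).
full = convolve2d(x, np.outer(u, v), mode="full")
# Separable method: two 1-D passes (w*d = 6 weights).
sep = separable_convolve2d(x, u, v)

print(np.allclose(full, sep))
```

The gap between w^d and w × d widens quickly as w or d grows, which is the motivation for preferring the separable form whenever the kernel admits it.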
A 1-dimensional convolutional neural network (1D-CNN) deep learning (DL) architecture inspired by ref. 34 was used for model training and testing. A summary of the architecture is presented in Fig. 3, where 6 feature extraction layers were created between one spectrum input layer and one final regression output layer. Spectra are recognized by the position and intensity of their absorption peaks, which in general are features with large differences. To identify such features, max pooling was used for all pooling layers, and the ReLU function (ReLU(x) = max(x, 0)) was used as the activation function. Each feature extraction layer contains a 1-D convolution block, whose output dimension increases from 16 to 128 across the layers, followed by a max-pooling layer whose pool size and stride were both 2, which also reduces the dimension of each layer's output. With this combination, the convolution block first retains all features, and pooling then keeps only the most “important” features in each local region, which achieves down-sampling and intuitively yields more accurate features. Due to the relatively long spectral sequence (2600 dimensions), a double-channel structure was implemented in the 1D-CNN block to reduce the space and time complexity of the convolution calculation. Some further design choices were adopted: in each 1-D convolution block, the kernel size was set to 3 and dilated convolution was used with dilation rates of 1, 2 and 4 for the successive convolution kernels, and the number of 1-D convolution blocks was made as large as possible. An additional convolution kernel of size 1 was used to reconcile the input and output of the block so that their shapes remain consistent. 
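To make the layer arithmetic concrete, the following NumPy sketch traces a single-channel signal through six simplified feature extraction layers. It is an illustration under stated assumptions, not the authors' TensorFlow model: the helper names are hypothetical, the kernels are random, zero "same" padding and floor-style pooling are assumed, and the double-channel structure, the 16 to 128 output channels and the size-1 shape-matching convolution are omitted.

```python
import numpy as np


def relu(x):
    """ReLU(x) = max(x, 0), applied element-wise."""
    return np.maximum(x, 0.0)


def dilated_conv1d(x, kernel, dilation):
    """Zero-padded 'same' dilated 1-D convolution (odd kernel size assumed)."""
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    out = np.zeros(len(x))
    for j, w in enumerate(kernel):
        start = j * dilation
        out += w * xp[start:start + len(x)]   # tap j looks j*dilation samples along
    return out


def max_pool1d(x, pool=2):
    """Max pooling with pool size == stride; a trailing remainder is dropped."""
    n = (len(x) // pool) * pool
    return x[:n].reshape(-1, pool).max(axis=1)


def feature_layer(x, kernels, dilations=(1, 2, 4)):
    """Three size-3 dilated convolutions with ReLU, followed by max pooling."""
    for kernel, d in zip(kernels, dilations):
        x = relu(dilated_conv1d(x, kernel, d))
    return max_pool1d(x)


rng = np.random.default_rng(0)
spectrum = rng.random(2600)              # hypothetical 2600-point spectrum
lengths = [len(spectrum)]
h = spectrum
for _ in range(6):                       # six feature extraction layers
    kernels = [rng.normal(size=3) for _ in range(3)]
    h = feature_layer(h, kernels)
    lengths.append(len(h))
print(lengths)
```

Under these assumptions, stacking size-3 kernels with dilation rates 1, 2 and 4 gives each output point a receptive field of 1 + 2 × (1 + 2 + 4) = 15 input points per block while using only nine weights, which is one plausible reason for preferring dilation on long spectra.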
For the nitrate dataset, the 200000 spectra samples were split into 199000 training and 1000 test samples using the “train_test_split” function from “scikit-learn”. For the COD dataset, the 280700 spectra samples were split into 279700 training and 1000 test samples.
The model weights were optimized with an adaptive Adam optimizer. The mean squared error was used as the loss function and the coefficient of determination as the accuracy metric to train the network. A batch size of 20 was used to get the best performance of the training procedure35 and each model was trained for up to 50 epochs with 500 steps per epoch. After each run of 50 epochs was finished, the learning rate was lowered by using a different Adam optimizer. To have a fair comparison for different pre-processing techniques, the same architecture settings were used.
All analyses were carried out using TensorFlow GPU 2.6.0 on dual GeForce RTX 3090 GPUs (Nvidia Corporation, Santa Clara, California, USA) under CUDA 11.2, on a small server computer equipped with a 2.20 GHz Intel® Xeon® Silver 4210 CPU (Intel Corporation, Santa Clara, CA) and 32 GB RAM, running the Ubuntu 9.4.0 operating system and Python 3.8.2.
Due to non-linear deviations from the Beer–Lambert law, the effect of turbidity on the absorption spectra cannot be simply understood as the result of superimposing the turbidity spectrum on the solute spectrum.36 In other words, the effect of turbidity on the absorption spectra can vary with solute concentration. To better understand the influence of turbidity on the absorption spectra at different solution concentrations, the absorption spectra of the substances were measured with and without turbidity at each concentration, and the difference between the two spectra was taken to assess the contribution of turbidity at that concentration. This difference is called the Turbidity Residual Spectrum (TRS). Following the same procedure, the discrete TRS at each concentration was calculated, and the results are shown in Fig. 4.
Fig. 4 Discrete dataset of turbidity residual spectra under COD concentration of 10 mg L−1 (a), 30 mg L−1 (b), 50 mg L−1 (c), and 70 mg L−1 (d). |
The turbidity residual spectra dataset was then expanded by cubic spline interpolation using the function “interpolate.interp1d()” of the Python library “SciPy”. As depicted in Fig. 5, to obtain a continuous TRS dataset, the interpolation was applied twice, once for turbidity filling and once for COD concentration filling, so that TRS data covering 0–90 NTU can be found for any COD concentration (0–70 mg L−1). The interpolation was first applied over the turbidity values at each separate COD concentration, and then over the COD concentration at each of the interpolated turbidity values. The final continuous dataset structure is depicted in Table 2.
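The two-pass interpolation can be sketched with `scipy.interpolate.interp1d` as follows. This is a minimal illustration rather than the authors' script: the function `densify_trs`, the turbidity grid, the shortened wavelength axis and the synthetic TRS values are all hypothetical stand-ins for the measured data, and only the four COD concentrations of Fig. 4 are used as nodes.

```python
import numpy as np
from scipy.interpolate import interp1d


def densify_trs(trs, conc, turb, conc_new, turb_new):
    """Fill a discrete TRS array of shape (n_conc, n_turb, n_wavelengths).

    Pass 1 interpolates along turbidity at each measured concentration;
    pass 2 interpolates along concentration at each interpolated turbidity.
    """
    trs_turb = interp1d(turb, trs, kind="cubic", axis=1)(turb_new)
    return interp1d(conc, trs_turb, kind="cubic", axis=0)(conc_new)


conc = np.array([10.0, 30.0, 50.0, 70.0])                    # mg/L, as in Fig. 4
turb = np.array([0.0, 15.0, 30.0, 45.0, 60.0, 75.0, 90.0])   # NTU, hypothetical grid
wavelengths = np.linspace(200.0, 800.0, 50)                  # nm, hypothetical and shortened

# Synthetic stand-in for measured TRS (turbid spectrum minus clear spectrum).
decay = (wavelengths / 200.0) ** -2.0
trs = (0.01 * turb[None, :, None]
       * (1.0 + 0.01 * conc[:, None, None])
       * decay[None, None, :])

conc_new = np.linspace(10.0, 70.0, 61)    # 1 mg/L steps
turb_new = np.linspace(0.0, 90.0, 91)     # 1 NTU steps
dense = densify_trs(trs, conc, turb, conc_new, turb_new)
print(dense.shape)
```

Note that `kind="cubic"` needs at least four nodes along the interpolated axis, and that `interp1d` raises an error by default for query points outside the measured range, so the new grids must stay within the measured concentration and turbidity limits.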
Fig. 5 Interpolation operation used twice respectively for turbidity concentration filling (a) and nitrate concentration filling (b). |
As depicted in Fig. 6, the discrete TRS dataset under different nitrate solutions was calculated by following the same steps. The dataset was then expanded using the same interpolation method, and the interpolation result is shown in Table 3.
Fig. 6 Dataset of turbidity residual spectra under nitrate concentration of 0 mg L−1 (a), 1 mg L−1 (b), 2 mg L−1 (c), 3 mg L−1 (d), 4 mg L−1 (e), 5 mg L−1 (f). |
To further validate the accuracy of the dataset augmentation method, the absorption spectra of a 5 mg L−1 nitrate and a 50 mg L−1 COD solution were then tested at turbidity of 45 NTU, and the test results were compared with the simulation spectra using the indicators of standard bias. The standard bias at each wavelength is calculated using the following formula:
(2) |
Fig. 7 Interpolation spectra and true spectra of nitrate (a) and COD (c) and normalized bias between the interpolation spectra and true spectra of nitrate (b) and COD (d). |
A distribution summary of reference nitrate and COD solution values in the test and training sets is presented in Fig. 8. The dataset was produced by adding the corresponding turbidity residual spectra to the nitrate/COD absorption spectra at different concentrations. The test set, drawn by random sampling with the ‘sample’ function of the Python library “pandas”, represents the nitrate and COD concentration ranges well.
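The augmentation step described above amounts to an element-wise sum of a clean spectrum and the TRS for the same concentration at a randomly chosen turbidity. The sketch below illustrates this with NumPy; the function `build_augmented_set`, the grids and the synthetic arrays are hypothetical, and NumPy random indexing is used here in place of the pandas sampling mentioned in the text.

```python
import numpy as np


def build_augmented_set(clean, conc_grid, trs, turb_grid, n_samples, seed=0):
    """Simulate turbid spectra as clean spectrum + TRS at random grid points.

    clean: (n_conc, n_wavelengths) turbidity-free spectra
    trs:   (n_conc, n_turb, n_wavelengths) continuous TRS on the same grids
    """
    rng = np.random.default_rng(seed)
    ci = rng.integers(0, len(conc_grid), size=n_samples)   # random concentrations
    ti = rng.integers(0, len(turb_grid), size=n_samples)   # random turbidities
    spectra = clean[ci] + trs[ci, ti]
    return spectra, conc_grid[ci], turb_grid[ti]


# Hypothetical stand-in data on 1 mg/L and 1 NTU grids.
conc_grid = np.linspace(0.0, 70.0, 71)
turb_grid = np.linspace(0.0, 90.0, 91)
profile = np.exp(-np.linspace(0.0, 3.0, 40))               # synthetic absorption band
clean = conc_grid[:, None] * profile[None, :]
trs = 0.001 * turb_grid[None, :, None] * np.ones((71, 1, 40))

X, y, turbidity = build_augmented_set(clean, conc_grid, trs, turb_grid, n_samples=5000)

# Hold out 1000 randomly chosen samples as a test set.
order = np.random.default_rng(1).permutation(len(X))
X_test, y_test = X[order[:1000]], y[order[:1000]]
X_train, y_train = X[order[1000:]], y[order[1000:]]
print(X_train.shape, X_test.shape)
```

Because the regression target is the solute concentration only, the turbidity value is returned for inspection but is not used as a label.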
Fig. 9 Trends in loss (MSE) and accuracy (coefficient of determination) of the model during the training procedure. |
The best-performing models were saved after the training process was completed. To further evaluate the performance of the models, the 1000-sample test sets of COD and nitrate spectra produced in Section 3.1 were fed into the two corresponding models. Fig. 10 and 11 compare the PLSR and 1-D CNN methods on the predicted regression results of COD and nitrate solution concentration using the test sets. The true values and the values predicted by the two methods are shown directly in Fig. 10(a), (c), 11(a) and (c). Where the round point and the triangle point in the figure overlap, the regression is highly accurate. As can be seen from the figures, in both COD and nitrate solutions, the bias between the true values and the predicted regression values was much smaller in the CNN model than in the PLSR model. To further quantify the regression accuracy, a scatter plot was drawn with the true values on the horizontal axis and the predicted regression values on the vertical axis, and a linear fit between the true and predicted values was made, as shown in Fig. 10(b), (d), 11(b) and (d). The higher the regression accuracy, the closer the slope of the fitted line is to one. Thus, the accuracy of the model can be evaluated by checking the linearity and the slope of the linear fit. The R-square of the CNN model for nitrate and COD solution was 0.99943 and 0.99946, respectively, while that of the PLSR model was 0.98869 and 0.99407. The slope of the CNN model for nitrate and COD solution was 0.99575 and 1.00021, respectively, while that of the PLSR model was 0.98804 and 0.9935. Consequently, both the slope and the R-square indicate that the CNN model achieves higher accuracy in nitrate and COD solutions under random turbidity disturbance.
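The slope and R-square of such a predicted-versus-true fit can be obtained with an ordinary least-squares line. The following sketch shows one way to compute them with NumPy; the function name and the example values are hypothetical, and the original analysis may have used a different fitting tool.

```python
import numpy as np


def linear_fit_metrics(y_true, y_pred):
    """Fit y_pred = slope * y_true + intercept and return slope, intercept, R-square."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    slope, intercept = np.polyfit(y_true, y_pred, 1)
    fitted = slope * y_true + intercept
    ss_res = np.sum((y_pred - fitted) ** 2)          # scatter around the fitted line
    ss_tot = np.sum((y_pred - y_pred.mean()) ** 2)   # total scatter of the predictions
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical example: predictions with a small random error.
rng = np.random.default_rng(0)
true_conc = np.linspace(1.0, 70.0, 200)
pred_conc = true_conc + rng.normal(scale=0.5, size=true_conc.size)
slope, intercept, r2 = linear_fit_metrics(true_conc, pred_conc)
print(slope, intercept, r2)
```

A slope close to one together with an R-square close to one indicates predictions that are both unbiased in scale and tightly clustered around the true values.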
Moreover, RRMSEP was also used as an evaluation indicator for the regression model.37,38 The RRMSEP can be calculated using the following formula:
(3) |
(4) |
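Since the formulas did not survive in this copy, the sketch below uses a commonly used definition as an assumption: RMSEP is the root of the mean squared prediction error, and RRMSEP is RMSEP divided by the mean of the reference values, expressed as a percentage. The authors' exact normalization may differ, and the function names are hypothetical.

```python
import numpy as np


def rmsep(y_true, y_pred):
    """Root-mean-square error of prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


def rrmsep(y_true, y_pred):
    """Relative RMSEP in percent, normalized by the mean reference value (assumed)."""
    y_true = np.asarray(y_true, dtype=float)
    return 100.0 * rmsep(y_true, y_pred) / np.mean(y_true)


# Hypothetical example values in mg/L.
reference = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([10.5, 19.5, 30.5, 39.5])
print(rmsep(reference, predicted), rrmsep(reference, predicted))
```

Normalizing by the mean reference value makes the indicator comparable between the nitrate and COD datasets, whose concentration ranges differ.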
To compare the accuracy of the two models more intuitively, a prediction is defined as correct when the bias between its true value and its predicted regression value is less than 5%. The comparison result for the nitrate and COD solution is depicted in Fig. 12. The figure shows that the accuracy of nitrate concentration regression increased from 56% for the PLSR model to 93% for the CNN model, and the accuracy of COD concentration regression increased from 68% for the PLSR model to 91% for the CNN model.
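This accuracy criterion can be written as a short function. The sketch below assumes that the bias is the absolute deviation relative to the true value and that all true values are non-zero; the function name and the example values are hypothetical.

```python
import numpy as np


def accuracy_rate(y_true, y_pred, tolerance=0.05):
    """Fraction of predictions whose relative bias is below the tolerance.

    Assumes every true value is non-zero, since the bias is taken relative to it.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    relative_bias = np.abs(y_pred - y_true) / np.abs(y_true)
    return float(np.mean(relative_bias < tolerance))


# Hypothetical example: two of the four predictions fall within 5%.
reference = np.array([10.0, 10.0, 10.0, 10.0])
predicted = np.array([10.4, 10.6, 9.6, 12.0])
print(accuracy_rate(reference, predicted))
```

Unlike RRMSEP, which averages the error over the whole test set, this rate counts how many individual predictions meet the tolerance, so a few large errors lower it only slightly while many moderate errors lower it sharply.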
Fig. 12 Accuracy comparison between PLSR and 1-D CNN method for the result of nitrate (a) and COD (b) solution concentration regression using the test sets. |
When the PLSR algorithm was applied, non-linearity appeared between the predicted and true values of both COD and nitrate solutions in the high-concentration groups. This is due to the deviation from linearity of the absorption spectrum peak at high solute concentrations, which is caused by the CMOS image sensor showing different photoelectric conversion efficiencies at different light intensities and wavelengths.39 The CNN method, on the other hand, learned this non-linear deviation generated by the hardware system of the spectrometer from the large dataset and successfully corrected it.
To better showcase the performance of our CNN method, the regression results of other non-linear methods, namely SVR and the k-nearest neighbor (KNN) method, are listed in Table 4. A radial basis function (RBF) was used as the kernel function of the SVR method to handle the non-linear problem. The results indicate that SVR is the most time-consuming method when processing long spectral data in a large dataset, while KNN is the fastest method at the cost of a low accuracy rate.
Algorithm | Average RRMSEP | Average result accuracy rate | Processing time (training time included) |
---|---|---|---|
Partial least squares regression (PLSR) | 5.3% | 61.5% | 657 seconds |
Support vector machine regression (SVR) | 4.6% | 86.4% | 1569 seconds |
k-Nearest neighbor (KNN) | 5.5% | 59.4% | 235 seconds |
Double-channel 1-D convolution neural network (1-D CNN) | 1.3% | 92% | 1395 seconds |
Finally, the absorption spectra of nitrate solutions and COD solutions blended with 10, 50, and 80 NTU turbidity solutions were tested. The spectral data was fed separately into the trained 1D-CNN model and the PLSR model. The regression results of the two models are depicted in Fig. 13. The error bars in Fig. 13(a) and (b) represent the error level of the three regression results (under 10, 50 and 80 NTU) at each concentration. The bias in Fig. 13(c) and (d), calculated by formula (2), represents the relative deviation between the true values and the predicted regression values of the solute. The results indicate that the CNN had a bias rate of less than 2% for both nitrate and COD concentration in actual solution tests. In comparison, the worst bias rate of the PLSR method in actual solution tests was 15%.
From the modelling process and the analysis of the regression results, it can be concluded that the advantage of the proposed method is that, by combining dataset augmentation with a CNN, turbidity interference is successfully excluded during the spectrometric quantification of nitrate/COD solutions without any spectral pre-treatment. However, the proposed method has a drawback: its application scenarios are highly dependent on the training dataset.
The augmented training dataset was fed into the PLSR model and the CNN model, and the training results of the two models were compared. The R-square values between the true values and predicted regression values of nitrate and COD solution were 0.98869 and 0.99407 for the PLSR model, and 0.99943 and 0.99946 for the CNN model. Compared to the PLSR model, the CNN model reduced the RRMSEP from 6.1% to 1.4% in nitrate solution and from 4.5% to 1.3% in COD solution, and increased the regression accuracy from 56% to 93% in nitrate solution and from 68% to 91% in COD solution. Finally, the absorption spectra of further groups of prepared nitrate and COD solutions blended with different turbidity solutions were tested. The test results show that the 1D-CNN model achieved a bias rate of less than 2% in both nitrate and COD solutions, while the worst bias rate of the PLSR method was 15%. All the test results indicate that the 1D-CNN model can successfully extract and quantify the characteristic information of nitrate/COD solutions from absorption spectra under random turbidity disturbance, with higher regression accuracy than the PLSR method.
This journal is © The Royal Society of Chemistry 2023 |