Peng Li‡a, Jiangbei Qu‡a, Yiliang He*ab, Zhang Boa and Mengke Peia
aSchool of Environmental Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China. E-mail: ylhe@sjtu.edu.cn
bChina-UK Low Carbon College, Shanghai Jiao Tong University, Shanghai 200240, PR China
First published on 1st June 2020
In recent years, rural sewage treatment facilities have grown rapidly in China, yet the water quality of their effluent has not been well monitored. The detection of chemical oxygen demand (COD) via ultraviolet-visible (UV-Vis) spectroscopy is an emerging technology whose low cost and easy maintenance make it appropriate for the on-line monitoring of effluents from rural sewage treatment facilities. Because sewage treatment devices in rural regions are numerous and usually widely scattered, it is difficult to calibrate the COD estimation model for each monitoring site. Hence, a COD estimation model with global calibration is specifically needed for application in rural regions. However, little research has been performed on real rural sewage, and the accuracy and robustness of existing models leave much to be desired. Consequently, a practical COD detection method with UV-Vis spectroscopy was established in this study. The COD estimation model was globally calibrated with effluents from rural sewage treatment devices. To avoid misleading evaluations of model performance caused by differences in the COD concentration range of the training sets, two new criteria, namely the Root Mean Square Relative Error (RMSRE) and Relative Error Variance (REV), were proposed to evaluate the model accuracy and robustness. Differences in the organic composition, as characterized by excitation–emission matrix (EEM) fluorescence spectroscopy, were shown to significantly affect the accuracy of the global calibration model. Through a comparison of the partial least squares (PLS), support vector machine (SVM), and back-propagation neural network methods, PLS was verified to attain sufficient accuracy and to be suitable for modeling with global calibration.
A simplified modeling method was proposed to replace the absorption spectra at the full wavelength band with the absorbance at some specific wavelengths that were selected by interval partial least-squares regression (iPLSR) and synergy interval partial least-squares regression (siPLSR). In this study, the simplified model was proven to be reliable with three specific wavelengths: 251, 356, and 363 nm. An on-line COD monitor utilizing UV-Vis spectroscopy was thus developed for combining with the global calibration model.
Chemical oxygen demand (COD) is a key indicator of the concentration of reductive contaminants and is considered to directly reflect the pollution level, so on-line COD monitors are indispensable for the regulation of discharges.5 The conventional on-line COD monitor is based on oxidation by potassium dichromate;6 however, the device is very expensive and incurs high maintenance costs. Taking HACH as an example, the device costs more than USD 17000, and the reagent cost is around USD 0.5 per sample. Besides, the detection requires additional toxic chemicals and takes more than 2 hours.5 Spectroscopic analyses, such as UV-Vis spectroscopy,7 fluorescence spectroscopy8 and near-infrared spectroscopy,9 have been used to assess wastewater quality, especially for predicting the COD concentration. These monitoring methods are rapid, non-destructive, and environmentally friendly. Moreover, devices that use optical techniques for COD monitoring cost much less than conventional ones.
The UV absorbance at a specific wavelength, such as 254 nm (A254),10 or at multiple wavelengths in a narrow range,11 was at first adopted to characterize the COD of wastewater. However, the COD prediction was sometimes inaccurate because insufficient information was obtained from these limited wavelengths. In particular, when the model was applied at different sampling sites, the relative deviation could reach as much as 45–50%. Many researchers therefore attempted to use UV-Vis absorption spectra over broader wavelength ranges, usually from 200 nm to 700 nm, to construct COD estimation models for wastewater quality monitoring.7,12 On-line and in situ UV-Vis spectrophotometers have been extensively applied in recent years,5,13,14 and could be a promising on-line COD monitoring technology too. Although UV-Vis spectroscopic models, which benefit from the additional information in much broader spectra, have been identified as a possible route to improve COD estimation in wastewater, the accuracy and robustness of the predictions still leave much to be desired.
COD estimation models in previous studies were mainly calibrated with samples of relatively constant composition, such as simulated wastewater prepared in a laboratory,15 or water samples from a single site, such as a wastewater treatment plant16 or a surface water body,17 which is called local calibration.18 However, when a model set up with local calibration is applied to another monitoring site, the accuracy of the prediction decreases due to the differences in the organics composition. Therefore, global calibration, which refers to calibrating a model with samples from different sampling sites in the target area,18 should be adopted to extend the application of one model to many more monitoring sites. For an on-line COD monitor serving rural sewage treatment facilities, a global calibration model is appropriate because it is difficult to calibrate the model for each treatment facility. Clearly, it is necessary to quantitatively assess the accuracy of a globally calibrated model and to verify the feasibility of the developed COD detection method. However, this feasibility has rarely been investigated, and it might be affected by differences in the organics composition among the monitored treatment devices. Many statistical methods have been adopted to establish COD estimation models with UV-Vis absorption spectra, including traditional statistical methods (Linear Regression (LR), Partial Least Squares (PLS)) and machine learning methods (Support Vector Machine (SVM), Back-Propagation Neural Network (BPNN)).18 Some researchers proved that linear methods, such as PLS and LR, are accurate enough to find a correlation between the spectral data and COD concentration according to the Lambert–Beer law,19 while machine learning methods were also found to offer good performance.20 However, there is no consistent conclusion yet on the best modeling method, especially for the global calibration modeling of rural sewage.
Given the differences in the organics composition in waters, a suitable modeling method might need to be specially investigated.
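The appeal of linear methods mentioned above can be motivated with an idealized form of the Lambert–Beer law. The following is a sketch under simplifying assumptions that are not stated in the text: the absorbing species do not interact, the optical path length $l$ is fixed, and COD is a weighted sum of the species concentrations $c_i$. The symbols $\varepsilon_i$, $k_i$, $b_0$ and $b_j$ are introduced here for illustration only.

```latex
% Absorbance at wavelength \lambda as a sum over absorbing species (Lambert--Beer law)
A(\lambda) = \sum_{i} \varepsilon_i(\lambda)\, l\, c_i
% If COD is (approximately) a weighted sum of the same concentrations,
\mathrm{COD} \approx \sum_{i} k_i\, c_i ,
% then COD can be approximated by a linear combination of absorbances
% measured at several wavelengths \lambda_1, \dots, \lambda_m:
\mathrm{COD} \approx b_0 + \sum_{j=1}^{m} b_j\, A(\lambda_j)
```

Under these assumptions the regression coefficients $b_j$ depend on which species are present, which is one way to see why a change in organic composition between sites can degrade a model calibrated elsewhere.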
In previous studies, the quality of COD estimation models was evaluated by a series of criteria, such as R2, root-mean-square error (RMSE), the ratio of the RMSE to the standard deviation of the observations (RSR), and the range to error ratio (RER).21 However, these criteria can be influenced by the sample quantity or concentration range and lead to misleading conclusions under some conditions. For example, the frequently used coefficient of determination (R2) may produce high values for poor models; it is insensitive to additive and proportional differences and oversensitive to extreme high values.22–24 No universal criteria have been established yet to fairly assess the predictive performance of different models.
The COD estimation method utilizing UV-Vis absorption spectroscopy usually uses wavelengths in the range of 200–700 nm,25 so it needs to be equipped with a broadband light source, such as a xenon lamp. Meanwhile, some researchers used wavelength regions narrower than 200–700 nm to construct models for different wastewaters, while still guaranteeing the accuracy and robustness of their results.5,12 The full spectrum could therefore be replaced by the absorption in a representative wavelength region while maintaining adequate accuracy and robustness. Hence, a series of narrowband light sources could replace the broadband ones, and the cost of the device could decrease. However, few existing studies have focused on the selection of the optimal wavelength region to date.
The COD estimation method utilizing UV-Vis absorption spectroscopy needs comprehensive investigation. The aims of this study included: (1) to propose a practical COD detection method utilizing UV-Vis spectroscopy; (2) to evaluate the modeling methods in terms of accuracy and robustness with samples from rural sewage treatment facilities; (3) to determine the representative wavelength regions with interval partial least-squares regression (iPLSR) and variable selection; (4) to verify the feasibility of the developed COD detection method with a lab manufactured on-line COD monitor based on UV-Vis absorption spectroscopy. All the studies were performed with an aim to set up a global calibration for the optimization of COD measurement, which can then be applied to any wastewater, not just the effluent from biotreatment systems used for rural sewage.
Three methods of model construction for COD estimation, namely partial least squares (PLS) regression, support vector machine (SVM), and back-propagation neural network (BP-NN), were used and compared with the same data in this study. PLS regression, a statistical method related to principal components regression, includes cross-validation and outlier identification. The optimal number of components can also be determined automatically by the software. SVM is a supervised learning model with associated learning algorithms that analyze data for classification and regression analysis. SVM regression and classification are very useful for detecting patterns in complex and non-linear data.27 Here, the method was used in conjunction with leave-one-out cross-validation (LOOCV), with the squared difference between the observed data and the SVM regression estimates as the objective function for selecting the most predictive SVM model. Both the PLS and SVM methods were performed using Unscrambler X 10.4. The BP-NN is an artificial neural network trained by backpropagation. It can create both linear and non-linear regression models with satisfactory properties. The sigmoid function was selected as the activation function of the BP-NN, and the BP-NN model was built with the AMORE package in R.
(1) |
(2) |
(3) |
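The bodies of eqn (1)–(3) are not available in this text. As a hedged reading based only on the names of the criteria, they could take the following form, where $y_i$ is the measured COD, $\hat{y}_i$ the predicted COD, $e_i$ the relative error and $n$ the number of samples. Which three quantities the numbered equations define, the sign convention of $e_i$, and the normalization ($n$ rather than $n-1$) are all assumptions.

```latex
% Relative error of sample i (assumed sign convention)
e_i = \frac{\hat{y}_i - y_i}{y_i}, \qquad \bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i

% Root-mean-square error, in the units of COD (mg L^{-1})
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}

% Root mean square relative error (dimensionless)
\mathrm{RMSRE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^{2}}

% Relative error variance (dimensionless)
\mathrm{REV} = \frac{1}{n}\sum_{i=1}^{n}\left(e_i - \bar{e}\right)^2
```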
The criteria of the COD estimation model, including R2, REV, RMSE, and RMSRE, were calculated with various data sets that had different COD concentration ranges. By comparing the criteria values obtained from the different data sets, the influence of the COD concentration range on the criteria could be investigated. The results are shown in Table 1, in which data sets A1, A2, and A3 were collected from different sampling sites, with COD concentrations ranging from 20 to 100 mg L−1, 20 to 200 mg L−1, and 70 to 150 mg L−1, respectively; Data set C was collected from one sampling site at different times. When the model was calibrated using Data set A1, the R2 reached 0.843 and the RMSE was 6.622. When calibrated using Data set A2, the R2 increased to 0.949 and the RMSE increased to 9.972. According to R2, the predictive performance of the model seemed to improve from A1 to A2, but according to RMSE, the opposite conclusion would be reached. However, for a reliable model, the performance, including accuracy and robustness, should be approximately uniform in each COD concentration region. In fact, neither the measuring instruments nor the modeling methods were improved; only the calibration data differed, in the scale of the COD concentration range for A1 and A2. It was obvious that the scale of the COD concentration range could affect the R2 and RMSE of the model and possibly lead to misleading conclusions. Similar results can also be found in previous reports. Charef et al.29 established a COD estimation model for municipal sewage, and the R2 of their model was 0.95 using samples with COD concentrations ranging from 112 to 422 mg L−1, while Langergraber et al.7 obtained an R2 of 0.978 when the COD concentration ranged from 38 to 568 mg L−1 in similar studies. Given the possibility of overfitting, it would be imprecise to assess the model based on R2.
Taking the influence of the data set range into account, REV and RMSRE were proposed as model evaluation criteria. As shown in Table 1, both REV and RMSRE were very close between the models calibrated using A1 and A2, indicating that the interference of the scale of the COD concentration range could be avoided when assessing the model with REV and RMSRE.
Data set | A1 (Cal.) | A1 (Val.) | A2 (Cal.) | A2 (Val.) | A3 (Cal.) | A3 (Val.) | C (Cal.) | C (Val.)
---|---|---|---|---|---|---|---|---
RMSE | 6.622 | 6.508 | 9.972 | 10.725 | 12.121 | 12.988 | 5.170 | 5.684
R2 | 0.843 | 0.887 | 0.949 | 0.921 | 0.743 | 0.790 | 0.933 | 0.876
RMSRE | 0.141 | 0.137 | 0.137 | 0.143 | 0.138 | 0.140 | 0.104 | 0.106
REV | 0.0187 | 0.0199 | 0.0193 | 0.0178 | 0.0178 | 0.0183 | 0.011 | 0.0116
As shown in Table 1, the R2 of the model calibrated using Data set A1 was larger than that calibrated using Data set A3, while the RMSE of the former was smaller than that of the latter. The results appeared to indicate that the performance of the model was worse in the COD concentration range of 70–150 mg L−1. However, as previously mentioned, because the model has been confirmed to be reliable in the target range of COD concentration, the performance was supposed to be approximately uniform. Hence, R2 would not be affected by the change in the observation value range according to the statistics. Data sets A1 (20–100 mg L−1) and A3 (70–150 mg L−1) had the same scale of COD concentration range but covered different regions. Hence, besides the scale of the COD concentration range, the R2 and RMSE were shown to be affected by the region too. In terms of the RMSRE and the REV, both models had the same accuracy and robustness. Consequently, when assessed by these two criteria, the performance of the predictive models is not affected when the COD concentration regions have the same width but different values.
Data set A2 was obtained from water samples collected from different devices, while Data set C came from one device sampled at different times. The organic composition of the samples in Data set A2 was therefore expected to be more diverse than that in Data set C. The performance of a model can be better when it is calibrated using samples with a constant organic composition. Zhao et al.30 used diluted potassium hydrogen phthalate solutions to establish the relationship between the COD concentration and A254–A546, and the R2 was up to 0.995. It is therefore reasonable to expect that the model calibrated using Data set C would perform better than that calibrated using Data set A2. The criteria of RMSE, REV, and RMSRE confirmed this hypothesis. However, the R2 of the model calibrated using Data set A2 was very close to that calibrated using Data set C, which shows how misleading the criterion of R2 can be.
When the COD estimation method by UV-Vis spectroscopy is applied in practical engineering, the influences of various factors, such as the COD concentration range and the organic components in the water, should be considered. Many researchers have stated that the frequently used coefficient of determination (R2) might reach a very high value even for a poor model. The RMSE is related to the COD concentration range. Hence, based on the RMSE, the accuracy cannot be compared between models calibrated using data sets with different COD concentration ranges. It was shown that this problem can be solved by the RMSRE, which evaluates the prediction accuracy of the model and, unlike the RMSE, is not affected by the magnitude of the measured values, so it can be used to compare the predictive performance of different models. The REV was likewise shown to be a suitable criterion for evaluating the predictive robustness of the model. Both criteria are simple to calculate and gave reliable evaluations here. It is thus suitable to choose the RMSRE and REV to evaluate the performance of the model.
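The range dependence discussed above can be illustrated with a small numerical sketch. It assumes particular definitions of the criteria (relative error taken as (predicted − true)/true, REV as the population variance of the relative errors), which may differ in detail from the paper's equations. The data are synthetic and hypothetical: every prediction is off by exactly ±10% of the true value, so the "model quality" is identical in both sets and only the concentration range differs.

```python
import math

def relative_errors(true, pred):
    """Signed relative error of each prediction (assumed definition)."""
    return [(p - t) / t for t, p in zip(true, pred)]

def rmse(true, pred):
    """Root-mean-square error, in the units of the measurements."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(true, pred)) / len(true))

def r2(true, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_t) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot

def rmsre(true, pred):
    """Root mean square relative error (dimensionless)."""
    re = relative_errors(true, pred)
    return math.sqrt(sum(e * e for e in re) / len(re))

def rev(true, pred):
    """Relative error variance (population variance, an assumption)."""
    re = relative_errors(true, pred)
    mean_re = sum(re) / len(re)
    return sum((e - mean_re) ** 2 for e in re) / len(re)

def synthetic_set(low, high, step=10, rel=0.10):
    """Hypothetical data: predictions alternate between +rel and -rel error."""
    true = [float(c) for c in range(low, high + 1, step)]
    pred = [t * (1 + rel) if i % 2 == 0 else t * (1 - rel)
            for i, t in enumerate(true)]
    return true, pred

narrow = synthetic_set(20, 100)  # same span as the 20-100 mg/L range
wide = synthetic_set(20, 200)    # same span as the 20-200 mg/L range

for name, (t, p) in (("narrow", narrow), ("wide", wide)):
    print(f"{name}: RMSE={rmse(t, p):.2f} R2={r2(t, p):.3f} "
          f"RMSRE={rmsre(t, p):.3f} REV={rev(t, p):.4f}")
```

In this toy case both RMSE and R2 increase when the range widens, although the relative quality of the predictions is unchanged, while RMSRE stays at 0.10 and REV changes only marginally. This mirrors the qualitative pattern reported for A1 and A2, without reproducing the paper's data.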
Method | PLS (Cal.) | PLS (Val.) | SVM (Cal.) | SVM (Val.) | BP-NN (Cal.) | BP-NN (Val.)
---|---|---|---|---|---|---
RMSE | 11.03 | 10.384 | 10.88 | 11.472 | 11.797 | 10.650
R2 | 0.949 | 0.945 | 0.953 | 0.931 | 0.942 | 0.979
RMSRE | 0.159 | 0.151 | 0.1625 | 0.1737 | 0.1696 | 0.1495
REV | 0.023 | 0.0224 | 0.0237 | 0.025 | 0.0238 | 0.0213
The choice for the best modeling method needs to take both the mathematical complexity and the performance into consideration, and the model performance in terms of accuracy and robustness might be affected by the realistic practical conditions. There has not yet been a universal modeling method reported with acceptable performance for different kinds of wastewater. Establishing a specific model for each type of wastewater may thus be the most efficient way. In this study, the PLS method was found to be suitable for estimation modeling of the COD concentration in the effluent of biotreatment devices used for rural sewage.
Data set | A (Cal.) | A (Val.) | C (Cal.) | C (Val.)
---|---|---|---|---
RMSE | 11.990 | 10.720 | 5.270 | 4.640
R2 | 0.943 | 0.934 | 0.928 | 0.955
RMSRE | 0.139 | 0.109 | 0.119 | 0.099
REV | 0.019 | 0.011 | 0.014 | 0.008
The organic compositions of the samples in Group A and Group C were investigated. Most organics in the effluent of sewage biotreatment devices emit fluorescence under excitation. The characteristics of the excitation–emission fluorescence depend on the type of organics. Hence, fluorescence excitation–emission matrix regional integration (FRI) could be used to analyze the organics composition of the water samples.34 The excitation–emission matrix (EEM) fluorescence spectra of the water samples were divided into five regions, each representing a different type of compound. The proportions of the five types of organic compounds in Group A and Group C are shown in Fig. 2. The main substances in the samples were soluble microbial by-product-like substances (10–20%), fulvic acid-like substances (10–20%), and humic acid-like substances (60–70%). The variances of the proportions of humic acid-like, soluble microbial by-product-like, and fulvic acid-like substances were 0.0039, 0.0006, and 0.00147 for Group A, and correspondingly 0.0005, 0.00013, and 0.00017 for Group C. It was obvious that the organic composition varied less among the samples of Group C than among those of Group A. As the water samples of Group C were from the same biotreatment device, which was operated in a stable condition, the organics composition was correspondingly stable. Therefore, in terms of the differences in organic composition among the water samples, the PHP (potassium hydrogen phthalate) solution varied less than Group C, and Group C varied less than Group A. Set against the model performance for each group, this indicates that the organic composition of the water samples significantly affects the model performance. The smaller the differences in organic composition among the calibration and validation samples, the higher the accuracy and robustness of the COD estimation models.
In previous studies, the models were calibrated with simulated wastewater samples prepared in a laboratory or with samples from one sampling site,15–17 which is similar to the conditions of the PHP solution and Group C in this study. A method that is calibrated with samples of a specific organic composition is called local calibration.7,18 According to previous reports, locally calibrated models can usually achieve satisfactory performance, benefitting from the relatively consistent organic composition.35,36 Hu et al.36 proposed a local calibration method with samples from four factories, and found that the prediction accuracy was successfully improved. However, if on-line COD monitors were to be installed on rural sewage treatment devices, it would be very difficult to calibrate a COD estimation model for each monitoring site, because in rural regions the number of sewage treatment devices is very large and their locations are scattered. Therefore, the model can only be calibrated with samples from different treatment devices, which is called global calibration.7,18 Because the organic composition of the calibration samples is inconsistent, the model accuracy is usually worse for global calibration than for local calibration. In this study, the COD estimation model was globally calibrated with samples from different treatment devices. The model accuracy was found to be acceptable, with an RMSRE of 0.139. Although the accuracy was lower than that of the chemical oxidation method, it was sufficient for monitoring the effluent of rural sewage treatment devices. Using a global calibration model to predict COD with UV-Vis spectroscopy has thus been verified as a reliable method in practice.
The UV-Vis absorption spectra of all the samples with different COD concentrations are shown in Fig. 3. The difference in absorbance among these samples was very small at wavelengths beyond 400 nm. Furthermore, the absorbance at wavelengths beyond 600 nm dropped to a level small enough to be ignored for all the samples. There were obviously some non-COD-related wavelengths in the range from 200 to 700 nm. The absorbance at these wavelengths should be eliminated during modeling to avoid such interference and to simplify the calculation.
The full UV-Vis absorption spectra of the water samples from Group A were split equally into 30 intervals over the wavelength range from 200 nm to 700 nm. The COD estimation models were established by PLS using the spectra of each interval and of their combinations. Using iPLSR and siPLSR, the best interval combination could be screened from all possible combinations with the same number of intervals. The optimal models for each number of intervals from one to thirty were established with the best interval combinations screened by iPLSR and siPLSR. The model performance was evaluated by the criteria of RMSRE and REV. The RMSRE and REV of the optimal models established with one, two, three, four, five, and all thirty intervals are shown in Table 4. The optimal reduced model was found to be the one established with a combination of three intervals, for which the lowest REV and RMSRE were acquired. The optimal wavelength intervals were located in the near ultraviolet range from 200 nm to 400 nm, namely 251–268 nm, 319–336 nm, and 353–370 nm. It has been confirmed by previous studies that general saturated organic compounds have no absorption in the near ultraviolet region, while those containing conjugated double bonds or a benzene ring have obvious absorption or a characteristic peak in the ultraviolet region.36 The main absorption wavelength of simple aromatic compounds containing a benzene ring is in the range of 250 nm to 260 nm.12,36 Besides, the UV absorbance is positively correlated with the molecular weight of an organic compound.37 Hence, the specific intervals identified by iPLSR and siPLSR could also carry information on the organic composition to some extent.
Intervals used | RMSRE (Cal.) | RMSRE (Val.) | REV (Cal.) | REV (Val.)
---|---|---|---|---
One interval | 0.17 | 0.16 | 0.025 | 0.022
Two intervals | 0.13 | 0.135 | 0.018 | 0.0163
Three intervals | 0.124 | 0.133 | 0.015 | 0.0165
Four intervals | 0.146 | 0.14 | 0.021 | 0.02
Five intervals | 0.152 | 0.148 | 0.022 | 0.024
All data | 0.121 | 0.12 | 0.014 | 0.0135
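The interval screening described above amounts to an exhaustive search over interval combinations. The sketch below shows only that search skeleton, under stated assumptions: a 1 nm wavelength grid from 200 to 700 nm, near-equal contiguous intervals, and a pluggable `score` callback. In the study the score would come from a PLS model evaluated on the selected wavelengths; here `toy_score` is a hypothetical stand-in so that the example runs without any modeling library, and the interval boundaries it produces are not claimed to match those in the paper.

```python
from itertools import combinations
from math import comb

def split_intervals(n_points, n_intervals):
    """Split indices 0..n_points-1 into contiguous, near-equal intervals."""
    bounds = [(i * n_points) // n_intervals for i in range(n_intervals + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(n_intervals)]

def best_combination(intervals, k, score):
    """Return the tuple of k interval indices whose pooled wavelengths
    give the lowest score (lower is better, e.g. an error criterion)."""
    def pooled_score(combo):
        selected = [idx for j in combo for idx in intervals[j]]
        return score(selected)
    return min(combinations(range(len(intervals)), k), key=pooled_score)

# Assumed grid: one absorbance reading per nanometre from 200 to 700 nm.
wavelengths = list(range(200, 701))
intervals = split_intervals(len(wavelengths), 30)

# Size of the search space for 1 to 5 intervals out of 30.
for k in range(1, 6):
    print(k, "interval(s):", comb(30, k), "combinations")

def toy_score(selected):
    """Hypothetical stand-in for a model error: distance of the mean
    selected wavelength from 260 nm. NOT a PLS-based criterion."""
    mean_wl = sum(wavelengths[i] for i in selected) / len(selected)
    return abs(mean_wl - 260.0)

best_single = best_combination(intervals, 1, toy_score)
print("best single interval:", best_single,
      wavelengths[intervals[best_single[0]][0]], "to",
      wavelengths[intervals[best_single[0]][-1]], "nm")
```

The combination counts grow quickly (4060 triples and 142506 quintuples out of 30 intervals), which is why the score function should be cheap or the number of combined intervals kept small.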
For the absorption spectra in a narrowband range, the absorbance at a given wavelength was observed to be linearly correlated with that at adjacent wavelengths, with an R2 of more than 0.95. Therefore, a series of specific wavelengths could be selected, and the absorbance at these wavelengths could replace the spectral data to establish the model without significantly reducing the model accuracy.38 In this study, lasso regression and stepwise regression were used to select the specific wavelengths from the optimal wavelength intervals. Using lasso regression, the specific wavelengths were found to be 251, 356, 357, 362, and 363 nm. Stepwise regression was subsequently implemented, and the minimum AIC was 123.24. The specific wavelengths were finally determined to be 251, 356, and 363 nm. Three peaks were observed at these three wavelengths in the UV-Vis absorption spectra, which might originate from organics with an aromatic structure or conjugated double bonds.39
When the absorbance at specific wavelengths is used for modeling, the number of variables involved decreases sharply, so that some simple algorithms become effective. Based on the three identified specific wavelengths, the Multiple Linear Regression (MLR) method was adopted for the COD estimation modeling. A comparison between the predicted and true COD concentrations is shown in Fig. 4. The model performance was evaluated by the criteria of RMSRE and REV (Fig. 4). Compared with the model using the full UV-Vis absorption spectrum (200–700 nm), the RMSRE only increased from 0.139 to 0.145 and the REV from 0.019 to 0.021, indicating just a slight decline in the accuracy and robustness of the predictions. Li et al.17 optimized a COD determination model using a similar method with 144 samples collected from a lake. In their work, the prediction accuracy and model stability were improved by iPLS and siPLS, which is contrary to the result of this study. In this study, the samples were collected from more than 50 rural sewage treatment devices, so the organic composition would be more diverse. Therefore, the different findings may be ascribed to the differences in organic composition, but this needs further study. Nevertheless, the specific wavelength selection can considerably simplify the modeling process while keeping the model accuracy and robustness at a reliable level. An even more important advantage follows from this optimization: because only a few characteristic wavelengths are involved in the modeling, narrowband light sources built from light emitting diodes (LEDs) could take the place of the broadband light source. Therefore, the price of the monitors would decrease sharply. This would be significant for the practical application of COD measurement by UV-Vis spectroscopy.
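With only three predictors, the MLR step reduces to solving a small least-squares problem. The following is a minimal sketch, not the authors' implementation: it fits an intercept plus one coefficient per wavelength through the normal equations, solved by Gaussian elimination. The absorbance values and the "true" coefficients are hypothetical numbers chosen for illustration, and the data are noise-free so the fit can be checked exactly.

```python
def fit_mlr(X, y):
    """Ordinary least squares with an intercept.
    Solves (X^T X) b = X^T y by Gaussian elimination with partial pivoting.
    Returns [intercept, b1, b2, ...]."""
    rows = [[1.0] + list(x) for x in X]
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (A[i][n] - tail) / A[i][i]
    return coef

def predict(coef, x):
    """Predicted COD for one sample's absorbances."""
    return coef[0] + sum(b * xi for b, xi in zip(coef[1:], x))

# Hypothetical absorbances at 251, 356 and 363 nm for six samples.
absorbance = [
    (0.20, 0.10, 0.02), (0.40, 0.05, 0.08), (0.60, 0.20, 0.04),
    (0.80, 0.08, 0.15), (1.00, 0.25, 0.06), (1.20, 0.12, 0.20),
]
# Hypothetical coefficients used only to generate noise-free COD values.
true_coef = [5.0, 100.0, 40.0, 60.0]
cod = [predict(true_coef, x) for x in absorbance]

coef = fit_mlr(absorbance, cod)
print("fitted coefficients:", [round(b, 4) for b in coef])
print("predicted COD for first sample:", round(predict(coef, absorbance[0]), 2))
```

In real spectra the absorbances at 356 and 363 nm are likely to be strongly correlated, so the normal equations can become ill-conditioned; a QR-based or regularized solver would be a safer choice there.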
The lab-manufactured monitor was installed in the field for monitoring the effluent, and real samples were collected and tested. The predictive results are shown in Table 5. When the true COD concentration ranged from approximately 30.0 to 200.0 mg L−1, the relative error of the predicted COD was mostly within an acceptable range of −4.1% to 6.2%, with occasional outliers. The results indicated that the COD estimation method is effective. In addition, it should be noted that turbidity and particles in the water affect the UV-Vis spectrum and reduce the prediction accuracy.15,40,41 The on-line monitor was used to test the effluent of sewage treatment devices in practice. During the testing period, the turbidity and the particle content of the effluent were constant owing to the good operating conditions. Therefore, the interference was not significant. Nevertheless, the influence of turbidity and particles on the UV-Vis spectrum is crucial for COD prediction and needs to be investigated comprehensively.
True COD (mg L−1) | Predicted COD (mg L−1) | Relative error
---|---|---
32.0 | 30.7 | −4.1%
40.0 | 42.2 | 5.5%
66.0 | 83.6 | 26.7%
79.0 | 83.9 | 6.2%
90.0 | 93.8 | 4.2%
130.0 | 127.3 | −2.1%
154.0 | 160.4 | 4.2%
59.0 | 61.1 | 3.6%
60.0 | 71.4 | 19.0%
199.0 | 193.3 | −2.9%
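The relative errors in Table 5 can be recomputed directly from the true and predicted values, which also makes the outliers easy to flag. The sketch below uses the tabulated data; the 10% threshold used to label a sample as an outlier is an assumption made here for illustration and is not a limit stated in the text.

```python
# True and predicted COD values (mg/L) as listed in Table 5.
true_cod = [32.0, 40.0, 66.0, 79.0, 90.0, 130.0, 154.0, 59.0, 60.0, 199.0]
pred_cod = [30.7, 42.2, 83.6, 83.9, 93.8, 127.3, 160.4, 61.1, 71.4, 193.3]

# Relative error in percent: (predicted - true) / true * 100.
rel_err = [(p - t) / t * 100.0 for t, p in zip(true_cod, pred_cod)]

# Hypothetical threshold for flagging outliers (an assumption).
THRESHOLD_PERCENT = 10.0
outliers = [(t, round(e, 1)) for t, e in zip(true_cod, rel_err)
            if abs(e) > THRESHOLD_PERCENT]

for t, p, e in zip(true_cod, pred_cod, rel_err):
    print(f"true={t:6.1f}  predicted={p:6.1f}  relative error={e:+5.1f}%")
print("samples beyond the threshold:", outliers)
```

With this threshold, the two flagged samples are those with true COD of 66.0 and 60.0 mg L−1; the remaining eight fall between −4.1% and 6.2%.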
One feature of the monitor is that a default spectrum is adopted instead of a reference beam, so that only the measuring beam is retained. The device structure thus becomes much simpler, and the interference due to instability of the reference spectra is avoided. Another useful feature is a wireless data transmission module based on General Packet Radio Service (GPRS) built into the monitor. The real-time on-line UV-Vis spectral data can be transmitted to a cloud platform and processed using the COD estimation model. The predicted COD concentration can then be transmitted to the management platform or to the mobile terminals of the relevant people. The cost of this on-line COD monitor is competitive with that of conventional ones. The prototype cost only USD 5000, which is less than 30% of the price of conventional COD monitors. With bulk production, there is still considerable room to reduce the device cost.
Benefiting from these advantages, including easier maintenance, lower price, low operating cost, and connectivity to the Internet of Things, the developed on-line COD monitor could be installed in the very large number of rural sewage treatment facilities. An information management system could thereby be supported based on the developed COD monitor.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c9ra10732k
‡ Both authors contributed equally to this study.
This journal is © The Royal Society of Chemistry 2020