Nabila Nafsin and Jin Li*
Department of Civil and Environmental Engineering, University of Wisconsin-Milwaukee, WI 53211, USA. E-mail: nnafsin@uwm.edu; li@uwm.edu
First published on 12th December 2022
Urban water undergoes physical and chemical changes due to various contaminants from point sources and non-point sources, including organic matter pollution and fecal bacterial contamination. Machine learning (ML) algorithms can be used as potential tools in surface water quality monitoring due to their capacity of finding underlying patterns and non-linear relationships among water quality parameters, unattainable by traditional or process-based water quality analysis. In this study, several standalone ML models such as artificial neural network (ANN), support vector machine (SVM), gradient boosting machine (GBM), random forest (RF) and ensemble-hybrid models such as RF-SVM, ANN-SVM, GBM-SVM, RF-ANN, GBM-ANN, and RF-GBM were developed for predicting total organic carbon (TOC) and E. coli in the Milwaukee River system. The significance of the study is the application of the ensemble-hybrid models for TOC and bacterial contamination prediction for the first time, which provides a reliable and direct approach to complement existing monitoring techniques in the Milwaukee River system with satisfactory prediction accuracies. The ensemble-hybrid models for TOC prediction resulted in R2 values within a range of 0.95–0.97. However, for E. coli prediction it was difficult to explain the greater amount of unexplained variation in bacterial data based on the physicochemical water quality parameters, resulting in R2 values within a range of 0.29–0.42. The hybrid model ANN-GBM outperformed others for both TOC and E. coli with prediction accuracies of 97% and 42%, respectively. An attempt was made to explain the variability in living microorganism behavior based on specific physicochemical parameters by developing prediction models for E. coli.
Environmental significance
There is a need to improve the water quality monitoring program with an accurate, reliable, and cost-effective method for measuring water quality parameters. Traditional approaches used for measuring water quality parameters are time-consuming and inaccurate due to the inconsistencies between the actual field condition during sampling and the lab environment. Advanced machine learning (ML) techniques have been developed for a more accurate and reliable prediction of water quality. The novelty of this study is the successful application of ensemble-hybrid ML models that were not previously used for TOC and E. coli prediction. These ML models can provide timely prediction on significant changes of TOC and E. coli levels in a river, allowing decisionmakers a quicker response in water management.
This research focuses on the analysis of water quality in terms of TOC and fecal bacterial contamination (E. coli) in natural streams. The study area is located within the Milwaukee River basin which includes three major rivers: Milwaukee River, Menomonee River, and Kinnickinnic River flowing into the harbor of Milwaukee, Wisconsin. The Milwaukee River starts from the north of Wisconsin and flows towards the south in downtown Milwaukee and discharges into Lake Michigan. Menomonee River and Kinnickinnic River are two main tributaries of the Milwaukee River. Urban and agricultural runoff, municipal and industrial point sources, construction site erosion, stream bank erosion, stream and wetland modification, contaminated sediments, and changes in land use are the major contributors to the degradation of water quality of the river system.2,3 The urbanized Milwaukee River basin is greatly affected by urban runoff and stormwater is considered one of the most significant sources of pathogenic microorganisms.3,4
To detect surface water contamination more rapidly and accurately, early warning systems (EWS) and ML techniques have been developed. Nafsin and Li (2021)5 and Nafsin et al., (2022)6 applied statistical event detection software CANARY for the analysis of surface water quality. The application of ML techniques is useful in predicting water quality as the models can provide data-driven decisions by extracting predictive information from a large dataset. Several studies7–10 developed ML models for event detection in water distribution systems. Other studies11–16 investigated the performances of different ML models for predicting water quality parameters of natural source water. ML models have also been developed to predict the water quality index and water quality class.17–23
Several studies explored different ML techniques in predicting TOC to characterize the hydrocarbon potential of source rocks, soil, organic shale, and mudstone.24–28 However, to the best of the authors' knowledge, few studies have developed TOC prediction models for natural streams. Yeon et al. (2008),29 Goz et al. (2019),30 and Kim et al. (2021)31 explored the application of ANN, kernel extreme learning machine, and extreme learning machine models with different activation functions to estimate TOC of rivers. In addition, several studies4,32–37 investigated regression-based techniques for microbial analysis of surface water and groundwater using physicochemical and hydrometeorological parameters. However, development of such predictive models for fecal indicator bacteria (FIB) analysis based on physicochemical and hydrometeorological parameters is site and bacteria group specific. The survival of FIB can be affected by complex interactions among physicochemical and hydrometeorological parameters, and land use patterns of the study area.38
In this study, we applied several standalone and ensemble-hybrid ML algorithms that can potentially be very effective tools in predicting TOC and E. coli in natural streams of the Milwaukee River basin. The developed ensemble-hybrid methods were not previously used for TOC and E. coli prediction and proved to provide a reliable and direct approach to complement existing monitoring techniques in the Milwaukee River system with satisfactory prediction accuracies. Limited studies have been conducted so far for developing TOC prediction models using ML in natural streams. Living microorganism behavior is harder to predict than physical and chemical processes. We developed prediction models for E. coli and efforts were made to explain the variability in living microorganism behavior based on the specific physicochemical parameters and to identify the most influential water quality parameters in predicting E. coli. These algorithms analyzed real-time data of source water, found the underlying pattern in a large volume of data using a mapping function, and identified complex relationships among the outputs and inputs, which are unattainable by traditional or process-based methods for water quality analysis.
This study predicted TOC and E. coli concentration in three major rivers: the Milwaukee River, Menomonee River, and Kinnickinnic River within the Milwaukee River basin during a sampling period of 2000–2020 using ML methods. We developed and evaluated the efficiencies of different regression ML models including ANN, SVM, GBM, RF, and ensemble-hybrid models such as RF-SVM, ANN-SVM, GBM-SVM, RF-ANN, GBM-ANN, and RF-GBM in predicting TOC and E. coli using the specific water quality parameters. Also, identifying the most influential physicochemical parameters in predicting both TOC and E. coli is one of the objectives of the study. More importantly, we conducted a comprehensive assessment of the employed ML techniques and evaluated the differences in model performances for predicting two different outputs (TOC and E. coli) using a specific dataset.
Fig. 1 Water quality monitoring sites of the Milwaukee River, Menomonee River, and Kinnickinnic River in Wisconsin.
ANN is a good approach for regression problems with complex datasets. The model consists of one input layer, one or several hidden layers, and one output layer. The hidden layers include many interconnected units (neurons) arranged with the input vectors to convert them into output using an activation function. In a feed-forward network such as multilayer perceptron (MLP), each unit feeds its output to all the units on the next layer. In this analysis, we used MLP (Fig. 2) with two hidden layers and five units in each layer which resulted in the best model performances for predicting both TOC and E. coli.
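The network described above can be sketched as follows. This is a minimal illustration, assuming scikit-learn (the parameter names used in this article match that library, but the text does not name it); the synthetic data, the input scaling step, and the `alpha` and `max_iter` values are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 samples, 4 input features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)

# Two hidden layers with five units each, as described in the text.
mlp = MLPRegressor(hidden_layer_sizes=(5, 5), activation="relu",
                   solver="lbfgs", alpha=0.1, max_iter=2000, random_state=0)

# Scaling the inputs is an added assumption; MLPs are sensitive to feature scale.
model = make_pipeline(StandardScaler(), mlp)
model.fit(X, y)

# One weight matrix per layer transition: input->hidden, hidden->hidden, hidden->output.
layer_shapes = [w.shape for w in model[-1].coefs_]
print(layer_shapes)
```

Each unit in one layer feeds every unit in the next, which is why the weight matrices are dense and their shapes follow directly from the layer sizes.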
In regression problems, SVM is used as support vector regression (SVR), which finds a hyperplane that fits the data points rather than a decision boundary that separates classes. SVM uses a kernel method (e.g., RBF, sigmoid kernel, linear kernel, and polynomial kernel) that maps the original input data space into a higher dimensional feature space. We used the RBF kernel function and optimized the two key parameters: regularization parameter (C) and kernel width (gamma).
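To make the roles of the kernel and its two parameters concrete, the sketch below evaluates the RBF kernel by hand and fits an RBF-kernel regressor. It assumes scikit-learn; the helper `rbf_kernel`, the synthetic data, and the values `C=10.0` and `gamma=0.5` are illustrative assumptions, not the tuned values from this study.

```python
import numpy as np
from sklearn.svm import SVR

def rbf_kernel(a, b, gamma):
    """RBF kernel value exp(-gamma * ||a - b||^2) for two feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

# Synthetic non-linear target (assumption): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(150, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=150)

# C controls regularization strength; gamma controls the kernel width.
svr = SVR(kernel="rbf", C=10.0, gamma=0.5)
svr.fit(X, y)
r2_train = svr.score(X, y)  # coefficient of determination on the training data
```

A larger `gamma` makes the kernel narrower, so each training point influences a smaller neighborhood; a larger `C` penalizes training errors more heavily.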
RF is an ensemble ML model that combines multiple decision trees to build an effective prediction model. The model makes different random choices to develop several independent trees. The trees are randomized by selecting the data points to build trees and the maximum features in each split test. Each tree in the forest predicts the output, and the final output is determined by averaging the outputs from all the decision trees.
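The averaging step can be checked directly. The sketch below assumes scikit-learn and synthetic data; the hyperparameter values are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 300 samples, 8 input features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=300)

# max_features limits how many features are considered at each split;
# each tree is also grown on a bootstrap sample of the rows.
rf = RandomForestRegressor(n_estimators=50, max_features=5, max_depth=6,
                           random_state=0)
rf.fit(X, y)

# Predict with every tree separately, then average across trees.
per_tree = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = rf.predict(X[:5])
```

Here `manual_average` and `forest_prediction` agree to floating-point precision, which is the averaging rule described above.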
GBM works by building multiple models or decision trees sequentially and reducing the errors from the previous model. Each decision tree takes a portion of the input data and makes predictions. The new models are built over the errors or residuals of the previous predictions. Several decision trees are added iteratively to improve the prediction performance. The degree to which each model is allowed to correct the errors from the previous tree is controlled by the key parameters: number of decision trees and learning rate.
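The sequential fitting of residuals can be written out in a few lines. This is a simplified sketch for squared-error loss, not the implementation used in the study: it assumes scikit-learn decision trees as base learners, uses all rows for every tree (no subsampling), and the function names and parameter values are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=100, learning_rate=0.08, max_depth=2):
    """Fit trees one after another, each on the residuals of the current model."""
    baseline = float(np.mean(y))              # initial constant prediction
    prediction = np.full(len(y), baseline)
    trees, train_mse = [], []
    for _ in range(n_trees):
        residuals = y - prediction            # errors of the model so far
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # The learning rate limits how much each tree may correct.
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
        train_mse.append(float(np.mean((y - prediction) ** 2)))
    return baseline, trees, train_mse

def predict_boosted(X, baseline, trees, learning_rate=0.08):
    """Sum the scaled contributions of all trees on top of the baseline."""
    return baseline + learning_rate * sum(tree.predict(X) for tree in trees)

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=300)

baseline, trees, train_mse = fit_boosted_trees(X, y)
```

With a small learning rate each tree makes only a partial correction, so more trees are needed; this is the trade-off between the number of trees and the learning rate mentioned above.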
In addition, several ensemble-hybrid models such as RF-SVM, ANN-SVM, GBM-SVM, RF-ANN, GBM-ANN, and RF-GBM were developed by integrating the standalone traditional ML algorithms. An ensemble meta estimator ‘voting regressor’ was used to fit the dataset on each contributing standalone model in this hybridization process. The final prediction of the hybrid model was determined by averaging the individual prediction of each standalone model. The contributing models were optimized to achieve the best performance of the ensemble-hybrid model. The generalization performances of ML models were improved by the model's parameter tuning. We used the grid search and five-fold cross-validation method and examined possible combinations of the hyperparameters that control the learning process.
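A hybrid of this kind, tuned with grid search and five-fold cross-validation, might look like the sketch below. It assumes scikit-learn's `VotingRegressor` and `GridSearchCV`; the RF-GBM pairing, the synthetic data, the small parameter grid, and the RMSE-based scoring are illustrative assumptions rather than the study's actual search space.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2.0 * X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

# Two standalone models combined by unweighted averaging.
hybrid = VotingRegressor([
    ("rf", RandomForestRegressor(random_state=0)),
    ("gbm", GradientBoostingRegressor(random_state=0)),
])

# Hyperparameters of the contributing models are addressed as <name>__<param>.
param_grid = {
    "rf__n_estimators": [20, 40],
    "gbm__n_estimators": [50],
    "gbm__learning_rate": [0.08, 0.2],
}
search = GridSearchCV(hybrid, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

best = search.best_estimator_
# The hybrid prediction is the mean of the member predictions.
member_mean = np.mean([m.predict(X[:5]) for m in best.estimators_], axis=0)
```

Tuning the members inside the ensemble, rather than separately, selects the combination that minimizes the cross-validated error of the averaged prediction.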
Parameter | Unit | Minimum | Maximum | Mean | Standard deviation | Coefficient of variation % |
---|---|---|---|---|---|---|
TS | mg L−1 | 100.00 | 8400.00 | 633.57 | 412.17 | 65.05 |
TSS | mg L−1 | 0.80 | 2700.00 | 20.18 | 60.37 | 299.20 |
VSS | mg L−1 | 0.30 | 260.00 | 5.07 | 8.49 | 169.71 |
Chlorophyll a | mg m−3 | 0.08 | 280.00 | 9.83 | 15.68 | 159.45 |
Turbidity | NTU | 0.60 | 378.00 | 14.19 | 23.49 | 165.52 |
pH | — | 5.50 | 9.93 | 8.05 | 0.38 | 4.73 |
EC | μS cm−1 | 101.00 | 15600 | 1045.18 | 765.44 | 73.23 |
Temperature | °C | −0.50 | 33.44 | 14.12 | 7.92 | 56.12 |
DO | mg L−1 | 0.00 | 25.70 | 9.88 | 3.36 | 33.97 |
Nitrate | mg L−1 | 0.00 | 4.30 | 0.81 | 0.56 | 69.68 |
Alkalinity | mg L−1 | 4.50 | 440.00 | 223.38 | 71.04 | 31.80 |
Total P | mg L−1 | 0.00 | 2.60 | 0.12 | 0.11 | 88.38 |
Chloride | mg L−1 | 5.00 | 3100.00 | 174.84 | 210.51 | 120.40 |
BOD5 | mg L−1 | 0.00 | 310.00 | 3.98 | 12.00 | 301.03 |
TOC | mg L−1 | 0.67 | 190.00 | 8.25 | 7.27 | 88.13 |
DOC | mg L−1 | 0.52 | 190.00 | 7.89 | 7.04 | 89.27 |
E. coli | MPN/100 mL | 0.00 | 250000 | 3643.51 | 16461.95 | 451.82 |
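The last column of the table above follows from the two columns before it: the coefficient of variation is the standard deviation divided by the mean, expressed as a percentage. A short check on three rows is sketched below; the helper name is hypothetical, and small differences from the tabulated values are expected because the table's means and standard deviations are themselves rounded.

```python
def coefficient_of_variation(mean, std):
    """Coefficient of variation in percent: 100 * std / mean."""
    return 100.0 * std / mean

# (mean, standard deviation) pairs copied from the table above.
table1 = {
    "TS": (633.57, 412.17),
    "TOC": (8.25, 7.27),
    "E. coli": (3643.51, 16461.95),
}
cv = {name: round(coefficient_of_variation(m, s), 2)
      for name, (m, s) in table1.items()}
print(cv)
```

The very large value for E. coli reflects a standard deviation several times the mean, which is the high variability discussed later for the bacterial data.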
We performed Pearson's correlation analysis at a 0.05 level of significance to identify the input parameters that could impact the output variables (TOC and E. coli). The results in Table 2 indicated significant correlations (p-value < 0.05) between TOC and input variables. Among the parameters, DOC was strongly positively correlated with TOC (R-value = 0.975, p < 0.05). BOD5 had a moderately strong linear correlation with TOC (R-value = 0.725, p < 0.05). TS, chloride, and EC had weak linear correlations with R-values of 0.423, 0.408, and 0.405, respectively. Temperature and pH were negatively correlated with TOC. Other parameters had very weak or no linear correlation with TOC. Similarly, Table 3 shows the Pearson's correlation coefficient between E. coli and other input variables. The results indicated that the physicochemical parameters had weak (R-value < 0.3) or no linear correlations with E. coli.
Parameter | Correlation coefficient | P-Value | Parameter | Correlation coefficient | P-Value |
---|---|---|---|---|---|
TS | 0.423 | 0.000 | Temperature | −0.147 | 0.000 |
TSS | 0.036 | 0.005 | DO | −0.018 | 0.003 |
VSS | 0.091 | 0.000 | Nitrate | 0.055 | 0.000 |
Chlorophyll | −0.010 | 0.032 | Alkalinity | 0.048 | 0.000 |
Turbidity | 0.013 | 0.039 | Chloride | 0.408 | 0.000 |
pH | −0.105 | 0.000 | BOD5 | 0.725 | 0.000 |
EC | 0.405 | 0.000 | DOC | 0.975 | 0.000 |
Depth | −0.054 | 0.000 | E. coli | 0.058 | 0.000 |
Total P | 0.089 | 0.000 |
Parameter | Correlation coefficient | P-Value | Parameter | Correlation coefficient | P-Value |
---|---|---|---|---|---|
TS | −0.056 | 0.000 | Temperature | 0.108 | 0.000 |
TSS | 0.110 | 0.000 | DO | −0.136 | 0.000 |
VSS | 0.178 | 0.000 | Nitrate | −0.049 | 0.000 |
Chlorophyll | −0.001 | 0.027 | Alkalinity | −0.247 | 0.000 |
Turbidity | 0.151 | 0.000 | Chloride | −0.034 | 0.009 |
pH | −0.131 | 0.000 | BOD5 | 0.079 | 0.000 |
EC | −0.076 | 0.000 | DOC | 0.047 | 0.000 |
Depth | 0.001 | 0.048 | TOC | 0.058 | 0.000 |
Total P | 0.258 | 0.000 |
a TS: total solids; TSS: total suspended solids; VSS: volatile suspended solids; EC: electrical conductivity; Total P: total phosphorus; DO: dissolved oxygen; BOD5: 5 day biochemical oxygen demand; DOC: dissolved organic carbon; TOC: total organic carbon.
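The correlation screening reported in the two tables above can be sketched as follows, assuming SciPy's `pearsonr`. The data are synthetic stand-ins, and the variable names and the helper `screen` are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-in data (assumption): one input that tracks the target
# closely and one that is unrelated to it.
rng = np.random.default_rng(0)
doc = rng.gamma(shape=2.0, scale=4.0, size=500)
toc = doc + rng.normal(scale=0.5, size=500)
ph = rng.normal(loc=8.0, scale=0.4, size=500)

def screen(inputs, target, alpha=0.05):
    """Return (name, r, p, significant) rows sorted by |r|, largest first."""
    rows = []
    for name, values in inputs.items():
        r, p = pearsonr(values, target)
        rows.append((name, float(r), float(p), bool(p < alpha)))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)

results = screen({"DOC": doc, "pH": ph}, toc)
for name, r, p, significant in results:
    print(f"{name}: r={r:.3f}, p={p:.3g}, significant={significant}")
```

With thousands of observations even very small coefficients can pass the 0.05 significance test, which is why the tables list significant p-values alongside correlations close to zero.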
To improve the model performance efficiency, we extracted the important predictors from the feature importance chart, eliminated the predictors with lower scores, and developed models using only the important features. The results indicated that the model performance was significantly improved with higher accuracy (accuracy 90%) and lower values of error with the input variable combination of BOD5, DOC, EC, chloride, TS, nitrate, VSS, DO, turbidity, pH, and TSS. Based on the analysis, we selected the input combination of 11 water quality parameters out of the 18 parameters that had comparatively higher feature importance scores to develop TOC prediction models. For selecting the best combination of input variables for TOC prediction models, we evaluated the models based on RMSE scores with seven different combinations (category 1–category 7) of input variables as shown in Table 4. The models were trained using each of the categories of input variables and evaluated on the test set to identify the best combination of input variables that resulted in the lowest RMSE scores. For each category of input variables, the models were optimized to find the hyperparameters that resulted in the lowest RMSE scores. In category 1, all of the input variables were considered for predicting the output, while categories 2 to 7 include the variables that had relatively higher feature importance scores. The results indicate that for both RF and GBM models, the lowest RMSE scores were found for category 2 (BOD5, DOC, EC, chloride, TS, nitrate, VSS, DO, turbidity, pH, and TSS). For ANN, category 3 (BOD5, DOC, EC, chloride, TS, nitrate, VSS) had the lowest RMSE score of 1.601, which is close to the RMSE score for category 2 (1.669). Based on the analysis, category 2 was considered the optimal combination of input variables that had the lowest RMSE scores for most of the ML models.
Due to the different structures of the ML models contributing to the variations in their learning process and prediction capability, the best combination of input variables was not consistent for them.
Category | Input variable combinations | RMSE (ANN) | RMSE (SVM) | RMSE (RF) | RMSE (GBM) |
---|---|---|---|---|---|
1 | DOC, BOD5, EC, chloride, TS, nitrate, VSS, DO, turbidity, pH, TSS, Total P, E. coli, temperature, alkalinity, chlorophyll, and depth | 2.917 | 2.469 | 3.135 | 2.485 |
2 | BOD5, DOC, EC, chloride, TS, nitrate, VSS, DO, turbidity, pH, and TSS | 1.669 | 2.297 | 2.596 | 2.077 |
3 | BOD5, DOC, EC, chloride, TS, nitrate, and VSS | 1.601 | 2.675 | 3.114 | 2.450 |
4 | BOD5, DOC, EC, chloride, TS, and nitrate | 2.191 | 2.636 | 2.944 | 2.110 |
5 | BOD5, DOC, EC, chloride, and TS | 1.715 | 2.216 | 2.963 | 2.577 |
6 | BOD5, DOC, EC, and chloride | 2.415 | 2.662 | 2.696 | 2.670 |
7 | BOD5, DOC, and EC | 2.732 | 2.461 | 2.831 | 2.110 |
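The selection procedure behind the table above (rank features by importance, then compare test-set RMSE for progressively smaller feature subsets) can be sketched as follows. The example assumes scikit-learn, synthetic data in which only two features carry signal, and a random forest as the evaluated model; the feature names and subset sizes are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): six features, only f0 and f1 matter.
rng = np.random.default_rng(0)
names = ["f0", "f1", "f2", "f3", "f4", "f5"]
X = rng.normal(size=(400, 6))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.1 * rng.normal(size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Rank features by impurity-based importance from a random forest.
ranker = RandomForestRegressor(n_estimators=100, random_state=0)
ranker.fit(X_train, y_train)
order = np.argsort(ranker.feature_importances_)[::-1]

def rmse_for_top_k(k):
    """Train on the k highest-ranked features and return the test-set RMSE."""
    cols = order[:k]
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train[:, cols], y_train)
    pred = model.predict(X_test[:, cols])
    return float(np.sqrt(mean_squared_error(y_test, pred)))

# Each subset size plays the role of one "category" of input variables.
scores = {k: rmse_for_top_k(k) for k in (6, 4, 2)}
print([names[i] for i in order], scores)
```

In the study each category was additionally re-tuned for every model type, which is omitted here for brevity.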
For the prediction of E. coli, although Pearson’s analysis indicated a poor correlation between E. coli and other parameters, the feature importance analysis computed from the RF algorithm was able to capture non-linear relationships between the input and output. The result indicated that BOD5 was the most important variable for predicting bacteria with a feature importance score of 0.13 (Fig. 4). Other influential variables were DO, Total P, temperature, turbidity, and nitrate with feature scores of 0.12, 0.10, 0.09, 0.08, and 0.07, respectively. The feature importance scores were comparatively lower for E. coli prediction than for TOC. For developing prediction models of E. coli, only the predictors (BOD5, DO, total phosphorus, temperature, turbidity, nitrate, and alkalinity) that had relatively higher feature importance scores were selected. For selecting the input variables for E. coli prediction, the models were evaluated based on the RMSE scores and the best combination of input variables was selected as category 3 (BOD5, DO, Total P, temperature, turbidity, nitrate, and alkalinity) that resulted in the lowest RMSE for all the models as shown in Table 5.
Category | Input variable combinations | RMSE (ANN) | RMSE (SVM) | RMSE (RF) | RMSE (GBM) |
---|---|---|---|---|---|
1 | DOC, BOD5, EC, chloride, TS, nitrate, VSS, DO, turbidity, pH, TSS, Total P, TOC, temperature, alkalinity, chlorophyll, and depth | 5873.52 | 6313.48 | 12257.70 | 12598.07 |
2 | BOD5, DO, Total P, temperature, turbidity, nitrate, alkalinity, VSS, and pH | 6612.67 | 6364.12 | 12599.84 | 13030.61 |
3 | BOD5, DO, Total P, temperature, turbidity, nitrate, and alkalinity | 5820.09 | 6184.76 | 12040.45 | 12221.30 |
4 | BOD5, DO, Total P, temperature, and turbidity | 6315.40 | 6477.71 | 13038.27 | 12801.73 |
5 | BOD5, DO, Total P, and temperature | 6796.09 | 6730.16 | 12970.40 | 13015.61 |
6 | BOD5, DO, and Total P | 6583.24 | 6829.70 | 13788.59 | 13500.69 |
The result indicated that the E. coli concentration was associated with BOD5 and DO in surface water. With the increasing growth of microorganisms, the rate of decomposition of organic matter also increases, which results in dissolved oxygen level depletion and an increased BOD level in water.39 The growth of bacteria depends on the availability of nutrients (e.g., nitrogen and phosphorus) and the appropriate temperature. Turbidity can also affect microbial growth in water. Bacteria can attach to the surface of particulate, turbidity-causing material, which influences the inactivation of microorganisms.40 Alkalinity also contributes to some extent to the prediction of bacteria levels. A previous study41 shows that an appropriate alkaline environment can effectively inhibit the growth of microorganisms through the inactivation of ATP synthesis. Although the influence of the physical and chemical parameters on the prediction of E. coli was poor, the decision tree-based ML algorithm was able to extract data-driven information about the non-linear relationships that could exist between the inputs and output.
Model's key parameters (optimal values) | ANN | SVM | RF | GBM | RF-SVM | GBM-SVM | RF-GBM | RF-ANN | GBM-ANN | ANN-SVM |
---|---|---|---|---|---|---|---|---|---|---|
Hidden layer | (5,5) | – | – | – | – | – | – | (4,4) | (4,4) | (5,5) |
Activation | ‘relu’ | – | – | – | – | – | – | ‘relu’ | ‘relu’ | ‘relu’ |
Alpha | 0.1 | – | – | – | – | – | – | 0.005 | 0.01 | 0.22 |
Solver | ‘lbfgs’ | – | – | – | – | – | – | ‘lbfgs’ | ‘lbfgs’ | ‘lbfgs’ |
Kernel | – | ‘rbf’ | – | – | ‘rbf’ | ‘rbf’ | – | – | – | ‘rbf’ |
C | – | 500 | – | – | 500 | 700 | – | – | – | 650 |
Gamma | – | 0.001 | – | – | 0.001 | 0.0001 | – | – | – | 0.0001 |
n-Estimators | – | – | 150 | 100 | 150 | 500 | 100 | 50 | 250 | – |
Max-features | – | – | 8 | – | 8 | – | 5 | 8 | – | – |
Max-depth | – | – | 3 | 2 | 6 | 3 | 6 | 5 | 3 | – |
Learning rate | – | – | – | 0.08 | – | 0.085 | 0.08 | – | 0.085 | – |
Model's key parameters (optimal values) | ANN | SVM | RF | GBM | RF-GBM | RF-ANN | GBM-ANN | ANN-SVM |
---|---|---|---|---|---|---|---|---|
Hidden layer | (5,5) | – | – | – | – | (4,4) | (4,4) | (5,5) |
Activation | ‘relu’ | – | – | – | – | ‘relu’ | ‘relu’ | ‘relu’ |
Alpha | 0.025 | – | – | – | – | 0.02 | 1.1 | 0.6 |
Solver | ‘lbfgs’ | – | – | – | – | ‘lbfgs’ | ‘lbfgs’ | ‘lbfgs’ |
Kernel | – | ‘rbf’ | – | – | – | – | – | ‘rbf’ |
C | – | 500 | – | – | – | – | – | 500 |
Gamma | – | 0.05 | – | – | – | – | – | 0.06 |
n-Estimators | – | – | 200 | 200 | 10 | 12 | 100 | – |
Max-features | – | – | 8 | – | 4 | 4 | – | – |
Max-depth | – | – | 9 | 2 | 9 | 11 | 5 | – |
Learning rate | – | – | – | 0.08 | 0.06 | – | 0.08 | – |
Table 8 shows the performances of the four standalone and six hybrid ML models developed for TOC prediction based on the R2-value, RMSE, MSE, and MAE. The results indicated that the standalone ML models had prediction accuracies within a range of 89.9–95.8%, indicating that the models performed satisfactorily in predicting TOC and that the models are considered a good fit for the dataset. We also developed ensemble-hybrid models that further improved the prediction performances, with accuracies ranging from 94.8% to 97.0%. Based on the R2-value and errors, the developed models were ranked from the best performing to the worst performing model as shown in Table 8. Among the developed TOC prediction models, the four best-performing models were the ensemble-hybrid models ANN-GBM, SVM-GBM, ANN-SVM, and ANN-RF with prediction accuracies greater than 96%. The hybrid model ANN-GBM outperformed others with an R2 value of 0.97, MAE of 0.664, MSE of 2.334, and RMSE of 1.528 when using the selected input features computed from the feature importance analysis. The performance metrics indicated that the employed regression models can efficiently predict TOC based on the combination of input features: BOD5, DOC, EC, chloride, TS, nitrate, VSS, DO, turbidity, pH, and TSS. The correlation between the actual and predicted TOC for the employed ML models is presented in Fig. 5 and 7. From the scatter plots we observed that most of the data points lay close to the regression line, which indicates the percentage of variance in the output explained by the input variables. The time variation graphs in Fig. 5–7 indicate that the prediction models exhibited a small deviation between the predicted and actual value for each sample observation of the test set, verifying a good generalization capacity of the models with satisfactory R2 values.
Algorithms | MAE (a) | MSE (a) | RMSE (a) | R2 | Rank |
---|---|---|---|---|---|
ANN | 0.750 | 2.788 | 1.669 | 0.958 | 5 |
GBM | 0.718 | 4.315 | 2.077 | 0.936 | 8 |
SVM | 0.807 | 5.276 | 2.297 | 0.921 | 9 |
RF | 1.177 | 6.739 | 2.596 | 0.899 | 10 |
ANN-GBM | 0.664 | 2.334 | 1.528 | 0.970 | 1 |
SVM-GBM | 0.652 | 2.366 | 1.538 | 0.965 | 2 |
ANN-SVM | 0.672 | 2.394 | 1.547 | 0.964 | 3 |
ANN-RF | 0.703 | 2.626 | 1.620 | 0.961 | 4 |
SVM-RF | 0.722 | 2.888 | 1.699 | 0.957 | 6 |
RF-GBM | 0.738 | 3.514 | 1.875 | 0.948 | 7 |
(a) The units for MAE, MSE, and RMSE are in ‘mg L−1’.
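For reference, the four metrics in the table above can be computed as in the sketch below. The function name and the four-sample example are hypothetical; the formulas are the standard definitions, with RMSE being the square root of MSE and R2 being one minus the ratio of the residual to the total sum of squares.

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE, and R2 for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical TOC values in mg/L, for illustration only.
actual = [5.0, 7.5, 9.0, 12.0]
predicted = [5.5, 7.0, 9.5, 11.0]
metrics = regression_metrics(actual, predicted)

# Consistency check on the tabulated ANN-GBM row: RMSE is the square root of MSE.
ann_gbm_rmse = round(math.sqrt(2.334), 3)
print(metrics, ann_gbm_rmse)
```

The same square-root relationship holds for every row of the table, which is a quick way to check the reported MSE and RMSE columns against each other.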
In the hybridization process of model development, two base algorithms were integrated to develop a model with greater flexibility and higher prediction accuracy than the standalone models. For example, the standalone ANN and GBM models had TOC prediction accuracies of 95.8% and 93.6%, respectively. Although both standalone models performed satisfactorily, hybridizing ANN with the GBM algorithm further improved the TOC prediction accuracy to 97.0%. When developing the ANN-GBM model, the hyperparameters for both algorithms (hidden layer size, number of nodes, activation function, solver, alpha, learning rate, n_estimators, and max_depth) were defined and optimized to achieve the best performance of the hybrid model. The ensemble meta-estimator ‘VotingRegressor’ was applied to fit each standalone algorithm on the dataset, and the final prediction was determined by averaging the individual predictions of the contributing models. The ensemble-hybrid model ANN-GBM outperformed others because of the significant advantages of ANN over other regression models, i.e. ANN's ability to learn and model complex non-linear relationships between the dependent and independent variables and establish all possible interactions between the dependent variables without requiring assumptions about data properties, data distribution, or a specific hypothesis for testing. The ANN model benefitted from the mathematical functions of hidden layers consisting of neurons that assigned weights to the inputs, directed them to an activation function, and performed specific non-linear transformations of the input data. The activation function allowed complex functional mapping between the network's input and output for a dataset with non-linear relationships.
In addition, the boosting mechanism of GBM with properly optimized hyperparameters allowed individual decision trees to be built one at a time, each learning from the mistakes of the previous trees, so that the overall performance improved sequentially with each iteration. GBM overcame the errors of decision trees by using gradients in the loss function and optimizing the model's coefficients to fit the underlying data. The incorporation of a boosting mechanism along with the non-linear transformation of the input data using an activation function allowed the extraction of specific patterns from the data and minimized the difference between the actual and the predicted output, resulting in a more powerful ensemble-hybrid model ANN-GBM.
Learning curves were developed to indicate whether the models were a good fit, underfit, or overfit, based on the performance of the training and validation sets. The plots of learning curves in Fig. 8 indicate the learning and generalization performance of the standalone models: RF, GBM, and ANN over experience. A good fit model is represented by a training and validation score that approaches a point of stability with a minimal gap between these two scores. The score should be lower on the training set than the validation set, creating a generalization gap between the two curves. The MSE score was used to evaluate the performances of the models for the specific training size. For RF and GBM, the training score and validation score moved toward a stable point with a small gap between the curves which decreased with the increase in training size. When the training size increased to 3000, the training MSE remained constant, while the validation MSE started decreasing significantly. Similarly, the learning curves for the ANN model indicated that the model generalized well with the specific training size, and the training score and cross validation score converged at the point of stability with a minimum training size of 4000. The results indicated that the models generalized well on the validation set with a training size of 4482 sample observations and were considered as good fit models with lower MSE scores.
Fig. 8 Learning curves indicating performances of RF, GBM, and ANN (MLP) models for TOC prediction based on the MSE score with varying training size.
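Learning curves of this kind can be produced as in the sketch below, which assumes scikit-learn's `learning_curve` with five-fold cross-validation and MSE scoring. The synthetic data, the GBM settings, and the training-size grid are illustrative assumptions and do not reproduce the study's dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] + np.sin(X[:, 1]) + 0.2 * rng.normal(size=400)

model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.08,
                                  max_depth=2, random_state=0)

# Scores are negated MSE, so flip the sign to recover MSE.
sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, train_sizes=np.linspace(0.2, 1.0, 5),
    scoring="neg_mean_squared_error")
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

# Generalization gap between validation and training error at each size.
gap = val_mse - train_mse
for n, tr, va in zip(sizes, train_mse, val_mse):
    print(f"n={n}: train MSE={tr:.3f}, validation MSE={va:.3f}")
```

A good fit shows both curves flattening as the training size grows, with a small positive gap between validation and training MSE.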
Prediction models for E. coli were developed using four standalone and four hybrid ML algorithms. Initially, the models were developed using all input features. To reduce the number of redundant features and improve performance efficiency, the important variables with relatively higher feature importance scores were considered. The prediction performances of the developed ML models with all input features and with the selected important features (BOD5, DO, total phosphorus, temperature, turbidity, nitrate, and alkalinity) are shown in Table 9. The results indicated that the R2 values ranged from 0.26–0.40 when using all input variables, while with the selected important features the prediction performances of the models were found to be within a range of 0.29–0.42. Among the developed ML models, the hybrid model ANN-GBM exhibited the highest prediction accuracy of 42%.
From the statistical analysis of microbial data, we observed that E. coli concentrations varied significantly during the twenty-year sampling period. The ML models performed poorly because of the high variation in bacterial data, and it was difficult to explain such variability based on the input variables of physicochemical water quality parameters. Also, no strong linear correlation was found between E. coli and input variables. The ML algorithms were able to explain some percentage of variability in the data by extracting useful data-driven information about the existing hidden non-linear relationships between the output and input variables. However, as ML models were used as black boxes in predicting the output, there was little understanding of how the models explained such variability and arrived at a prediction with prediction accuracies within a range of 29–42%. In addition, we observed higher values of MAE for the prediction models. The MAE measures the average absolute difference between the actual and predicted values. As the measured values spanned a wide range with a maximum value of 250000 MPN/100 mL, the difference between the actual and predicted values was also found to be higher. The models' performances might be improved if, besides the physicochemical parameters, other hydrometeorological variables such as air temperature, air humidity, atmospheric pressure, precipitation level, and stormwater runoff flow were available during the sampling period to consider as inputs to the models. Because of the unavailability of the hydrometeorological data for the corresponding E. coli concentrations, we only used the available physicochemical parameters measured by MMSD for the river system.
A previous study29 applied neural network models (LMNN and MDNN), and an adaptive neuro-fuzzy inference system (ANFIS), which is a combination of a neural network and fuzzy system, for forecasting TOC in a river in South Korea. The study used continuous water quality monitoring data of DO, water temperature, discharge, and TOC from a specific monitoring station. The results indicated that the R-values ranged from 0.489 to 0.783 for the prediction models. Also, the result showed that the ANN model was better with an R-value of 0.743 than the conventional model used in that study. Kim et al. (2021)31 developed standalone (MARS and M5Tree) and hybrid models (CEEMDAN-MARS, CEEMDAN-M5Tree, MARS-CSA, and CEEMDAN-MARS-CSA) for predicting TOC in a river using water quality data of pH, electrical conductivity, temperature, DO, COD, and suspended solids (SS) that were collected from two monitoring stations of that river. The R-values ranged from 0.458–0.728 for the standalone models, while for the hybrid models the range was found to be 0.539–0.762. The CEEMDAN-MARS-CSA model (CEEMDAN: complete ensemble empirical mode decomposition with adaptive noise) was found to be the most accurate in predicting TOC with a correlation coefficient of 0.762. To the best of the authors' knowledge, no studies have been conducted so far that employed the specific ensemble-hybrid models for TOC prediction that we developed in this study. The novel ensemble-hybrid models in our study exhibited a high prediction accuracy of greater than 96% (R2 > 0.96). Previous studies29–31 applied different ML models for predicting TOC. However, they did not apply any tree-based RF and GBM algorithms that proved to be effective in our study, especially when ensembled with other algorithms. Most of the previous studies were conducted within a specific location and from one or two monitoring stations.
In this study, we used several water quality parameters and data with large spatial and temporal variations from 32 monitoring stations of three different rivers during a twenty-year sampling period. Also, we performed an analysis of feature importance using a tree-based algorithm that identified the most important parameters for predicting the output, while other studies29–31 considered the water quality parameters that were available (Kim et al. (2021):31 pH, electrical conductivity, temperature, DO, COD, and SS; Goz et al. (2019):30 pH, temperature, conductivity, and turbidity; Yeon et al. (2008):29 DO, temperature, and discharge) as inputs for TOC prediction models without an analysis of feature importance.
For the prediction of E. coli, several studies32–35 investigated regression-based techniques for microbial analysis of surface water and groundwater using both physicochemical and hydrometeorological parameters. He et al. (2008)34 applied ANN for the prediction of total coliform, fecal coliform, and Enterococci using pH, conductivity, water temperature, rainfall, wave height, tide height, and flow rate as inputs to the models, and the models resulted in R2 values within a range of 0.620–0.883. However, in this study, we considered only the physicochemical water quality parameters to explain the variability in living microorganism behavior through the independent water quality parameters. Our goal was to investigate how accurately the microbial concentration can be predicted (using ML algorithms) from the water quality parameters alone, without considering physical characteristics such as flow, velocity, river width, etc. The results of our study indicated that for E. coli prediction, because of the high variability of the bacterial data, it was difficult to account for the large amount of variation in the dataset based on the available physicochemical parameters, resulting in relatively lower R2 values within a range of 0.37–0.42 for the ensemble-hybrid models.
In this study, we developed and evaluated the efficiency of several standalone and hybrid ML models for the prediction of TOC and E. coli in the major rivers of the Milwaukee River basin. Also, we identified the most influential parameters in predicting TOC and E. coli by interpreting a large water quality dataset. For TOC prediction, the most influential variables were identified as BOD5, DOC, EC, chloride, TS, nitrate, VSS, DO, turbidity, pH, and TSS, while for E. coli prediction the decision-tree-based algorithm indicated BOD5, DO, total phosphorus, temperature, turbidity, nitrate, and alkalinity as the relatively important features. The feature importance scores of the input variables for E. coli prediction were lower than those for TOC prediction. However, the ML algorithm was able to extract useful data-driven information about the hidden complex non-linear relationships between the bacteria concentration and other physicochemical parameters and indicated BOD5, DO, and total phosphorus to be the most influential parameters for predicting E. coli. With a specific dataset, the ML models performed satisfactorily for TOC prediction with high prediction accuracies of greater than 96%. However, for E. coli prediction, as living microorganism behavior is harder to predict than physicochemical parameters, and because a greater amount of variation in the dataset could not be explained based on the available physicochemical parameters, the models resulted in relatively lower R2 values. The results indicated that for both TOC and E. coli prediction with a specific dataset, ANN-GBM outperformed the other models with prediction accuracies of 97% (Table 8) and 42% (Table 9), respectively. The reason is that the hybrid model benefitted from the activation function of the ANN, which performs non-linear transformations of the input data and allows complex functional mapping between the network's input and output.
In addition, the errors in the predictions of the individual trees developed by the GBM algorithm were reduced by the boosting mechanism and by optimizing the coefficients to fit the underlying data. The incorporation of a boosting mechanism along with the non-linear transformation of the input data using an activation function allowed the extraction of specific patterns from the data and minimized the difference between the actual and the predicted output, resulting in a more powerful ensemble-hybrid model, ANN-GBM. The results indicate that the ANN algorithm alone achieved R2 values of 0.958 and 0.38 for TOC and E. coli, respectively. However, when it was ensembled with other algorithms, for example, the tree-based algorithms GBM and RF, performance accuracies were further improved for ANN-GBM (TOC: 0.970, E. coli: 0.42) and ANN-RF (TOC: 0.961, E. coli: 0.41). In addition, similar performances were noticed for most of the employed ensemble-hybrid models.
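Why combining two learners can raise R2 above either one alone can be illustrated with a minimal sketch. It assumes a simple weighted average of the two models' predictions as the combination rule; this section does not state how the hybrid models were actually combined, so the function `average_ensemble`, its `weight_a` parameter, and all numbers below are hypothetical. The synthetic predictions were deliberately constructed so that the two models err in opposite directions, which is the favorable case; when the base models make similar errors, the gain from averaging would be much smaller.

```python
def average_ensemble(predictions_a, predictions_b, weight_a=0.5):
    """Blend two prediction lists with a convex weight (assumed combination rule)."""
    if len(predictions_a) != len(predictions_b):
        raise ValueError("prediction lists must have equal length")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(predictions_a, predictions_b)]


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Synthetic observations and base-model outputs (illustrative only).
observed = [2.0, 4.0, 6.0, 8.0, 10.0]
model_a = [2.6, 3.5, 6.8, 7.4, 10.9]  # stand-in for one base learner
model_b = [1.5, 4.6, 5.1, 8.7, 9.2]   # stand-in for a second base learner

blended = average_ensemble(model_a, model_b)

r2_a = r_squared(observed, model_a)
r2_b = r_squared(observed, model_b)
r2_blend = r_squared(observed, blended)

print(f"R2 model A:  {r2_a:.4f}")
print(f"R2 model B:  {r2_b:.4f}")
print(f"R2 blended:  {r2_blend:.4f}")
```

In practice the weight would be chosen on held-out validation data, not on the data used to report the final score, so that the reported accuracy is not optimistically biased.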
This journal is © The Royal Society of Chemistry 2023