Setare Loh Mousavi and
S. Maryam Sajjadi*
Faculty of Chemistry, Semnan University, Semnan, Iran. E-mail: sajjadi@semnan.ac.ir; Fax: +98 23 33384110; Tel: +98 23 31533192
First published on 8th August 2023
In this work, a quantitative structure–activity relationship (QSAR) study was performed on a set of emerging contaminants (ECs) to predict their rejections by reverse osmosis membrane (RO). A wide range of molecular descriptors was calculated by Dragon software for 72 ECs. The QSAR data was analyzed by an artificial neural network method (ANN), in which four out of 3000 theoretical molecular descriptors were chosen and their significance was computed based on the Garson method. The significance trends of descriptors were as follows in descending order: ESpm14u > R2e > SIC1 > EEig03d. The selected descriptors were ranked based on their importance and then an explorative study was conducted on the QSAR data to show the trends in molecular descriptors and structures toward the rejections values of ECs. The MLR algorithm was used to make a linear model and the results were compared with those of the nonlinear ANN algorithm. The comparison results revealed it is necessary to apply the ANN model to this data with non-linear properties. For the whole dataset, the correlation coefficient (R2) and residual mean squared error (RMSE) of the ANN and MLR methods were 0.9528, 6.4224; and 0.8753, 11.3400, respectively. The comparison results showed the superiority of ANN modeling in the analysis of ECs' QSAR data.
Biological treatment strategies include two types of processes such as aerobic and anaerobic. Some common aerobic technologies are membrane bioreactors, active sludge, and a sequencing batch reactor. Anaerobic treatments include anaerobic film reactors and anaerobic sludge reactors.10,11 However, biological and conventional wastewater treatment display limited performance. For instance, they are not able enough to entirely remove certain ECs to acceptable concentration levels in which they are safe for human utilization. Overall, biological processes and conventional treatment strategies are not versatile toward the removal of different classes of micropollutants and they lead to insufficient removal of many micropollutants from water.12–14
On the contrary, advanced processes have shown great ability to degrade or remove many of these ECs.15 There are many advanced technologies like ultraviolet light, activated carbon, and membrane.16,17 The membrane filtration process includes nanofiltration (NF), microfiltration (MF), ultrafiltration (UF), and reverse osmosis (RO) methods. One of the most important membrane filtrations is the RO membrane which processes the solution–diffusion mechanism for transporting organic solutes over the osmotic membranes.18
Although RO membrane can provide efficient removal of various high molecular weight (MW) compounds such as pharmaceuticals, this is inefficient for the removal of low MW compounds. The permeation of organic molecules on RO membranes can be affected by three important factors: (i) RO operating conditions such as temperature, and pH; (ii) membrane properties, for instance, membrane fouling, and pressure; and (iii) molecular physicochemical properties of contaminants including charge/shape/size, functional groups, and hydrophobicity.19,20 In determining the rejection of compounds by membranes, a crucial challenge is membrane fouling. Although membrane cleaning can reverse fouling and as a result prolong its useful lifespan, it needs chemicals that may degrade the structure of membranes. The difficulties in operational experiments guide researchers to find an ideal model to correlate the structures of ECs and their rejections which can apply to predicting the rejection of a wide range of new ECs.16,20
Quantitative structure–activity relationship (QSAR) is an efficient developed model in computational chemistry and used in different scientific fields (environmental engineering, material science, toxicology, and medicinal chemistry) for correlating, quantitatively an activity or property of molecules with chemical structures.21,22 This method finds the relationship between the molecular structure and its physicochemical properties to evaluate the structure and properties of new molecules without experimenting.23
In the QSAR method, theoretical descriptors are a group of numerical indices that are associated with the structure of molecules and encode information about the structure.24–26 There is a variety type of descriptors such as the number of walks and paths, topological descriptors, three-dimensional MoRSE descriptors, standing for molecular representation of structures based on Electronic diffraction; and counting of functional groups.27–30
There are various software for computing descriptors, some of which commonly used are as follows: comparative receptor surface analysis (CoRSA), comparative molecular field analysis (CoMFA), self-organizing molecular field analysis (SOMFA), hydrophobic interactions (HINT), property evaluation by class variables (PRECLAV), and Dragon.31–36
Each software possesses a different algorithm and provides different kinds of descriptors. CoMFA is based on molecular field analysis and represents real three-dimensional descriptors.37 CoRSA generates a virtual receptor model by considering the common electrostatic and steric properties of a set of molecules.31 SOMFA has a similarity in concept with CoMFA and can be applied in three-dimensional QSAR studies.38 HINT has been designed to map and calculate the hydrophobic environment of small proteins and molecules.32 The PRECLAV computes almost 400 constitutional, geometrical, topological, electrostatic, electronic, and quantum “global” descriptors.39 The Dragon can provide nearly 5000 molecular descriptors composed of not only the simplest atom types, fragment counts, and functional groups, but also several geometrical and topological.40
The predictive capability of the QSAR technique is determined by the method used for modeling. Two methods of linear and non-linear modeling determine the mathematical modeling between descriptors and their molecular activity. Linear methods consist of stepwise regression, principal component regression (PCR), principal component analysis (PCA), kernel stone, multiple linear regression (MLR), particle least squares (PLS); and nonlinear approaches including support vector machine (SVM), Kohonen self-organizing map (SOM), radial basis function (RBF), and artificial neural networks (ANN).41–47 The ANN algorithms are non-linear models that make a mapping of the input and output variables, in turn, the map is utilized to predict unknown output as a function of appropriate descriptors.48 The main advantage of ANN methods is that they can incorporate and combine both experimental data and literature-based to solve many problems such as predicting membrane permeability and membrane rejection. This predictive power can be captured to virtually analyze the properties of molecules before testing them in a laboratory.44,45,49–52
There are some publications on applying QSAR modeling to predict the rejection of pollutants using different modeling strategies.16,17,29,50,53,54 For instance, Yangali-Quintanilla et al. studied the rejection of ECs by NF membranes. They used PLS and MLR algorithms on QSAR rejection data to find the relationship between filtration operating conditions, membrane properties, compound properties; and rejection of molecules. Although they applied PCA and stepwise to reduce the number of variables in the modeling processes, the obtained R2 from modeling approaches was up to 0.84. The small value of R2 could be because of the presence of nonlinearity in the data.55 In another research, Yangali-Quintanilla et al. investigated the rejection of molecules using QSAR data and ANN modeling.56 They applied PCA on QSAR data to diminish the number of input variables however, PCA suffers the risk of selecting variables from the input space that may not be related to the output variable of MLR. Moreover, the authors did not examine the importance of the selected descriptors and they did not interpret the trend of ECs' rejections according to theoretical descriptors. Indeed, to the best of our knowledge, there is no exploration study on the relationship between the theoretical molecular descriptors and structure properties of contaminants in their rejection by RO membrane.
Here, we use the experimental data set reported by Breitner et al. to address these neglected issues.57 The variable selection was conducted on QSAR data based on the correlation between the descriptors and rejections. The chosen descriptors were those having high correlation with response and less correlation with the another descriptors.
In this study, we have two main goals; the first one is developing an ANN-QSAR modeling approach for the prediction of rejection compounds according to their structural characteristics by RO membrane. The second one is investigating the effect of functional groups on chemical properties and finding the interactions between compounds and membranes. The interactions depend on some factors such as hydrophobicity/hydrophilicity of molecules and electronegativity of their functional groups, molecular size, and polarity.57
In this work, ANN analysis is applied to QSAR data of ECs using four selected theoretical descriptors including structural information content index (neighborhood symmetry of 1-order) abbreviated as SIC1, R autocorrelation of lag 2/weighted by Sanderson electronegativity (R2e), eigenvalue 03 from edge adjacency matrix weighted by dipole moment (EEig03d), and spectral moment 14 from edge adjacency matrix (ESpm14u). A comprehensive study is conducted on interpreting the QSAR data to understand the relationship between molecular structures of ECs and their rejections based on the values of the selected theoretical descriptors.
All molecular structures (Table 1) were created by the Gaussview 5.0 program59 and optimized in the Gaussian 09 program with the semi-empirical PM6, standing for parameterization method 6.60 PM6 is one of the developed semiempirical techniques which is commonly applied for optimizing the structures of molecules.61 Dragon 5.5-2007 program was used to calculate the molecular descriptors for each compound.62 All statistical computations were conducted in MATLAB 7.0 software and ANN was executed using Matlab Neural Network Toolbox (nntools).46
(1) |
(2) |
One of the most common ANN paradigms used for nonlinear models is the back-propagate feedforward neural network (BPFF), which has been applied in this study.49,50,53,63 In the ANN based on the BPFF method, the weights must be changed in each iteration to achieve the smallest difference between the experimental and predicted outputs by the model. Eqn (3) shows the changing weight in each iteration:
ΔWij + Wij → Wij |
ΔWij = η(t − o)Ini | (3) |
In the BPFF-ANN method, the functioning of the nodes arranged in layers is wherein the input layer receives inputs from the real world. The succeeding layer receives weighted outputs from the preceding layer as its input resulting and, the outputs of the last layer constituting the outputs to the real world.44,45,52,53 A node in the hidden or output layer performs two tasks: first, it sums a bias value plus the weighted inputs from numerous connections and next applies a transfer function to the sum. Second, it propagates the resulting value through outgoing connections to the nodes of the succeeding layer where it undergoes the same process. The number of nodes in the input and output layers is revealed by the number of independent and dependent variables, respectively. In this work, the independent variables are the molecular descriptors and the dependent variable is the rejection parameter. The network can learn the relationships between independent and dependent variables by repeatedly comparing the predicted rejection and the experimental rejection; and the subsequent adjustment of the weight matrix and bias vector of each layer by a backpropagation training algorithm.
To perform an ANN analysis, various initialization method is done to decrease the possibility of convergence to a local minimum and the initialization is used with random weights. The data used in this method are randomly classified into three sets: the test set is employed to avoid the overfitting problem and also shows the optimal number of nodes in the hidden layer, the training set is utilized to adjust the parameters of the weights and finally, the validation set is utilized to confirm the real predictive power of the ANN model.46,54,63–66
In this work, the number of molecular descriptors computed by Dragon software was 3224 and a few important ones should be selected. 1900 out of 3224 descriptors were with all zero element values; therefore, they were omitted from the data set. Furthermore, 980 out of remained descriptors had a high correlation with each other (R > 0.90), which means they possessed similar information about the molecules, which were removed from the next consideration.68 Finally, based on stepwise regression analysis, 11 out of the remained descriptors from the previous step had a high correlation with response and less correlation with each other. These significant descriptors were ranked based on their p-values in ascending order and the four first descriptors were selected for further analysis (the less p-value of the parameter is the more probability of the parameter's significance). The selected descriptors were represented in Table 2.
ID | Name | Description | Block |
---|---|---|---|
1 | SIC1 | Structural information content (neighborhood symmetry of 1-order) | Information indices |
2 | R2e | R autocorrelation of lag 2/weighted by Sanderson electronegativity | GETAWAY descriptors |
3 | EEig03d | Eigenvalue 03 from edge adj. matrix weighted by dipole moment | Edge adjacency indices |
4 | ESpm14u | Spectral moment 14 from edge adj. matrix | Edge adjacency indices |
The second issue in ANN analysis is finding the optimal number of hidden layers and their nodes. Here, ANN optimization was conducted using a toolbox in MATLAB (nntools) based on the BPFF algorithm. The main parameters of the network in the toolbox were as follows: the percentage of data amounts in each classified set (testing, training, and validation), topology, training algorithm, and its factors as presented in Table 3.
Topology | four inputs, one output, and one hidden layer with four neurons (4 × 4 × 1) |
---|---|
Data | Training set: 69.44% randomly selected observation data (50 data values) |
Test set: 15.27% randomly selected observation data (11 data values) | |
Validation set: 15.27% randomly selected observation data (11 data values) | |
Beginning function | Log-sigmoid |
Training algorithm | Levenberge–Marquardt |
Loss function conditions | Minimum MSE |
Stopping conditions | The network stops in one of three ways |
Validation check > 10 | |
Minimum gradient < 10−7 | |
Momentum speed > 1010 |
To find the optimal nodes in the hidden layer, different models with one hidden layer were constructed in which the nodes varied between 1 and 7. Then, the efficiency of each model was evaluated based on correlation coefficient (R2), mean square error (MSE), mean absolute percentage error (MAPE), and residual mean squared error (RMSE).
The above parameters are determined as follows:17,41,42,69,70
(4) |
(5) |
(6) |
(7) |
The main target is minimizing the MSE error of the test set, as data that is not utilized during the train iterations, confirms the power of ANN's ability in the prediction of the new data set.
The ANN optimal structure was achieved according to the maximum amount of R2 and the minimum amount of the MSE of the test set. Fig. 1 displayed a topology of the optimal model in this work.
The ECs rejection was predicted using the optimized ANN model in three sets test, train, and validation, reported in Table 1. The whole of the obtained results was converted to the initial state and plotted in Fig. 2 against the corresponding experimental rejections.
Fig. 2 The scatterplot of predicted rejections of molecules by the ANN method versus corresponding experimental rejections in different data sets. |
In ANN analysis, the statistical parameters of R2, MSE, MPE, and RMSE were obtained for the data sets of the training, testing, and validation, as reported in Table 4. The R2 amounts between the predicted and experimental results show the ANNs are highly effective for making the relationship between the structural properties of ECs and their rejection.
Set of data | R2 | MSE | RMSE | MPE | ||||
---|---|---|---|---|---|---|---|---|
ANN | MLR | ANN | MLR | ANN | MLR | ANN | MLR | |
Total | 0.9528 | 0.8753 | 41.2 | 128.6 | 6.4 | 11.3 | 12.2% | 24.3% |
Training | 0.9434 | 0.8625 | 43.6 | 123.3 | 6.6 | 11.1 | 14.2% | 12.4% |
Test | 0.9583 | 43.9 | 6.6 | 7.3% | ||||
Validation | 0.9759 | 0.9280 | 28.0 | 158.0 | 5.2 | 12.6 | 10.0% | 46.1% |
In the following, the obtained data were investigated by the MLR model,71 and the results were evaluated with the ANN algorithm to reveal the necessity of applied nonlinear models in this research. The QSAR linear equation can be written as in the following:
y = 0.1632 − 0.1963528SIC1 + 0.4113547R2e + 0.1675084EEig03d + 0.867939ESpm14u | (8) |
Fig. 3 displayed the association between the predicted and experimental results of the MLR technique. However, the fit is worse as compared to that given for ANN analysis (Fig. 2), confirming the efficiency of the ANN method for analyzing this QSAR data.
Fig. 3 The scatterplot of predicted rejection of molecules by MLR method versus experimental rejection of molecules in different data sets. |
Furthermore, Fig. 4 illustrates the scheme of experimental rejections versus descriptors SIC1, R2e, EEig03d, and ESpm14u. This figure shows the nonlinear relationship between the structure of molecules and rejections.
(9) |
Based on the Garson we estimate the percentage of the effective input variables on rejection by combining input-hidden and hidden-output connection weights. The results were presented in Table 5. The trend of the importance of input descriptors can be expressed as ESpm14u > R2e > SIC1 > EEig03d. Indeed, the numerical value of the molecular descriptors is important for interpreting the relationship between compound rejection and molecular descriptors and is useful for explaining the results as discussed below.
Input descriptors | Hidden neurons | Hidden to out | |||
---|---|---|---|---|---|
SIC1 | R2e | EEig03d | ESpm14u | ||
−1.0877 | 1.3881 | 0.0052 | 0.9426 | H1 | 0.7219 |
1.8117 | −1.3380 | −1.8139 | −2.1147 | H2 | −0.2206 |
3.6353 | −0.7606 | −1.5711 | 3.9892 | H3 | 0.1505 |
2.2321 | −2.9344 | 2.5124 | 3.6544 | H4 | 0.5867 |
24.97 | 25.72 | 17.35 | 31.94 | Relative importance (%) |
EEig03d represents the eigenvalue 03 from the edge adjacency matrix weighted by dipole moments. This descriptor was assigned to the polarity of molecules, which mostly explains the electronic effect of the compounds and the hydrophobic properties. On the other hand, molecular polarity is an important parameter in rejection by RO membranes because this factor influenced the interaction of the molecules with the membrane and, in turn, the diffusion of the molecules.57,73,74
According to the results, the presence of non-polar or very low polar functional groups increases the numerical value of EEig03d and for compounds with high polarity functional groups, the value is negative. The results showed similar molecules with halogen groups have lower EEig03d than those of methyl groups Table S1.† For instance, the EEig03d value for 2-CT is 1.48 while for CB is 1, as a result, the rejection of 2-CT (88%) is more than CB (63%), and Naph has a 1.61 value of EEig03d by 91% rejection while benzene with 1 value of EEig03d shows 79% rejection.
The methyl group also reduces the polarity, as compounds with one methyl functional group are more polar than those of more methyl functional groups. Hence, these molecules have a low value of EEig03d and are followed by a decrease in rejections. On the contrary, the polar functional groups lead to higher partitioning into the polyamide membrane resulting in lower rejection which coincides with their low value of EEig03d.57 The effect of EEig03d on the polarity of compounds and eventually on RO rejections is presented for two pairs of the same molecules in Table S1.† Such examples are the pair of compounds m-xylenes and toluene; 4-IPT and cumene; t-1,3-DCP and t-1,4-DCB; 2-But and 2-Hex; 1,2-DB-3-CP and DBCM; MTBE and TBEE. It should be mentioned in MLR analysis, EEig03d reported positive regression coefficients, which offered the descriptor had a positive effect on rejection, consequently, by increasing the value of EEig03d, rejection is increased.74
ESpm14u is the first molecular descriptor with a high positive contribution and displays the spectral moment 14 from the edge adjacency matrix. Spectral moments are the most important factors that can be calculated to many different matrices utilized to represent the structure of the states of various systems. The spectral moments k of a matrix M of the molecular graph G is one of the most suitable molecular descriptors for QSAR models of complex structures.75,76 The line graph of the chemical graph represents the sum of all Self-Returning Walks of length r, that begins and ends with a similar vertex.77,78 In the present study, the results of the MLR model showed that ESpm14u has a positive effect on rejection, which is in good agreement with the high value of ESpm14u for large molecules. Interestingly enough, the numerical value of ESpm14u for larger molecules increases.
There are four parameters attributed to the molecular size as follows: MW, volume, and molecular length or width, all of which are used to explain the rejection of organic compounds by the RO membrane. For example, the ESpm14u value for t-1,4-DCB is 8.341, in contrast, for t-1,2-DCE is 5.549 because in t-1,4-DCB the number of atoms is more than t-1,2-DCE, as a result, the rejection of t-1,4-DCB (51%) is higher than t-1,2-DCE (15%). And TBA has a 15.381 value of ESpm14u by 99% rejection compared to IPA has 9.704 and 91% rejection. Similar examples of the pair of compounds are as follows (Table S1†): 1,1,2,2-TCA and 1,1-DCA; 1,1-DCE and 1,1-DCP; 2-But and 2-Hex; MIBK and acetone; BCM and BDCM; cis-1,2-DCE and cis-1,3-DCP; DBCM and DBM; EB and cumene.
R2e is one of the types of molecular descriptors obtained from the R indices of the R-GETAWAY group. In general, GETAWAY is an acronym for topology, atomic masses, and geometry assembly.79 Indeed, R-GETAWAY molecular descriptors combine the information provided by the molecular influence matrix with geometric interatomic distances in the compound. The R2e is a kind of autocorrelation of lag 2 weighted by atomic Sanderson electronegativities, which encodes geometrical information given by the chemical information from electronegativity.76,80
Here, the result shows that compounds with larger R2e numerical values attributed to the compounds with the more electronegative groups and also lower rejection. For example, 1,2-DCP (R2e = 1.861) has a rejection 91%, in contrast, 1,2-DB-3-CP (R2e = 1.675) has a rejection 97%; EDB (R2e = 1.743) by rejection 40% and 1,2-DCA (R2e = 1.89) by rejection 34%; the numerical value of R2e for CF is 2.381 with rejection 73% and the value of R2e for BF is 1.95 with rejection 85%; MC (R2e = 2.282) has rejection 10% and BDCM (R2e = 2.221) has rejection 82%. It should be noted that the presence of an electropositive group in the molecule makes the R2e value reduces and rejection increases. Such examples are the pair of compounds CA and VC; 1,1,2-TCA and 1,2,3-TCP; BCM and DCM; C-Tet and 1,2,3-TCP.
SIC1 is the structural information content of order 1. It represents a general measure of structural complexity and encodes information about atom equivalence. The high value of SIC1 is a sign of relatively branched, large, and polycyclic compounds.76,81,82 As seen in Table S1,† the SIC1 values increase regularly in a series of molecules as branching decreases.
The numerical value of SIC1 for 1,1,1,2-TCA is 0.583 and for 1,1,2-TCA is 0.604 as 1,1,1,2-TCA has a higher rejection (Rej = 99%) than 1,1,2-TCA (Rej = 86%), other instance is C-Tet with 0.311 value of SIC1 that has a higher rejection (Rej = 97%) than BF (Rej = 85%) with 0.59 value of SIC1. Similar results are seen in cumene versus EB; n-BB and n-PB; TBB and toluene (Table S1†). From the results of MLR analysis, the negative regression coefficient of SIC1 argues that SIC1 has a negative effect on rejections, which is consistent with the results obtained for the above pair molecules examples. Fig. 5 shows the MLR coefficients vs. the descriptors.
The leverage approach is one of the most common algorithm to visualize the AD of QSAR models.84 In this strategy, the distance of a compound from the centroid of X, known as the leverage, is calculated based on the following equation:
hi = xi(XTX)−1xiT | (10) |
(11) |
The results of QSAR-MLR and QSAR-ANN were compared based on statistical parameters such as R2, RMSE, and MPE. The lower RMSE and higher R2 which were obtained by the ANN algorithm (6.4 and 0.9528, respectively) displayed the performance of ANN in detecting the relationship between ECs and their rejections with high predictive power. Moreover, MPEs of the whole data were 12.2% and 24.3% for ANN and MLR, respectively, a fact that confirms the superiority of ANN in predicting the rejection processes of ECs by RO membranes.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3ra03177b |
This journal is © The Royal Society of Chemistry 2023 |