High-accuracy QSAR models of narcosis toxicities of phenols based on various data partition, descriptor selection and modelling methods
Abstract
The environmental protection agency thinks that quantitative structure–activity relationship (QSAR) analysis can better replace toxicity tests. In this paper, we developed QSAR methods to evaluate the narcosis toxicities of 50 phenol analogues. We first built multiple linear regression (MLR), stepwise multiple linear regression (SLR) and support vector regression (SVR) models using five descriptors and three different partitions, and the optimal SVR models with all three training-test partitions had the highest external prediction ability, about 10% higher than the models in the literature. Second, to identify more effective descriptors, we applied two in-house methods to select descriptors with clear meanings from 1264 descriptors calculated by the PCLIENT software and used them to construct the MLR, SLR and SVR models. Our results showed that our best SVR model (Rpred2 = 0.972) significantly increased 16.55% on the test set, and the appropriate partition presented the better stability. The different partitions of the training-test datasets also supported the excellent predictive power of the best SVR model. We further evaluated the regression significance of our SVR model and the importance of each single descriptor of the model according to the interpretability analysis. Our work provided a valuable exploration of different combinations among data partition, descriptor selection and model and a useful theoretical understanding of the toxicity of phenol analogues, especially for such a small dataset.