NPred: QSAR classification model for identifying plant based naturally occurring anti-cancerous inhibitors†
Abstract
Predicting which naturally occurring plant-based compounds act as anticancer agents is key to developing new chemical entities in therapeutic oncology. In the present study, four machine learning techniques, namely the naive Bayesian classifier (NB), sequential minimal optimization (SMO), the instance-based learner (IBK) and random forest (RF), were used to model the relationship between the chemical structures of plant-based natural compounds and their anticancer inhibitory activity. The models were trained, tested and validated using 549 active and 424 inactive compounds deposited in the NPACT database. The random forest model built on 881 PubChem fingerprints performed best, with a Matthews correlation coefficient (MCC) of 0.54 and an accuracy of 77.6% in five-fold cross-validation, and an MCC of 0.35 with an accuracy of 68.4% on an independent external validation set. In addition, a frequency-based feature selection method was used to identify fingerprints whose percentage occurrence differs between the active inhibitor dataset and the inactive dataset. Almost all of the top 10 fingerprints (FP797, FP818, FP12, FP179, FP3, FP143, FP712, FP704, FP334 and FP711) are present in vincristine, vinblastine and paclitaxel, three drugs derived from natural products that are used clinically as anticancer therapeutics. Finally, we developed a web server, NPred, to predict the potential of natural compounds as anticancer agents and thus assist researchers working in this area. We expect the results of this study to pave the way for identifying and designing novel natural products as cancer growth inhibitors.
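For readers unfamiliar with the quantities reported above, the following is a minimal sketch of how accuracy and MCC are computed from a binary confusion matrix, and of one plausible reading of frequency-based fingerprint ranking. It is not the NPred implementation. The function names are hypothetical, the toy fingerprints are invented for illustration, and ranking by the absolute difference in percentage occurrence is an assumption, since the abstract does not state the exact criterion.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero (a common convention).
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def occurrence_difference(actives, inactives):
    """Rank fingerprint bits by how differently they occur in two sets.

    `actives` and `inactives` are lists of equal-length 0/1 lists.
    Returns (bit_index, pct_in_actives - pct_in_inactives) pairs,
    sorted by absolute difference, largest first. This ranking rule
    is an assumption, not the published procedure.
    """
    n_bits = len(actives[0])
    diffs = []
    for bit in range(n_bits):
        pct_active = 100.0 * sum(fp[bit] for fp in actives) / len(actives)
        pct_inactive = 100.0 * sum(fp[bit] for fp in inactives) / len(inactives)
        diffs.append((bit, pct_active - pct_inactive))
    return sorted(diffs, key=lambda item: abs(item[1]), reverse=True)


# Toy data, invented for illustration: 4 compounds per class, 3 bits each.
toy_actives = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1]]
toy_inactives = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 1]]
ranking = occurrence_difference(toy_actives, toy_inactives)
```

On the toy data, bit 0 is set in every active but in only one of the four inactives, so it ranks first. In the study itself the same idea would be applied across the 881 PubChem fingerprint bits.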