MPSM-DTI: prediction of drug–target interaction via machine learning based on the chemical structure and protein sequence†
Abstract
Drug–target interaction (DTI) plays a central role in drug discovery, and predicting DTIs quickly and accurately remains a key challenge. Traditional structure-based methods depend on crystal structures of the protein targets, and ligand-based methods depend on the quantity and diversity of known ligands. Hence, a DTI prediction method that relies on neither is needed. In this study, we collected 40 898 DTIs with Kd values from ChEMBL 27 to develop such a method. Using data standardization, SMOTE sampling and pipeline techniques, we compared 30 models, of which the Morgan-PSSM-SVM model (MPSM-DTI) performed best in both ten-fold cross-validation (F1 = 85.55 ± 0.46%, recall R = 84.89 ± 0.62% and precision P = 86.24 ± 0.81%) and test set validation (F1 = 85.11%, R = 84.34% and P = 85.90%). Results on two external validation sets indicated that the MPSM-DTI model had satisfactory generalization capability and could be used for target prediction of new compounds: the F1, R and P values were 83.27%, 81.41% and 85.21% in external validation set 1, and 86.45%, 85.42% and 87.50% in external validation set 2. From recent literature, we additionally collected 100 new DTIs for eight GPCR targets to show that MPSM-DTI can predict compounds for protein targets without known ligands or crystal structures. Compared with other DTI prediction methods, our method achieved considerable accuracy and addressed the dilemma of DTI prediction for brand new protein targets. Furthermore, we propose a pipeline encapsulation technique that avoids data leakage and improves the generalization ability of the model. The source code of the method is available at https://github.com/pengyayuan/MPSM-DTI.
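The F1 values quoted above are related to precision (P) and recall (R) through the harmonic mean, F1 = 2PR / (P + R). The following minimal sketch recomputes F1 from the P and R values reported in the abstract as a consistency check. The function name `f1_score` and the dictionary layout are illustrative assumptions and are not taken from the MPSM-DTI source code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both P and R are 0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) in percent, copied from the abstract.
# The dictionary keys are descriptive labels chosen for this sketch.
reported = {
    "test set": (85.90, 84.34, 85.11),
    "external validation set 1": (85.21, 81.41, 83.27),
    "external validation set 2": (87.50, 85.42, 86.45),
}

for name, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{name}: computed F1 = {f1:.2f}%, reported F1 = {f1_reported:.2f}%")
```

For the three single-split evaluations, the recomputed F1 agrees with the reported value to two decimal places. The ten-fold cross-validation figures are means over folds, so applying the formula to the mean P and R (about 85.56%) need not reproduce the mean F1 (85.55%) exactly.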