Xiaoyang
Pang
a,
Yunyang
Zhao
a,
Guo
Li
a,
Jianrong
Liu
*b and
Aixia
Yan
*a
aState Key Laboratory of Chemical Resource Engineering, Department of Pharmaceutical Engineering, Beijing University of Chemical Technology, Beijing, P. R. China. E-mail: yanax@mail.buct.edu.cn
bBUCT-Paris Curie Engineer School, Beijing University of Chemical Technology, Beijing 100029, P. R. China. E-mail: liujianrong2017@126.com
First published on 31st May 2023
Cyclin dependent kinase 4 (CDK4) is a promising target for cancer treatment, and developing new effective CDK4 inhibitors is of great significance in anticancer therapy. In this study, we conducted a structure activity relationship (SAR) study on 3018 CDK4 inhibitors. We applied four machine learning methods, which were Multiple Linear Regression (MLR), Random Forest (RF), Support Vector Machine (SVM) and Deep Neural Network (DNN), to develop 18 classification models based on 3018 inhibitors (dataset 1), 18 classification models based on dataset 1 and decoys, and 24 quantitative structure–activity relationship (QSAR) models based on 1427 inhibitors (dataset 2). We obtained some optimal models. Based on dataset 1, Model A2, built by SVM and MACCS fingerprints, has a prediction accuracy (Q) of 92.68% and a Matthews correlation coefficient (MCC) of 0.874 for the test set. Based on dataset 1 and decoys, Model C2, built by SVM and MACCS fingerprints, has a Q of 98.5% and a MCC of 0.937 for the test set. Based on dataset 2, Model F7, built by SVM and MOE descriptors, has a coefficient of determination (R2) of 0.824 and a root mean squared error (RMSE) of 0.534 for the test set. For classification models, it was found that the more samples used for modelling, the more robust the models, and the better the performance of the models. Moreover, we clustered 3018 inhibitors into 12 subsets, and analysed their scaffolds and fragment features. It was found that 2-aminopyrimidine, pyridine, piperazine and cyclopentane were common scaffolds and fragments in highly active inhibitors. This study can provide guidance for the discovery and optimization of CDK4 inhibitor lead compounds.
There are five drugs that have been launched so far that target CDK4, namely palbociclib,6 ribociclib,7 abemaciclib,8 trilaciclib9 and dalpiciclib.10 Their structures and activities are shown in Table 1, and they are used to treat breast cancer11–15 and extensive-stage small cell lung cancer.16 But nowadays there are breast cancer patients resistant to CDK4 inhibitors and endocrine therapy in the clinic.17–20 Therefore, it is necessary to develop new CDK4 inhibitors.
Computer-aided drug design (CADD) as a convenient and low-cost approach21 is used to discover CDK4 inhibitor lead compounds. As a method of CADD, a classification model22–26 is used for structure–activity relationship research and virtual screening of inhibitors, and a QSAR model27–39 is used to predict the inhibitory activity of a compound. Wu et al.22 built classification models and QSAR models to study the structure–activity relationship of tyrosinase inhibitors. Huo et al.26 built classification models and QSAR models of EGFR inhibitors, and used them for virtual screening. 18 novel EGFR inhibitors were predicted to be highly active, and nine of the 18 novel EGFR inhibitors have been proved to be effective by experiments. As for the study of CDK4 inhibitors, Omar Husham Ahmed et al.39 designed 52 pyrido[2,3-d]pyrimidin-7-one-based CDK4 inhibitors, and then developed 2D- (R2 = 0.6974, RMSE = 0.7193) and 3D-quantitative structure–activity relationship (QSAR) models (R2 = 0.7649, RMSE = 0.5809). Virtual screening of the ChEMBL database was carried out using the validated QSAR model and the molecular docking procedure. A total of six compounds were identified as potentially novel CDK4 inhibitor lead compounds. Le et al.40 built 3D-QSAR models with comparative molecular field analysis (CoMFA) (CDK4: Q2 = 0.543, R2 = 0.967; CDK6: Q2 = 0.624, R2 = 0.984) and comparative molecular similarity indices analysis (CoMSIA) (CDK4: Q2 = 0.518, R2 = 0.937; CDK6: Q2 = 0.584, R2 = 0.975) based on 52 dual CDK4/6 inhibitors, and then designed 10 novel compounds with expected activity and ADME/T properties. Sarhan et al.38 synthesized novel 6-bromo-coumarin-ethylidene-hydrazonyl-thiazolyl and 6-bromo-coumarin-thiazolyl based derivatives, and built a QSAR model (R2 = 0.92, RMSE = 0.44) based on 16 previously reported thiazolyl-hydrazono-coumarin compounds. Five compounds predicted by the QSAR model were verified to have potential anticancer activities, and one of them was considered as a selective radiotherapy agent for solid tumours with promising anticancer activity based on the results of experiments in vitro. Therefore, the classification model and QSAR model are helpful for the discovery and optimization of CDK4 inhibitors.
The previous studies on CDK4 inhibitors were carried out based on a certain scaffold, so it is not easy to find novel scaffolds of inhibitors. To address this issue, in this work, 3018 CDK4 inhibitors were collected to research the structure–activity relationship in this study. We calculated MACCS fingerprints, ECFP4 fingerprints and Corina descriptors, using Random Forest, Support Vector Machine and Deep Neural Network methods to build classification models, to distinguish highly/weakly active inhibitors. Then we calculated Corina, MOE and RDKit descriptors, using Multiple Linear Regression, Random Forest, Support Vector Machine and Deep Neural Network methods to build QSAR models, to predict the CDK4 inhibitory activity of the compound. In addition, we clustered 3018 CDK4 inhibitors to 12 subsets and analysed the features of the scaffolds and fragments. This work aims to study the relationship between the structures and activities of CDK4 inhibitors, and investigate the scaffolds and fragment features of highly active CDK4 inhibitors.
We selected 200 nM as a threshold to distinguish highly and weakly active inhibitors. If the IC50 of one compound is less than 200 nM, it is defined as a highly active inhibitor with a label of 1; in contrast, if the IC50 of one compound is more than 200 nM, it is defined as a weakly active inhibitor with a label of 0. As a result, 1642 compounds were distinguished as highly active inhibitors, and 1376 compounds were distinguished as weakly active inhibitors. We divided dataset 1 into training set 1 (2266 inhibitors) and test set 1 (752 inhibitors) by a Self-Organizing Map (SOM)41 and divided dataset 1 into training set 2 (2263 inhibitors) and test set 2 (755 inhibitors) by a random method. SOM is performed by SONNIA software42 and MACCS fingerprints of compounds, and it enables the chemical space of the training set to cover that of the entire dataset as much as possible. The training set divided by SOM can have rich chemical structure diversity.
In addition, because it is well known that decoys are necessary to evaluate virtual screening methods,43 we also added decoys into dataset 1 to build classification models. To avoid decoy bias, we used two different decoy generation methods for the training set and test set, respectively.44
We used Deepcoy45 to generate decoys based on the training set. Training set 1 contains 2266 inhibitors, 1204 of which are highly active, and based on these highly active inhibitors, we generated 5951 decoys, training set 1 and these decoys form training set 3, which includes 8217 molecules. Training set 2 contains 2263 inhibitors, 1231 of which are highly active, and based on these highly active inhibitors, we generated 6085 decoys, training set 2 and these decoys form training set 4, which includes 8348 molecules. We used MUBD-DecoyMaker2.046 to generate decoys based on the test set. Test set 1 contains 752 inhibitors, 438 of which are highly active, and based on these highly active inhibitors, we generated 2377 decoys, test set 1 and these decoys form test set 3, which includes 3129 molecules. Test set 2 contains 755 inhibitors, 411 of which are highly active, and based on these highly active inhibitors, we generated 2379 decoys, test set 2 and these decoys form test set 4, which includes 3134 molecules.
For QSAR models, based on dataset 1, 1427 compounds whose IC50 was detected by a radiolabeling method were selected as dataset 2, with IC50 ranging from 0.13 nM to 1000 μM. Before modelling, we need to convert IC50 into pIC50 (pIC50 = log(IC50), in the unit of mol L−1), to eliminate the influence of magnitudes. pIC50 ranges from 3 to 9.89. We divided dataset 2 into training set 5 and test set 5 by SOM, and divided dataset 2 into training set 6 and test set 6 by a random method.
The extraction of descriptors is based on the training set. As for MACCS and ECFP4 fingerprints, we calculated the variance of each bit, and extracted the bits whose variance is larger than the mean variance. As for Corina descriptors, we calculated the Pearson coefficient51 between descriptors and activity, and then extracted the descriptors whose coefficient with activity is larger than 0.1, then when the coefficient between the two descriptors is larger than 0.9, we extracted the descriptor whose coefficient with activity is larger than another. As a result, we got a set of Corina descriptors for modelling. Before modelling, the selected Corina descriptors' values were scaled to [0.1, 0.9] using eqn (1):
(1) |
For classification models based on dataset 1 and decoys, we also calculated MACCS fingerprints, ECFP4 fingerprints and Corina descriptors. Based on the training set, as for MACCS and ECFP4 fingerprints, we calculated the variance of each bit to extract descriptors, and as for Corina descriptors, we calculated the Pearson coefficient between descriptors and activity to extract descriptors. Before modelling, the extracted Corina descriptors' values were scaled to [0.1, 0.9] using eqn (1).
For QSAR models, we calculated three types of descriptors for modelling, Corina descriptors, MOE descriptors52 and RDkit descriptors.50 The MOE descriptors consist of 206 descriptors calculated by MOE software.52 The RDkit descriptors consist of 208 descriptors calculated by the RDkit v2020.09.1 package.50 Based on the training set, we calculated the Pearson coefficient51 between descriptors and activity to extract descriptors, and then scaled the extracted descriptors using eqn (1) to build QSAR models.
For QSAR models, we used Multiple Linear Regression (MLR),64 Random Forest (RF),53 Support Vector Machine (SVM)54 and Deep Neural Network (DNN)55 methods to build models. Multiple Linear Regression64 belongs to linear regression and has poor fitting for nonlinear problems. Both Random Forest53 and Support Vector Machine54 can be used to solve classification and regression problems. As for the early-stopping function of DNN models, the evaluation parameter is mean squared error (MSE).65 The training strategy of QSAR models is the same as that of the classification models. The Scikit-learn v1.0.2 package56 was used to build MLR, RF and SVM models, while the Pytorch v1.10.0 package63 was used to build DNN models.
For QSAR models, the coefficient of determination (R2),68 mean absolute error (MAE)65 and root mean squared error (RMSE)65 were used to evaluate performance.
(2) |
For QSAR models, the Williams plot70 is used to visualize the application domain of the model. In the Williams plot, the standardized residuals (σ) and leverage (h) values of the compounds are calculated to determine whether they are in the application domain of the model. The σ value is the difference between the true value and predicted value. The h value of a compound can be calculated using eqn (3). When the leverage value of a compound is less than the leverage warning value (h*), it is in the application domain. The h* value can be calculated using eqn (4).
hi = xTi(XTX)−1xi | (3) |
h* = 3(p + 1)/n | (4) |
Model | Training set/test set | Input descriptors | Methods | Training set | Test set | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Type | n | Q (%) | 5-CVc (%) | MCCd | Q (%) | SEe (%) | SPf(%) | MCC | |||
a n, number of descriptors. b Q, accuracy. c 5-CV, 5-fold cross-validation. d MCC, Matthews correlation coefficient. e SE, sensitivity. f SP, specificity. | |||||||||||
Model A1 | 2266/752 | MACCS | 76 | RF | 85.79 | 82.44 | 0.715 | 90.03 | 90.41 | 89.49 | 0.796 |
Model A2 | 2266/752 | MACCS | 76 | SVM | 89.36 | 83.85 | 0.787 | 93.88 | 94.75 | 92.68 | 0.874 |
Model A3 | 2266/752 | MACCS | 76 | DNN | 87.03 | 85.99 | 0.739 | 90.43 | 89.04 | 92.36 | 0.807 |
Model A4 | 2266/752 | ECFP4 | 294 | RF | 87.29 | 84.91 | 0.745 | 91.49 | 89.95 | 93.63 | 0.829 |
Model A5 | 2266/752 | ECFP4 | 294 | SVM | 91.92 | 86.23 | 0.838 | 93.62 | 93.38 | 93.95 | 0.870 |
Model A6 | 2266/752 | ECFP4 | 294 | DNN | 87.95 | 86.96 | 0.759 | 90.56 | 88.58 | 93.31 | 0.811 |
Model A7 | 2266/752 | Corina | 24 | RF | 89.23 | 83.62 | 0.784 | 91.20 | 89.93 | 92.97 | 0.822 |
Model A8 | 2266/752 | Corina | 24 | SVM | 89.32 | 84.64 | 0.785 | 92.27 | 91.30 | 93.61 | 0.843 |
Model A9 | 2266/752 | Corina | 24 | DNN | 86.14 | 85.12 | 0.722 | 90.13 | 89.24 | 91.37 | 0.800 |
Model B1 | 2263/755 | MACCS | 76 | RF | 86.92 | 84.58 | 0.736 | 87.15 | 87.83 | 86.34 | 0.741 |
Model B2 | 2263/755 | MACCS | 76 | SVM | 91.21 | 86.30 | 0.823 | 88.48 | 90.27 | 86.34 | 0.768 |
Model B3 | 2263/755 | MACCS | 76 | DNN | 89.35 | 88.30 | 0.786 | 86.89 | 89.78 | 83.43 | 0.735 |
Model B4 | 2263/755 | ECFP4 | 293 | RF | 88.73 | 86.39 | 0.773 | 88.61 | 87.59 | 89.83 | 0.772 |
Model B5 | 2263/755 | ECFP4 | 293 | SVM | 93.28 | 87.58 | 0.865 | 89.93 | 90.02 | 89.83 | 0.798 |
Model B6 | 2263/755 | ECFP4 | 293 | DNN | 90.01 | 89.49 | 0.800 | 86.62 | 89.78 | 82.85 | 0.730 |
Model B7 | 2263/755 | Corina | 24 | RF | 90.53 | 86.02 | 0.809 | 85.70 | 85.89 | 85.47 | 0.712 |
Model B8 | 2263/755 | Corina | 24 | SVM | 93.01 | 87.08 | 0.859 | 86.49 | 86.62 | 86.34 | 0.728 |
Model B9 | 2263/755 | Corina | 24 | DNN | 86.95 | 86.68 | 0.738 | 87.02 | 89.78 | 83.72 | 0.738 |
According to Table 2, it was found that the differences between SE and SP of all models are very small, indicating that all models can balance different types of inhibitors very well. The 5-CV accuracies of all models are greater than 82%, indicating that all models are robust. The MCC values on the test set of all models are greater than 0.77, indicating that all models have good performance. The prediction accuracies on the training set and test set of all models are more than 85%, indicating that all models have good performance of fitting and prediction respectively.
Comparing the performances of models built by different methods, it was found that the models built by SVM have the best performance on the test set, and the models built by DNN have the best robustness. Comparing the performances of models based on different descriptors, it was found that the models based on ECFP4 fingerprints have the best performance on the test set. The models based on Corina descriptors have good performance with a small number of descriptors, indicating that the extracted Corina descriptors are very important for inhibitory activity.
Comparing the performances of models based on the two different training sets, the optimal models were selected according to the prediction MCC on the test set. It was found that among Models A1–A9 based on training set 1 (divided by SOM), Model A2 is the optimal model, which is built by MACCS fingerprints and SVM, and the prediction accuracy on the test set is 93.88%, and MCC is 0.874. Among Models B1–B9 based on training set 2 (divided by a random method), Model B5 is the optimal model, which is built by ECFP4 fingerprints and SVM, and the prediction accuracy on the test set is 89.93%, and MCC is 0.798. The two optimal models are both SVM models, indicating that SVM is a suitable method for this study.
Model | Threshold0.90a | Training set | Test set | ||||
---|---|---|---|---|---|---|---|
Coverageb(%) | 5-CVc (%) | MCCd | Coverage (%) | Q (%) | MCC | ||
a Threshold0.90, the threshold of the model when the accumulative accuracy of 5-fold cross-validation is close to 90%. b Coverage, the proportion of compounds corresponding to Threshold0.90. c 5-CV, 5-fold cross-validation. d MCC, Matthews correlation coefficient. e Q, accuracy | |||||||
Model A1 | 0.107 | 81.95 | 90.03 | 0.799 | 87.20 | 94.95 | 0.897 |
Model A2 | 0.114 | 82.66 | 90.01 | 0.798 | 87.60 | 96.35 | 0.925 |
Model A3 | 0.108 | 82.08 | 90.05 | 0.799 | 87.33 | 96.03 | 0.919 |
Model A4 | 0.131 | 83.98 | 90.01 | 0.799 | 88.40 | 95.93 | 0.917 |
Model A5 | 0.096 | 80.76 | 90.05 | 0.799 | 86.40 | 95.83 | 0.915 |
Model A6 | 0.107 | 81.95 | 90.03 | 0.799 | 87.20 | 96.33 | 0.925 |
Model A7 | 0.086 | 79.79 | 90.04 | 0.799 | 86.00 | 95.50 | 0.909 |
Model A8 | 0.089 | 80.23 | 90.04 | 0.799 | 86.27 | 94.28 | 0.884 |
Model A9 | 0.083 | 79.35 | 90.04 | 0.799 | 85.73 | 95.96 | 0.917 |
Model B1 | 0.176 | 87.04 | 90.04 | 0.798 | 90.07 | 91.03 | 0.819 |
Model B2 | 0.231 | 90.40 | 90.01 | 0.798 | 92.19 | 90.66 | 0.812 |
Model B3 | 0.222 | 89.87 | 90.00 | 0.798 | 92.05 | 90.94 | 0.817 |
Model B4 | 0.223 | 90.05 | 90.02 | 0.798 | 92.05 | 91.80 | 0.835 |
Model B5 | 0.186 | 87.48 | 90.04 | 0.798 | 90.46 | 90.34 | 0.806 |
Model B6 | 0.176 | 86.91 | 90.02 | 0.798 | 89.93 | 89.69 | 0.793 |
Model B7 | 0.154 | 85.71 | 90.04 | 0.798 | 88.87 | 90.46 | 0.807 |
Model B8 | 0.198 | 88.41 | 90.04 | 0.798 | 91.13 | 89.83 | 0.794 |
Model B9 | 0.191 | 87.75 | 90.02 | 0.799 | 90.73 | 90.95 | 0.817 |
According to Table 3, we found that within application domain, the prediction accuracy and MCC on the test set are improved. The threshold of Models A1–A9 ranges from 0.08 to 0.13, and the coverage of the training set and test set ranges from 80% to 87%, respectively. The threshold of Models B1–B9 ranges from 0.15 to 0.23, and the coverage of the training set and test set ranges from 87% to 90%, respectively. According to the MCC on the test set, Model A2 was the optimal model. Model A2 is the model built by MACCS fingerprints and SVM, with the threshold of 0.114. The coverage of the training set and test set is 82.66% and 87.6%, respectively, and the prediction accuracy on the test set is 96.35% and MCC is 0.925.
For training set 3, we calculated and extracted three types of descriptors (81 MACCS fingerprints, 299 ECFP4 fingerprints, and 13 Corina descriptors), using three machine learning methods (RF, SVM and DNN) to build Models C1–C9. For training set 4, we calculated and extracted three types of descriptors (81 MACCS fingerprints, 298 ECFP4 fingerprints, and 16 Corina descriptors), using three machine learning methods (RF, SVM and DNN) to build Models D1–D9. The performances of the 18 classification models are shown in Table 4. The optimal hyper-parameters of each model are given in the ESI.†
Model | Training set/test set | Input descriptors | Methods | Training set | Test set | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Type | n | Q (%) | 5-CV (%) | MCC | Q (%) | SE (%) | SP (%) | MCC | |||
Model C1 | 8217/3129 | MACCS | 81 | RF | 98.93 | 93.90 | 0.957 | 98.40 | 92.91 | 99.29 | 0.933 |
Model C2 | 8217/3129 | MACCS | 81 | SVM | 98.54 | 93.89 | 0.942 | 98.50 | 92.22 | 99.52 | 0.937 |
Model C3 | 8217/3129 | MACCS | 81 | DNN | 98.78 | 95.64 | 0.951 | 96.52 | 88.79 | 97.77 | 0.857 |
Model C4 | 8217/3129 | ECFP4 | 299 | RF | 99.99 | 95.85 | 1.000 | 97.67 | 92.22 | 98.55 | 0.903 |
Model C5 | 8217/3129 | ECFP4 | 299 | SVM | 99.94 | 95.61 | 0.998 | 98.34 | 93.14 | 99.18 | 0.930 |
Model C6 | 8217/3129 | ECFP4 | 299 | DNN | 98.47 | 90.96 | 0.938 | 96.74 | 85.81 | 98.51 | 0.862 |
Model C7 | 8217/3129 | Corina | 13 | RF | 99.60 | 93.55 | 0.984 | 97.12 | 82.61 | 99.48 | 0.876 |
Model C8 | 8217/3129 | Corina | 13 | SVM | 96.73 | 94.39 | 0.868 | 96.71 | 86.27 | 98.40 | 0.861 |
Model C9 | 8217/3129 | Corina | 13 | DNN | 94.01 | 93.24 | 0.750 | 94.95 | 76.20 | 97.99 | 0.781 |
Model D1 | 8348/3134 | MACCS | 81 | RF | 99.09 | 95.29 | 0.964 | 96.49 | 82.97 | 98.53 | 0.842 |
Model D2 | 8348/3134 | MACCS | 81 | SVM | 96.61 | 95.27 | 0.863 | 96.49 | 82.48 | 98.60 | 0.841 |
Model D3 | 8348/3134 | MACCS | 81 | DNN | 98.97 | 97.22 | 0.959 | 93.94 | 79.81 | 96.07 | 0.741 |
Model D4 | 8348/3134 | ECFP4 | 298 | RF | 99.99 | 96.31 | 1.000 | 97.19 | 86.37 | 98.82 | 0.874 |
Model D5 | 8348/3134 | ECFP4 | 298 | SVM | 97.07 | 96.24 | 0.884 | 97.32 | 88.81 | 98.60 | 0.881 |
Model D6 | 8348/3134 | ECFP4 | 298 | DNN | 98.51 | 91.81 | 0.940 | 96.20 | 80.78 | 98.53 | 0.828 |
Model D7 | 8348/3134 | Corina | 16 | RF | 99.69 | 94.85 | 0.988 | 95.88 | 75.67 | 98.93 | 0.810 |
Model D8 | 8348/3134 | Corina | 16 | SVM | 96.62 | 95.40 | 0.863 | 95.98 | 82.73 | 97.98 | 0.821 |
Model D9 | 8348/3134 | Corina | 16 | DNN | 94.69 | 93.33 | 0.782 | 94.54 | 73.72 | 97.69 | 0.751 |
According to Table 4, we found that the differences between SE and SP of most models are within the permissible limits, especially the models based on training set 3, indicating that the models can balance different types of inhibitors very well. The 5-CV accuracies of all models are greater than 90%, indicating that all models are robust. The MCC values on the test set of all models are greater than 0.74, indicating that all models have good performance. The prediction accuracies on the training set and test set of all models are more than 93%, indicating that all models have good performance of fitting and prediction respectively.
Compared to the models without decoys (Models A1–A9 and B1–B9 in Table 2), the 5-CV accuracies, average MCC and prediction accuracies for the training set and test set get better, indicating that the more samples used for modelling, the more robust the models, and the better the performance of the models.
While the differences between SE and SP increased to some extent, the reason for which is that the number of weakly active inhibitors has increased significantly, the different types of inhibitors used for modelling are unevenly distributed.
Comparing the performances of models built by different methods, it was found that the models built by SVM and RF have better performance on the test set than other models. Comparing the performances of models based on different descriptors, it was found that the models based on ECFP4 fingerprints have the best performance on the test set.
Comparing the performances of models based on the two different training sets, the optimal models were selected according to the prediction MCC on the test set. It was found that among Models C1–C9 based on training set 3, Model C2 is the optimal model, which is built by MACCS fingerprints and SVM, and the prediction accuracy on the test set is 98.5%, while MCC is 0.937. Among Models D1–D9 based on training set 4, Model D5 is the optimal model, which is built by ECFP4 fingerprints and SVM, and the prediction accuracy on test set 4 is 87.32%, while MCC is 0.881.
Model | Training set/test set | Input descriptors | Methods | Training set | Test set | |||||
---|---|---|---|---|---|---|---|---|---|---|
Type | n | R 2 | MAEc | RMSEd | R 2 | MAE | RMSE | |||
a n, number of descriptors. b R 2, coefficient of determination. c MAE, mean absolute error. d RMSE, root mean squared error. | ||||||||||
Model E1 | 1061/366 | Corina | 24 | MLR | 0.621 | 0.605 | 0.764 | 0.588 | 0.647 | 0.816 |
Model E2 | 1061/366 | Corina | 24 | RF | 0.901 | 0.311 | 0.390 | 0.774 | 0.453 | 0.605 |
Model E3 | 1061/366 | Corina | 24 | SVM | 0.920 | 0.279 | 0.351 | 0.805 | 0.421 | 0.561 |
Model E4 | 1061/366 | Corina | 24 | DNN | 0.886 | 0.330 | 0.418 | 0.778 | 0.460 | 0.598 |
Model E5 | 1061/366 | MOE | 33 | MLR | 0.689 | 0.547 | 0.692 | 0.650 | 0.595 | 0.752 |
Model E6 | 1061/366 | MOE | 33 | RF | 0.905 | 0.304 | 0.383 | 0.764 | 0.462 | 0.617 |
Model E7 | 1061/366 | MOE | 33 | SVM | 0.927 | 0.214 | 0.336 | 0.793 | 0.437 | 0.579 |
Model E8 | 1061/366 | MOE | 33 | DNN | 0.932 | 0.250 | 0.323 | 0.789 | 0.440 | 0.583 |
Model E9 | 1061/366 | RDkit | 31 | MLR | 0.671 | 0.564 | 0.712 | 0.660 | 0.581 | 0.743 |
Model E10 | 1061/366 | RDkit | 31 | RF | 0.904 | 0.306 | 0.385 | 0.796 | 0.432 | 0.574 |
Model E11 | 1061/366 | RDkit | 31 | SVM | 0.936 | 0.217 | 0.313 | 0.805 | 0.410 | 0.562 |
Model E12 | 1061/366 | RDkit | 31 | DNN | 0.932 | 0.250 | 0.323 | 0.770 | 0.460 | 0.610 |
Model F1 | 1050/357 | Corina | 27 | MLR | 0.655 | 0.548 | 0.729 | 0.567 | 0.639 | 0.838 |
Model F2 | 1050/357 | Corina | 27 | RF | 0.901 | 0.311 | 0.391 | 0.744 | 0.462 | 0.644 |
Model F3 | 1050/357 | Corina | 27 | SVM | 0.918 | 0.279 | 0.356 | 0.807 | 0.427 | 0.559 |
Model F4 | 1050/357 | Corina | 27 | DNN | 0.916 | 0.280 | 0.359 | 0.791 | 0.441 | 0.582 |
Model F5 | 1050/357 | MOE | 43 | MLR | 0.721 | 0.512 | 0.655 | 0.657 | 0.578 | 0.745 |
Model F6 | 1050/357 | MOE | 43 | RF | 0.908 | 0.298 | 0.376 | 0.763 | 0.444 | 0.619 |
Model F7 | 1050/357 | MOE | 43 | SVM | 0.969 | 0.160 | 0.218 | 0.824 | 0.404 | 0.534 |
Model F8 | 1050/357 | MOE | 43 | DNN | 0.943 | 0.225 | 0.296 | 0.807 | 0.427 | 0.559 |
Model F9 | 1050/357 | RDkit | 45 | MLR | 0.713 | 0.518 | 0.665 | 0.665 | 0.564 | 0.737 |
Model F10 | 1050/357 | RDkit | 45 | RF | 0.910 | 0.297 | 0.373 | 0.784 | 0.435 | 0.592 |
Model F11 | 1050/357 | RDkit | 45 | SVM | 0.939 | 0.216 | 0.306 | 0.790 | 0.440 | 0.583 |
Model F12 | 1050/357 | RDkit | 45 | DNN | 0.939 | 0.235 | 0.306 | 0.774 | 0.457 | 0.605 |
We selected R2 as the most important evaluation standard, and found that except for the multiple linear regression models, the prediction R2 of the other 18 models on the test set was greater than 0.74, indicating that the 18 QSAR models have good performances. The range of RMSE on the test set is 0.534–0.644, and the range of MAE is 0.404–0.579, that means the prediction error is within the acceptable range.
Comparing the performances of models built by different methods, it was found that the models built by SVM have the best performance on the test set, indicating that SVM is the best method for building QSAR models. Comparing the performances of models based on different descriptors, it was found that the models built by MOE descriptors have the best performance on the test set, indicating that the extracted MOE descriptors are very important for inhibitory activity.
Comparing the performances of models based on the two different training sets, the optimal models were selected according to the prediction R2 on the test set. It was found that among Models E1–E12 based on training set 5 (divided by SOM), Model E3 is the optimal model, which is built by Corina descriptors and SVM, and the prediction R2 on the test set is 0.805, and RMSE is 0.561. Among Models F1–F12 based on training set 6 (divided by a random method), Model F7 is the optimal model, which is built by MOE descriptors and SVM, and the prediction R2 on the test set is 0.824, and RMSE is 0.534. Furthermore, we draw the prediction results of Model E3 and Model F7, as shown in Fig. 1.
Table 6 shows the performances of the two optimal models in the application domain. It was found that the RMSE of the two models on the test set shows an improvement, indicating that the application domain can screen out some outliers with large prediction deviation, and the prediction of compounds in the application domain is considered to be reliable.
Model | Training set | Test set | ||||
---|---|---|---|---|---|---|
R 2 | MAE | RMSE | R 2 | MAE | RMSE | |
Model E3 | 0.920 | 0.279 | 0.351 | 0.805 | 0.421 | 0.561 |
Model E3 (AD) | 0.918 | 0.280 | 0.354 | 0.828 | 0.403 | 0.519 |
Model F7 | 0.969 | 0.160 | 0.218 | 0.824 | 0.404 | 0.534 |
Model F7 (AD) | 0.968 | 0.161 | 0.219 | 0.827 | 0.400 | 0.528 |
On the basis of dataset 1, we clustered inhibitors based on their ECFP4 fingerprints. The clustering methods are K-means and hierarchical clustering, while the dimension-reduction methods are TSNE and UMAP, so we tried two clustering methods and four combinations (TSNE + K-means, TSNE + HC, UMAP + K-means, and UMAP + HC) which are dimension reduction followed by clustering, so there are 6 different methods. We set the number of categories to 11, and carried out 6 methods to clustering, and then evaluated the results, as shown in Table 7.
Method | Silhouette | CH score | DB score |
---|---|---|---|
K-Means | 0.13 | 144.82 | 2.39 |
HC | 0.13 | 145.40 | 2.71 |
TSNE + K-means | 0.53 | 4882.42 | 0.64 |
TSNE + HC | 0.50 | 4414.13 | 0.66 |
UMAP + K-means | 0.67 | 14780.75 | 0.45 |
UMAP + HC | 0.64 | 12900.34 | 0.49 |
According to Table 7, it was found that UMAP + K-means has the best performance in the 6 different methods, with the lowest values of silhouette coefficient and DB score, and the highest value of CH score. So, we used this method to explore the appropriate number of categories. We set the number from 9 to 15, and evaluated the clustering results, as shown in Table 8.
According to Table 8, it was found that when the number is 12, the clustering result has the best silhouette coefficient, good DB score and best CH score. Therefore, we chose 12 as the number of categories.
We clustered 3018 CDK4 inhibitors into 12 subsets by UMAP and K-means, and visualized them by the UMAP two-dimensional vector. Fig. 3 shows the clustering result.
According to Fig. 3, it was found that the distribution of each subset is relatively concentrated, and there is a clear boundary between them and other subsets. The percentages of highly and weakly active inhibitors in the 12 subsets are shown in Fig. 4. And it was found that subsets 1, 3 and 6 have a large proportion of highly active inhibitors, while subsets 4 and 7 have a large proportion of weakly active inhibitors.
Since dimension reduction will lead to the loss of molecular structure information of compounds, the use of UMAP two-dimensional vectors for clustering will cause the compounds in the same subset to maybe have different molecular scaffolds. We analysed the main scaffolds with a high number of compounds in each subset, and analysed fragment features of inhibitors containing these scaffolds by using statistical methods. Table 9 shows the scaffolds and fragment features on each subset.
Subset (high/weak) | Main scaffolds (high/weak) | Fragment | |||
---|---|---|---|---|---|
Structure | p_Higha | p_Weaka | Diffb | ||
a p_High and p_Weak are the frequency of the fragment in highly and weakly active inhibitors, respectively. b Diff, difference between p_High and p_Weak. | |||||
Subset 1 (469/68) | 0.66 | 0.34 | 0.32 | ||
0.41 | 0.65 | −0.24 | |||
0.20 | 0.40 | −0.20 | |||
Subset 2 (121/214) | 0.44 | 0.05 | 0.39 | ||
0.84 | 0.47 | 0.37 | |||
0.35 | 0.03 | 0.33 | |||
0.38 | 0.11 | 0.27 | |||
0.07 | 0.23 | 0.15 | |||
— | 0.73 | — | |||
— | 0.73 | — | |||
0.60 | 0.12 | 0.48 | |||
0.60 | 0.16 | 0.44 | |||
0.40 | 0.04 | 0.36 | |||
0.00 | 0.84 | −0.84 | |||
0.00 | 0.84 | −0.84 | |||
0.00 | 0.84 | −0.84 | |||
Subset 3 (311/28) | 0.82 | — | — | ||
0.77 | — | — | |||
0.47 | — | — | |||
Subset 4 (6/93) | — | 0.95 | — | ||
— | 0.78 | — | |||
Subset 5 (64/160) | 0.75 | 0.47 | 0.28 | ||
0.56 | 0.36 | 0.20 | |||
0.25 | 0.10 | 0.15 | |||
0.00 | 0.12 | −0.12 | |||
0.63 | 0.06 | 0.57 | |||
0.38 | 0.03 | 0.35 | |||
0.25 | 0.09 | 0.16 | |||
0.00 | 0.32 | −0.32 | |||
0.38 | 0.62 | −0.24 | |||
0.86 | 0.63 | 0.22 | |||
0.21 | 0.03 | 0.18 | |||
0.25 | 0.53 | −0.28 | |||
0.11 | 0.33 | −0.23 | |||
0.11 | 0.30 | −0.19 | |||
Subset 6 (153/3) | 0.89 | — | — | ||
0.87 | — | — | |||
0.82 | — | — | |||
Subset 7 (21/249) | — | 0.95 | — | ||
— | 0.71 | — | |||
Subset 8 (52/101) | 0.55 | 0.22 | 0.33 | ||
0.34 | 0.07 | 0.27 | |||
0.32 | 0.13 | 0.19 | |||
0.14 | 0.40 | −0.26 | |||
0.16 | 0.35 | −0.19 | |||
1.00 | 0.48 | 0.52 | |||
1.00 | 0.48 | 0.52 | |||
0.60 | 0.28 | 0.32 | |||
0.00 | 0.40 | −0.40 | |||
Subset 9 (112/247) | 0.35 | 0.14 | 0.21 | ||
0.00 | 0.19 | −0.19 | |||
0.27 | 0.00 | 0.27 | |||
0.00 | 0.25 | −0.25 | |||
0.02 | 0.25 | −0.23 | |||
— | 0.63 | — | |||
Subset 10 (86/26) | 0.88 | — | — | ||
0.80 | — | — | |||
0.63 | — | — | |||
Subset 11 (168/108) | 0.82 | 0.31 | 0.51 | ||
0.64 | 0.21 | 0.42 | |||
0.58 | 0.69 | −0.11 | |||
Subset 12 (79/79) | — | 0.57 | — | ||
— | 0.57 | — | |||
— | 0.5 | — | |||
0.98 | — | — | |||
0.34 | — | — | |||
0.57 | 0.09 | 0.48 | |||
1.00 | 0.70 | 0.30 | |||
1.00 | 0.78 | 0.22 | |||
0.29 | 0.96 | −0.67 | |||
0.29 | 0.78 | −0.50 |
According to Table 9, it was found that the compounds in subsets 1, 3, 7, 10, and 11 are 2-aminopyrimidine derivatives. The main scaffold of subset 1 is 9H-pyrido[4′,3′:4,5]pyrrolo[2,3-d]pyrimidin-2-amine, there were 425 highly active inhibitors and 65 weakly active inhibitors, and it was found that cyclopentane was more likely to appear in highly active inhibitors, while the hydroxyl group was more likely to appear in weakly active inhibitors. The main scaffold of subset 3 is 2-amino-N,N-dimethyl-7H-pyrrolo[2,3-d]pyrimidine-6-carboxamide, ribociclib7 has this scaffold, there were 295 highly active inhibitors and 16 weakly active inhibitors, and it was found that cyclopentane, pyridine and piperazine were the most frequent fragments in the highly active inhibitors. The main scaffold of subset 7 is 9-(prop-2-yl)purin-2-amine, there were 21 highly active inhibitors and 224 weakly active inhibitors, and it was found that butane and cyclohexanamine were the most frequent fragments in the weakly active inhibitors. The main scaffold of subset 10 is 6,7,8,9-tetrahydropyrazino[2′,1′:5,1]pyrrolo[2,3-d]pyrimidin-2-amine, trilaciclib9 has this scaffold, there were 84 highly active inhibitors and 2 weakly active inhibitors, and it was found that pyridine, piperazine and 7′,8′-dihydro-6′H-spiro[cyclohexane-1,9′-pyrazino[2′,1′:5,1]pyrrolo[2,3-d]pyrimidine]-6′-one were the most frequent fragments in the highly active inhibitors. The main scaffold of subset 11 is 2-amino-7,8-dihydropyrido[2,3-d]pyrimidin-7-one, palbociclib6 has this scaffold, there were 99 highly active inhibitors and 61 weakly active inhibitors, and it was found that cyclopentane and piperazine were more likely to appear in highly active inhibitors, while benzene was more likely to appear in weakly active inhibitors.
The compounds in subsets 2, 4, and 6 are pyrazol derivatives. The compounds in subset 2 contain three main scaffolds, the first one is 2,4-dihydroindeno[1,2-c]pyrazol-4-one, there were 82 highly active inhibitors and 75 weakly active inhibitors, and it was found that 2-methyldiazane-1-carbaldehyde, the hydroxyl group, 1,4-oxazinane and thiophene were more likely to appear in highly active inhibitors, while isobutane was more likely to appear in weakly active inhibitors; the second one is quinazoline, there were 44 weakly active inhibitors, and it was found that fluoroform and isobutane were the most frequent fragments in the weakly active inhibitors; the third one is 1H-indazole, there were 5 highly active inhibitors and 25 weakly active inhibitors, and it was found that ethanol, phenol and methylbenzene were more likely to appear in highly active inhibitors, while formic acid, 1λ6-1,2-thiazolidine-1,1-dione and methanethiol were more likely to appear in weakly active inhibitors. The main scaffold of subset 4 is 4,5-dihydro-1H-pyrazolo[3,4-d]pyrimidin-4-one, there were 6 highly active inhibitors and 88 weakly active inhibitors, and it was found that 1,3-dichlorobenzene and benzene were the most frequent fragments in the weakly active inhibitors. The main scaffold of subset 6 is 4-fluoro-1H-benzo[d]imidazole, abemaciclib8 has this scaffold, there were 126 highly active inhibitors and 3 weakly active inhibitors, and it was found that pyridine, 5-fluoropyrimidine and 4-fluoro-2-methyl-1-(prop-2-yl)benzo[d]imidazole were the most frequent fragments in the highly active inhibitors.
The compounds in subset 5 contain three main scaffolds, the first one is 2-(phenylamino)pyrimidine, there were 16 highly active inhibitors and 58 weakly active inhibitors, and it was found that propane, phenol and 2-methyl-1-(prop-2-yl)imidazole were more likely to appear in highly active inhibitors, while 3-methyl-3H,2H-1,3-thiazol-2-one was more likely to appear in weakly active inhibitors; the second one is 4-(phenylamino)pyrimidine, there were 8 highly active inhibitors and 34 weakly active inhibitors, and it was found that azanesulfonamide, acetonitrile and piperazine were more likely to appear in highly active inhibitors, while methanesulfonamide and propan-2-ol were more likely to appear in weakly active inhibitors; the third one is pyrazolo[5,1-f][1,2]diazine, there were 28 highly active inhibitors and 30 weakly active inhibitors, and it was found that benzene and piperazine were more likely to appear in highly active inhibitors, while propane, 2-(cyclopropylamino)pyrimidine and pyrazolo[5,1-f][1,2]diazin-6-ol were more likely to appear in weakly active inhibitors.
The compounds in subset 8 contain two main scaffolds, the first one is 4-[(Z)-aminomethylidene]-1,2,3,4-tetrahydroisoquinoline-1,3-dione, there were 44 highly active inhibitors and 55 weakly active inhibitors, and it was found that phenol, pyridine and benzene-1,2-diol were more likely to appear in highly active inhibitors, while piperazine and hexahydropyridine were more likely to appear in weakly active inhibitors; the second one is 5,7-dihydroxy-4H-chromen-4-one, there were 5 highly active inhibitors and 25 weakly active inhibitors, and it was found that isobutane, chlorobenzene and 1-methylhexahydropyridin-3-ol were more likely to appear in highly active inhibitors, while 5,7-dihydroxy-8-(1-methyl-1,2,3,6-tetrahydropyridin-4-yl)-4H-chromen-4-one was more likely to appear in weakly active inhibitors.
The compounds in subset 9 contain three main scaffolds, the first one is 6,7,12,13-tetrahydro-5H-pyrrolo[4,3-c]indolo[2,3-a]carbazol-5-one, there were 40 highly active inhibitors and 21 weakly active inhibitors, and it was found that propan-1-ol was more likely to appear in highly active inhibitors, while 12,13-dimethyl-6H-pyrrolo[4,3-c]indolo[2,3-a]carbazole-5,7-dione was more likely to appear in weakly active inhibitors; the second one is 5,11-dihydro-3H-pyrrolo[4,3-c]indolo[6,7-a]carbazole-4,6-dione, there were 63 highly active inhibitors and 8 weakly active inhibitors, and it was found that 9-hydroxy-3-methyl-5,11-dihydropyrrolo[4,3-c]indolo[6,7-a]carbazole-4,6-dione was more likely to appear in highly active inhibitors, while 3,8-dimethyl-5,11-dihydropyrrolo[4,3-c]indolo[6,7-a]carbazole-4,6-dione was more likely to appear in weakly active inhibitors; the third one is N-[2-(1H-indol-3-yl)ethyl]-N-methylbenzamide, there were 62 weakly active inhibitors, and it was found that 2,3,4,9-tetrahydro-1H-pyrido[3,4-b]indole-2-carbaldehyde was the most frequent fragment in the weakly active inhibitors.
The compounds in subset 12 contain three main scaffolds, the first one is imidazo[1,2-a]pyridine, there were 30 weakly active inhibitors, and it was found that 1,3-dichlorobenzene, dimethylamine and 1,3-difluorobenzene were the most frequent fragments in the weakly active inhibitors; the second one is phenyl[2-(phenylamino)-1,3-thiazol-5-yl]methanone, there were 58 highly active inhibitors and 1 weakly active inhibitor, and it was found that methanol and fluorobenzene were the most frequent fragments in the highly active inhibitors; the third one is 2-amino-1,3-thiazole-5-thiol, there were 7 highly active inhibitors and 23 weakly active inhibitors, and it was found that pyridine and isobutane were more likely to appear in highly active inhibitors, while formaldehyde and formic acid were more likely to appear in weakly active inhibitors.
In summary, it was found that piperazine, cyclopentane, pyridine and 2-aminopyrimidine were common fragments in highly active inhibitors.
We considered adding decoys to train models and evaluate their performance, because decoys are necessary to evaluate virtual screening methods. Therefore, we used two different decoy generation methods in training and test sets, respectively. Then, MACCS fingerprints, ECFP4 fingerprints and Corina descriptors were calculated and extracted. The 18 classification models were built by inputting three types of descriptors and using Random Forest, Support Vector Machine and Deep Neural Network methods. Model C2 and Model D5 are the optimal models based on the two different training sets. The prediction accuracy of Model C2 on the test set is 98.5% and MCC is 0.937. The prediction accuracy of Model D5 on the test set is 87.32% and MCC is 0.881. Compared to the other 18 classification models (Models A1–A9 and B1–B9 in Table 2), it was found that the more samples used for modelling, the more robust the models, and the better the performance of the models.
On the basis of 3018 CDK4 inhibitors, we selected 1427 CDK4 inhibitors whose IC50 was detected by a radiolabeling method to build QSAR models. Two training sets and test sets were divided by SOM and random method. Corina, MOE and RDkit descriptors of the compounds were calculated and extracted, and 24 QSAR models were built by inputting three types of descriptors and using Multiple Linear Regression, Random Forest, Support Vector Machine and Deep Neural Network methods. Model E3 and Model F7 are the optimal models based on the two different training sets. The prediction R2 of Model E3 on the test set is 0.805 and RMSE is 0.561. The prediction R2 of Model F7 on the test set is 0.824 and RMSE is 0.534. The application domain of Model E3 and Model F7 was calculated, and the performances of the models in the application domain were recalculated, and it was found that the RMSE predicted on the test set had improved in the application domain, indicating that some outliers with large prediction deviation could be deleted in the application domain, and the compound prediction in the application domain was considered to be reliable.
In addition, we used UMAP and K-means to cluster 3018 inhibitors into 12 subsets and analysed the scaffolds and fragment features of CDK4 inhibitors. It was found that piperazine, cyclopentane, 2-aminopyrimidine and pyridine were important structures in the highly active inhibitors.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2dd00143h |
This journal is © The Royal Society of Chemistry 2023 |