Zhe Wang,a Huiyong Sun,a Xiaojun Yao,b Dan Li,a Lei Xu,c Youyong Li,d Sheng Tian,d and Tingjun Hou*ae
aCollege of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China. E-mail: tingjunhou@zju.edu.cn; tingjunhou@hotmail.com; Tel: +86-571-88208412
bState Key Laboratory of Quality Research in Chinese Medicine, Macau Institute For Applied Research in Medicine and Health, Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau (SAR), China
cInstitute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
dInstitute of Functional Nano and Soft Materials (FUNSOM), Soochow University, Suzhou, Jiangsu 215123, China
eState Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang 310058, China
First published on 7th April 2016
As one of the most popular computational approaches in modern structure-based drug design, molecular docking can be used not only to identify the correct conformation of a ligand within the target binding pocket but also to estimate the strength of the interaction between a target and a ligand. Nowadays, as a variety of docking programs are available to the scientific community, a comprehensive understanding of the advantages and limitations of each docking program is fundamentally important for conducting more reasonable docking studies and docking-based virtual screening. In the present study, based on an extensive dataset of 2002 protein–ligand complexes from the PDBbind database (version 2014), the performance of ten docking programs, including five commercial programs (LigandFit, Glide, GOLD, MOE Dock, and Surflex-Dock) and five academic programs (AutoDock, AutoDock Vina, LeDock, rDock, and UCSF DOCK), was systematically evaluated by examining the accuracies of binding pose prediction (sampling power) and binding affinity estimation (scoring power). Our results showed that GOLD and LeDock had the best sampling power (GOLD: 59.8% accuracy for the top scored poses; LeDock: 80.8% accuracy for the best poses) and AutoDock Vina had the best scoring power (rp/rs of 0.564/0.580 and 0.569/0.584 for the top scored poses and best poses, respectively), suggesting that the commercial programs did not show the expected advantage over the academic ones. Overall, the ligand binding poses could be identified in most cases by the evaluated docking programs, but the ranks of the binding affinities for the entire dataset could not be well predicted by most of them. However, for some protein families, relatively high linear correlations between docking scores and experimental binding affinities could be achieved. To our knowledge, this study is the most extensive evaluation of popular molecular docking programs in the last five years. It is expected that our work can offer useful guidance for applying these docking tools to different requirements and targets.
During the past two decades, a great variety of docking tools and programs, such as AutoDock,17 AutoDock Vina,18 LeDock,19 rDock,20 UCSF DOCK,21 LigandFit,22 Glide,23 GOLD,24 MOE Dock,25 and Surflex-Dock,26 have been developed for both commercial and academic use.27–29 For a docking program, the two most critical components are the sampling algorithm and the scoring function, which determine its sampling power and scoring power, respectively. The popular sampling algorithms can be roughly divided into three categories: shape matching, systematic search (such as exhaustive search, fragmentation, and conformational ensemble methods), and stochastic search (such as Monte Carlo methods, genetic algorithms, Tabu search methods, and swarm optimization methods);30 the popular scoring functions can be roughly grouped into three major classes: force-field, empirical, and knowledge-based scoring functions.31–33 More recently, some quantum mechanical (QM) and semi-empirical QM (SQM) based scoring functions have been designed to improve binding affinity ranking and native pose identification.34,35 With the rapid development of computer hardware, the problem of sampling efficiency can be effectively, or at least partially, overcome, but it is still a huge challenge for available scoring functions to predict the binding affinities of diverse small molecules with high accuracy.36
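To make the stochastic-search idea concrete, the following minimal Python sketch runs Metropolis Monte Carlo over a simplified pose vector. The pose representation and the `score_pose` function are hypothetical stand-ins, not the implementation of any program evaluated here.

```python
# Illustrative Metropolis Monte Carlo over a simplified pose vector
# (e.g., translation plus torsions); `score_pose` is a hypothetical
# stand-in for a real scoring function.
import math
import random

def metropolis_mc(pose, score_pose, n_steps=1000, step=0.5, kT=1.0):
    """Randomly perturb a pose vector and accept/reject moves by the
    Metropolis criterion; returns the best-scoring pose found."""
    current = list(pose)
    current_score = score_pose(current)
    best, best_score = list(current), current_score
    for _ in range(n_steps):
        trial = [x + random.gauss(0.0, step) for x in current]
        trial_score = score_pose(trial)
        # Always accept downhill moves; accept uphill moves with
        # Boltzmann probability.
        if trial_score < current_score or \
           random.random() < math.exp(-(trial_score - current_score) / kT):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score

# Toy usage: a 7-dimensional pose (x, y, z plus 4 torsions) and a dummy score.
if __name__ == "__main__":
    dummy_score = lambda p: sum(x * x for x in p)  # stands in for a real scoring function
    pose0 = [random.uniform(-3, 3) for _ in range(7)]
    print(metropolis_mc(pose0, dummy_score)[1])
```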
Because different sampling strategies and scoring functions are employed by different docking programs, it is important to evaluate and compare the performance of these programs. The evaluation results can reveal the advantages and limitations of each docking program, which may help users to make reasonable choices among them. To date, a number of evaluation studies assessing the accuracy of different molecular docking programs and workflows have been reported.25,37–42 However, most of the important comparative studies providing evaluation benchmarks were published before 2011, and similar comparative studies in the past five years have been quite limited. In 2013, Damm-Ganamet et al. published a paper on the benchmark exercise from the Community Structure-Activity Resource (CSAR), which documented the evaluation of the results for binding pose, enrichment and relative ranking of blinded congeneric series submitted by 20 different research groups.43 Undoubtedly, this work is very meaningful for the developer community, but it may not be very useful for users because the software comparison is anonymous and most of the evaluated programs are customized versions or in-house programs that are not easily accessible. More recently, Tuccinardi and colleagues reported an extensive consensus docking reliability analysis considering the consensus predictions of ten different docking procedures, and they found that consensus docking was able not only to predict the ligand binding pose better than any single docking program but also to give hints concerning the reliability of the docking poses.44 With the rapid development of docking algorithms, many traditional docking programs have been updated and some new docking engines have been developed, but the corresponding evaluation studies are outdated and insufficient. Generally speaking, although a large number of comparative studies have been reported in the past two decades, it remains difficult to determine which docking program is more suitable for specific targets. Therefore, extensive evaluation studies on the performance of popular docking programs and tools are still in great demand.
In this study, we evaluated the capabilities of ten molecular docking programs to predict ligand binding poses (sampling power) and rank binding affinities (scoring power). The evaluated docking programs include five academic programs (AutoDock, AutoDock Vina, LeDock, rDock, and UCSF DOCK) and five commercial programs (LigandFit, Glide, GOLD, MOE Dock, and Surflex-Dock). The features of the evaluated docking programs are outlined in Table 1. Most commercial docking programs are quite expensive, and it might therefore be expected that the commercial programs, with stronger funding support, would show better performance than the academic ones. With this evaluation study, we want to answer the following question: do the commercial docking programs have clear advantages over the academic ones? According to an analysis of all docking publications from 1990 to 2013, AutoDock, GOLD and Glide are the most commonly used programs among those evaluated.27 Certainly, this does not mean that these three programs are more accurate than the others. Hence the second question we want to answer: do the more popular docking programs show better performance than the less popular ones? Meanwhile, two newly released docking programs, LeDock and rDock, were included in the evaluation. We have tested a variety of new freely available programs in our routine docking work, and we selected LeDock and rDock over the others because they offered relatively better accuracy and speed. Certainly, compared with more established programs, these new programs may not be as well validated, and their performance remains to be confirmed. Hence the third question we want to answer: do the traditional docking programs show better performance than the newly released ones?
Program | Feature | Website |
---|---|---|
AutoDock17 | LGA-based docking software. Free for academic use. Maintained by the Molecular Graphics Laboratory, Scripps Research Institute, La Jolla. | http://autodock.scripps.edu/ |
AutoDock Vina18 | AutoDock Vina employs an iterated local search global optimizer. Free for academic use. Maintained by the Molecular Graphics Laboratory, The Scripps Research Institute, La Jolla. | http://vina.scripps.edu/ |
LeDock19 | LeDock is based on a combination of simulated annealing and evolutionary optimization of the ligand pose (position and orientation) and its rotatable bonds, using a physics/knowledge hybrid scoring scheme derived from prospective virtual screening campaigns. Free for academic use. Maintained by Lephar Research. | http://lephar.com/ |
rDock20 | rDock is based on a combination of stochastic and deterministic search techniques (GA and MC) to generate low energy ligand poses. Free for academic use. Maintained by the rDock Development Team. | http://rdock.sourceforge.net/ |
UCSF DOCK21 | Anchor-and-grow based docking program. Free for academic use. Maintained by the Shoichet group at the University of California San Francisco. | http://dock.compbio.ucsf.edu/ |
LigandFit22 | Ligand conformations generated using Monte Carlo techniques are initially docked into an active site based on the shape, followed by further CHARMm minimization. Provided by Accelrys. | http://accelrys.com/ |
Glide23 | Exhaustive search-based docking program. It has extra precision (XP), standard precision (SP) and high-throughput virtual screening (HTVS) scoring modes. Provided by Schrödinger. | http://www.schrodinger.com/ |
GOLD24 | GA-based docking program. Product of collaboration between the University of Sheffield, GlaxoSmithKline, and the Cambridge Crystallographic Data Centre. | http://www.ccdc.cam.ac.uk/ |
MOE Dock25 | MOE Dock supplies a database of conformations or generates conformations on the fly, and then refines the poses using a force field based method with MM/GBVI. Distributed by Chemical Computing Group. | http://www.chemcomp.com/ |
Surflex-Dock26 | Docking program based on a “protomol” that can be automatically generated and/or user-defined. Poses are scored using an updated and re-parameterized empirical scoring function (based on the Hammerhead docking system). Distributed by Tripos. | http://www.tripos.com/ |
The scoring power, which represents the ability of a scoring function to rank the binding capabilities of studied molecules, was quantitatively evaluated by Pearson's correlation coefficient (rp) and Spearman's ranking coefficient (rs) between the docking scores and experimental binding data.
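For concreteness, both coefficients can be computed directly from paired score/affinity lists, e.g. with SciPy. The numbers below are invented for illustration; note that the sign of rp depends on the score convention (more-negative-is-better docking scores yield negative correlations with pK values).

```python
# Minimal sketch of the scoring-power metrics used in this study:
# Pearson's rp and Spearman's rs between docking scores and
# experimental binding data. The example arrays are made up.
from scipy.stats import pearsonr, spearmanr

docking_scores = [-9.2, -8.5, -7.9, -7.1, -6.4]   # e.g., kcal/mol (lower = better)
experimental_pK = [8.1, 7.6, 6.9, 6.2, 5.8]       # e.g., pKd or pKi

rp, _ = pearsonr(docking_scores, experimental_pK)
rs, _ = spearmanr(docking_scores, experimental_pK)
print(f"rp = {rp:.3f}, rs = {rs:.3f}")
```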
Fig. 3 Success rates (A) and consistent rates (B) of ten docking programs in the docking power test. Optimized ligands were used as the input and 2.0 Å was used as the RMSD cutoff.
Among the free docking tools, LeDock and rDock exhibited eye-catching performance in ligand pose prediction, with LeDock even outperforming most commercial programs. As the authors mention, LeDock uses a combination of simulated annealing (SA) and a genetic algorithm (GA) to optimize the position, orientation, and rotatable bonds of the docked ligand.52 SA and GA are two popular stochastic optimization algorithms that have been widely employed by many docking programs; however, integrating the two algorithms in one tool is still rare. Our results suggest that employing blended sampling algorithms may be an effective strategy to improve the sampling power of a docking program. LeDock is a new molecular docking program, and we could not even find sufficient technical details of its docking algorithm. Nevertheless, in the present study it exhibited high accuracy with good speed (slightly faster than AutoDock Vina) and is a recommended program for virtual screening tasks.
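Since the precise algorithmic details of LeDock are not published, the following sketch only illustrates the general idea of blending SA into a GA: a Metropolis-style acceptance test inside the mutation step of an otherwise standard GA. This is our own illustration under stated assumptions, not LeDock's actual algorithm.

```python
# Hedged sketch of an SA/GA hybrid: GA supplies selection and crossover,
# while each mutation is accepted or rejected by an SA (Metropolis)
# criterion under a cooling temperature. Not LeDock's implementation.
import math, random

def hybrid_ga_sa(score, dim=7, pop_size=20, generations=50, t0=2.0):
    pop = [[random.uniform(-3, 3) for _ in range(dim)] for _ in range(pop_size)]
    for g in range(generations):
        temp = t0 * (1.0 - g / generations) + 1e-6      # linear cooling schedule
        pop.sort(key=score)
        survivors = pop[: pop_size // 2]                # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, dim)
            child = a[:cut] + b[cut:]                   # one-point crossover
            # SA-style mutation: keep the perturbed gene only if it
            # passes the Metropolis test at the current temperature.
            trial = list(child)
            trial[random.randrange(dim)] += random.gauss(0.0, 0.5)
            if score(trial) < score(child) or \
               random.random() < math.exp(-(score(trial) - score(child)) / temp):
                child = trial
            children.append(child)
        pop = survivors + children
    return min(pop, key=score)

# Toy usage with a dummy quadratic "scoring function".
print(hybrid_ga_sa(lambda p: sum(x * x for x in p)))
```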
In AutoDock, two sampling methods, the Lamarckian Genetic Algorithm (LGA) and Particle Swarm Optimization (PSO), were utilized to optimize the binding poses of each ligand within the protein binding pocket. As illustrated in Fig. 2A and C, the success rates for both the top scored poses and best poses predicted by AutoDock (PSO) are obviously higher than those predicted by AutoDock (LGA). In addition, we found that the PSO version is much faster than the LGA version. AutoDock Vina, the new generation of AutoDock, was also included in our evaluation. As shown in Fig. 2, the predictions of AutoDock Vina are slightly better than those of AutoDock (PSO) and substantially better than those of AutoDock (LGA). Our evaluation results for AutoDock (LGA) and AutoDock Vina are consistent with the findings reported by the AutoDock Vina developers (who compared only the LGA version of AutoDock), i.e., AutoDock Vina significantly improved the average accuracy of binding mode prediction compared with AutoDock.18
By comparing Fig. 2B and D, we found that the success rate of Surflex-Dock for the best poses was 80.0%, whereas that for the top scored poses was much lower (53.2%). The large gap between the prediction accuracies for the top scored poses and best poses reveals that the pose ranking capability of Surflex-Dock may be unsatisfactory and needs to be improved. Another unexpected outcome is that the performance of Glide with the XP scoring mode on the best poses is even worse than that of Glide with the SP scoring mode. In our previous studies, we also observed that XP scoring did not always perform better than SP scoring for many systems.53,54 By analyzing the binding conformations generated by Glide (SP) and Glide (XP), we found that the number of binding conformation clusters provided by XP was generally smaller than that provided by SP; in other words, the docking poses from SP are more diverse than those from XP, which may partially account for SP's better performance on the best poses.
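Although the clustering protocol inside Glide is not exposed to the user, cluster counts of the kind compared above can be obtained with a simple greedy RMSD clustering over score-sorted poses, as in this hypothetical sketch.

```python
# Greedy RMSD clustering of docking poses: a pose joins the first cluster
# whose representative is within the cutoff, otherwise it seeds a new one.
# rmsd() assumes identical atom ordering; real pipelines must also handle
# symmetry and atom matching.
import numpy as np

def rmsd(a, b):
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def count_clusters(poses, cutoff=2.0):
    """poses: list of (n_atoms, 3) coordinate arrays, assumed score-sorted."""
    representatives = []
    for pose in poses:
        if not any(rmsd(pose, rep) <= cutoff for rep in representatives):
            representatives.append(pose)
    return len(representatives)

# Toy usage with three random 10-atom "poses".
poses = [np.random.rand(10, 3) * 5 for _ in range(3)]
print(count_clusters(poses))
```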
Although the overall success rates for the top scored poses and best poses (Fig. 3A) can help us to distinguish the sampling power of the tested programs, they are still not comprehensive enough. In real cases, e.g., in virtual screening studies, the top scored poses are generally considered to be the most reasonable binding structures. However, we found a large difference between the success rates for the top scored poses and best poses, suggesting that the top scored poses are often not the best (or native) poses, which is mainly caused by the drawbacks of scoring functions. It has been reported that SQM-based scoring functions may be employed in the late stages of virtual screening to overcome this imbalance.34 Here, the consistent rate was used to assess the consistency between the predictions for the top scored poses and best poses. The consistent rate is defined as SRtsp/SRbp, where SRtsp and SRbp are the success rates for the top scored poses and best poses, respectively. As shown in Fig. 3B, the consistent rates of Glide (XP) and GOLD reach 87.7% and 82.5%, respectively. To some extent, these two programs may be more suitable for virtual screening studies.
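As a minimal illustration, the bookkeeping behind these three quantities can be expressed as follows; the per-complex RMSD values below are invented for the example.

```python
# Sketch of the docking-power bookkeeping: success rate for top scored
# poses (SRtsp), for best poses (SRbp), and the consistent rate
# SRtsp / SRbp, using the 2.0 A RMSD cutoff from the main text.
CUTOFF = 2.0

# For each complex: RMSD of the top scored pose to the crystal pose, and
# the minimum RMSD over all sampled poses (the "best pose").
results = [
    {"top_rmsd": 1.2, "best_rmsd": 0.8},
    {"top_rmsd": 3.5, "best_rmsd": 1.6},
    {"top_rmsd": 1.9, "best_rmsd": 1.1},
    {"top_rmsd": 4.8, "best_rmsd": 4.1},
]

sr_tsp = sum(r["top_rmsd"] <= CUTOFF for r in results) / len(results)
sr_bp = sum(r["best_rmsd"] <= CUTOFF for r in results) / len(results)
print(f"SRtsp = {sr_tsp:.1%}, SRbp = {sr_bp:.1%}, "
      f"consistent rate = {sr_tsp / sr_bp:.1%}")
```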
We then analyzed the failure cases with large prediction errors. A total of 72 crystal structures could not be well predicted by any docking program (Table S1, ESI†). We believe the following two reasons account for the unsuccessful docking. First, as shown in Table S2 (ESI†), about 82.0% (59/72) of the ligands in the failure cases are not neutral, which suggests that current docking methodologies are still not accurate enough to handle charged systems. Second, the large flexibility of ligands is another key factor leading to failure: as listed in Table S2 (ESI†), more than half (40/72) of these ligands contain more than 10 rotatable bonds.
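For reference, both descriptors used in this failure analysis (formal charge and rotatable-bond count) can be computed with RDKit; the SMILES string below is an arbitrary example, not one of the 72 failure cases.

```python
# Computing the two failure-case descriptors discussed above with RDKit.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(cc1)C(=O)[O-]")  # example anionic ligand
print("formal charge:", Chem.GetFormalCharge(mol))
print("rotatable bonds:", rdMolDescriptors.CalcNumRotatableBonds(mol))
```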
In the PDBbind refined set, some ligands are small peptides or peptide mimics. The properties of peptides and peptide mimics are more similar to those of proteins, e.g., higher molecular weights and more rotatable bonds. Generally, a peptide or peptide mimic ligand is more difficult to dock successfully. To further investigate the performance of the tested programs on peptides and peptide mimics, the whole dataset was separated into two groups: regular organic molecule ligands and peptide or peptide mimic ligands (a simple heuristic for flagging such ligands is sketched after Table 2). The numbers of regular organic molecule ligands and peptide or peptide mimic ligands are 1843 and 159, respectively. The success rates for the two types of ligands are summarized in Table 2. As we expected, the predictions for organic ligands are significantly better than those for peptide or peptide mimic ligands for all docking programs. It is notable that for peptide or peptide mimic ligands Surflex-Dock achieves success rates of 44.0% and 67.3% for the top scored poses and best poses, respectively.
Docking program | Regular organic molecule (top scored pose) | Regular organic molecule (best pose) | Peptide or peptide mimic (top scored pose) | Peptide or peptide mimic (best pose)
---|---|---|---|---
AutoDock (LGA) | 0.378 | 0.559 | 0.216 | 0.324
AutoDock (PSO) | 0.477 | 0.686 | 0.331 | 0.439
AutoDock Vina | 0.485 | 0.726 | 0.384 | 0.597
LeDock | 0.574 | 0.808 | 0.352 | 0.465
rDock | 0.503 | 0.763 | 0.283 | 0.465
UCSF DOCK | 0.445 | 0.591 | 0.340 | 0.415
LigandFit | 0.479 | 0.689 | 0.267 | 0.504
Glide (SP) | 0.544 | 0.754 | 0.403 | 0.547
Glide (XP) | 0.584 | 0.666 | 0.403 | 0.484
GOLD | 0.599 | 0.726 | 0.371 | 0.472
MOE Dock | 0.457 | 0.612 | 0.195 | 0.245
Surflex-Dock | 0.533 | 0.800 | 0.440 | 0.673
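As referenced above, one simple, hypothetical way to flag peptide-like ligands when splitting such a dataset is to count amide bonds with a SMARTS pattern; the pattern and threshold below are our own illustrative choices, not the classification criteria used in this study.

```python
# Hedged peptide-likeness heuristic: count amide bonds via SMARTS and
# apply a threshold. Illustrative only; not the criteria of this paper.
from rdkit import Chem

AMIDE = Chem.MolFromSmarts("C(=O)N")

def looks_peptide_like(smiles, min_amides=3):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return len(mol.GetSubstructMatches(AMIDE)) >= min_amides

# A tri-alanine-like chain triggers the heuristic.
print(looks_peptide_like("CC(=O)NC(C)C(=O)NC(C)C(=O)NC(C)C(=O)O"))  # True
```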
As shown in Fig. 5, the cumulative distributions of the docking results of AutoDock, AutoDock Vina, LigandFit and GOLD based on different starting conformations show relatively large fluctuations, suggesting that these docking programs are relatively more sensitive to the initial structures of ligands. We found that the pose prediction accuracies of original docking and rotated docking (where the input ligand structure retains a memory of the crystallized conformation) were generally better than those of optimized docking (where the input ligand structure ‘forgets’ the crystallized conformation) for most docking programs. This is consistent with the results reported by Onodera and colleagues, i.e., if the input ligand structure is similar to the native one, better poses are usually predicted by docking programs.57 Among the tested docking programs, LeDock, rDock, UCSF DOCK, Glide, MOE Dock and Surflex-Dock are not sensitive to the starting conformations of ligands; that is to say, the sampling algorithms implemented in these programs are quite robust.
Fig. 5 Comparison of the cumulative distributions of the top scored poses and best poses for docking with different starting conformations.
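For readers who wish to reproduce this kind of input-sensitivity test, the sketch below shows one plausible way to generate the "rotated" and "optimized" input variants with RDKit; the file name is a placeholder and the exact preparation protocol of this study may differ.

```python
# Sketch of the three input variants compared in Fig. 5: "original" keeps
# the crystal pose; "rotated" applies a random rigid-body rotation (the
# geometry is remembered, the orientation is not); "optimized" re-embeds
# the molecule so the conformation forgets the crystal pose.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Geometry import Point3D

mol = Chem.MolFromMolFile("crystal_ligand.mol")  # placeholder path

# Rotated variant: random proper rotation about the ligand centroid.
rotated = Chem.Mol(mol)
conf = rotated.GetConformer()
xyz = np.array([[conf.GetAtomPosition(i).x,
                 conf.GetAtomPosition(i).y,
                 conf.GetAtomPosition(i).z]
                for i in range(rotated.GetNumAtoms())])
q, _ = np.linalg.qr(np.random.randn(3, 3))       # random orthogonal matrix
if np.linalg.det(q) < 0:
    q[:, 0] *= -1                                 # ensure a proper rotation
center = xyz.mean(axis=0)
for i, (x, y, z) in enumerate((xyz - center) @ q + center):
    conf.SetAtomPosition(i, Point3D(float(x), float(y), float(z)))

# Optimized variant: fresh 3D embedding plus force-field minimization.
optimized = Chem.AddHs(Chem.Mol(mol))
AllChem.EmbedMolecule(optimized, randomSeed=42)   # ETKDG by default
AllChem.MMFFOptimizeMolecule(optimized)
```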
The Pearson correlation coefficient (rp) and Spearman ranking coefficient (rs) between the docking scores and experimental binding affinities for the entire test set are summarized in Table 3. The docking program with the best scoring power is AutoDock Vina, which produced rp (rs) of 0.564 (0.580) and 0.569 (0.584) for the top scored poses and best poses, respectively. The next two top-ranked docking tools are MOE Dock and GOLD, which gave rp (rs) of 0.564 (0.589) and 0.500 (0.515) for the top scored poses, respectively, and rp (rs) of 0.411 (0.457) and 0.494 (0.513) for the best poses, respectively. Unexpectedly, we found no obvious difference in the scoring powers between the top scored poses and best poses for most docking programs, except MOE Dock. On the whole, the scoring powers of the tested docking programs on the entire test set are not quite satisfactory. Based on the Spearman ranking coefficients for the top scored poses, the academic programs can be ordered as follows: AutoDock Vina (0.580) > AutoDock (PSO) (0.534) > LeDock (0.462) > UCSF DOCK (0.331) > rDock (0.017), and the commercial programs as follows: MOE Dock (0.589) > GOLD (0.515) > Glide (0.473) > Surflex-Dock (0.370) > LigandFit (0.221). Overall, compared with the academic programs, the commercial programs do not show an improved capability to rank the binding affinities for a diverse dataset. Moreover, it seems that the good performance of a scoring function in identifying correct binding poses cannot guarantee its good performance in ranking binding affinities. For example, rDock has relatively good sampling power, but its ranking power is quite low; GOLD has the best sampling power for the top scored poses, but its ranking power for the top scored poses is not the best. Clearly, no single docking program outperformed all others in both sampling power and scoring power. Therefore, the best solution for docking-based virtual screening would be to combine different docking tools into a single platform, which could benefit from the advantages of different approaches. For example, we can use LeDock to virtually screen a chemical database and then use AutoDock Vina or MOE Dock to rescore the top scored poses predicted by LeDock (a minimal sketch of such a rescoring step is given after Table 3).
Docking program | Correlation coefficient | Top scored pose | Best pose
---|---|---|---
AutoDock (LGA) | rp | 0.433 ± 0.009 | 0.404 ± 0.009
 | rs | 0.477 ± 0.008 | 0.450 ± 0.009
AutoDock (PSO) | rp | 0.492 ± 0.008 | 0.466 ± 0.008
 | rs | 0.534 ± 0.007 | 0.513 ± 0.008
AutoDock Vina | rp | 0.564 ± 0.008 | 0.569 ± 0.008
 | rs | 0.580 ± 0.008 | 0.584 ± 0.008
LeDock | rp | 0.442 ± 0.009 | 0.463 ± 0.009
 | rs | 0.462 ± 0.010 | 0.486 ± 0.009
rDock | rp | −0.015 ± 0.011 | −0.021 ± 0.011
 | rs | −0.017 ± 0.011 | −0.005 ± 0.011
UCSF DOCK | rp | 0.291 ± 0.010 | 0.276 ± 0.011
 | rs | 0.331 ± 0.011 | 0.323 ± 0.011
LigandFit | rp | −0.132 ± 0.011 | −0.105 ± 0.011
 | rs | −0.221 ± 0.012 | −0.192 ± 0.012
Glide (SP) | rp | 0.444 ± 0.008 | 0.402 ± 0.009
 | rs | 0.473 ± 0.009 | 0.419 ± 0.010
Glide (XP) | rp | 0.367 ± 0.010 | 0.356 ± 0.010
 | rs | 0.389 ± 0.010 | 0.374 ± 0.010
GOLD | rp | −0.500 ± 0.008 | −0.494 ± 0.008
 | rs | −0.515 ± 0.008 | −0.513 ± 0.008
MOE Dock | rp | 0.564 ± 0.008 | 0.411 ± 0.009
 | rs | 0.589 ± 0.009 | 0.457 ± 0.009
Surflex-Dock | rp | −0.340 ± 0.009 | −0.350 ± 0.009
 | rs | −0.370 ± 0.009 | −0.382 ± 0.009

rp: Pearson's correlation coefficient; rs: Spearman's ranking coefficient. Standard errors were estimated by randomly resampling 80% of the tested dataset 100 times.
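As a concrete, hedged illustration of the combination strategy suggested above, the sketch below rescores docking poses (e.g., from a LeDock screen, after conversion to PDBQT with a tool such as Open Babel, which is outside this sketch) using AutoDock Vina's score-only mode; all file names are placeholders.

```python
# Two-stage workflow sketch: rescore externally generated poses with
# AutoDock Vina's --score_only mode. File names are placeholders.
import subprocess

def vina_rescore(receptor_pdbqt, ligand_pdbqt):
    """Run Vina in score-only mode and return its raw text output."""
    out = subprocess.run(
        ["vina", "--receptor", receptor_pdbqt,
         "--ligand", ligand_pdbqt, "--score_only"],
        capture_output=True, text=True, check=True)
    return out.stdout

# Poses from the first-stage screen, already converted to PDBQT.
for pose in ["pose_001.pdbqt", "pose_002.pdbqt"]:
    print(pose, vina_rescore("receptor.pdbqt", pose))
```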
The scoring power of the same docking tool on different protein families is widely divergent; for example, the Pearson correlation coefficients of LeDock for the groups b.47.1.2 and b.50.1.2 for the top scored poses are 0.698 and 0.770, respectively, while those for b.50.1.1 and c.94.1.1 are only −0.010 and 0.176, respectively. On the other hand, the performances of different docking programs on the same protein family also vary considerably. As shown in Fig. 6, the Pearson correlation coefficients of AutoDock, rDock, and LigandFit for the group b.50.1.2 for both the top scored poses and best poses are less than 0.500, while those of LeDock, Glide (XP), GOLD, and Surflex-Dock are around 0.700 or −0.700. This result underscores the importance of selecting the right program for a given target. Notably, the performances of all investigated programs on the b.50.1.1 subset, the largest group, were much worse than on the other subsets. The proteins in family b.50.1.1 are HIV proteases, and several reports have pointed out that both the lack of consideration of the entropic term in the scoring functions and the narrow distribution of the experimental binding free energies may contribute to the low correlation.58–60
Among the tested programs, three, namely AutoDock Vina, GOLD and MOE Dock, achieved the best scoring powers, with rp/rs of 0.564/0.580, 0.500/0.515 and 0.569/0.589 for the top scored poses, respectively. However, the relatively weak correlation between the docking scores and experimental binding affinities for the entire dataset indicates that current scoring functions are still not sufficiently reliable or universal. Evaluation of the scoring powers on different protein families illustrates that the scoring power of the same docking tool varies widely across families (rp from 0.000 to 0.800), and therefore different docking programs may be preferred for different protein families.
Our evaluation results illustrate that no single docking program shows dominant advantages over all the others. Combining different docking tools into a single platform may be a practical way to achieve better predictions for docking-based virtual screening. To sum up, we have produced an updated, comprehensive docking benchmark with emphasis on sampling power and scoring power, and we expect our work to provide a useful reference for selecting the most appropriate docking program for a given project.
Footnote
† Electronic supplementary information (ESI) available: Fig. S1: distributions of five properties of 1790 FDA approved drugs; Table S1: the unsuccessful docking instances of each tested docking program in the benchmark; and Table S2: the formal charge and the number of rotatable bonds of the ligands in the 72 failure cases that could not be successfully predicted by any docking program. See DOI: 10.1039/c6cp01555g