Xiuming Li‡a,
Xin Yan‡*a,
Yuedong Yangc,
Qiong Gua,
Huihao Zhoua,
Yunfei Duc,
Yutong Luc,
Jielou Liaod and
Jun Xu*ab
aResearch Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou 510006, China. E-mail: junxu@biochemomes.com
bSchool of Computer Science & Technology, Wuyi University, 99 Yingbin Road, Jiangmen 529020, China
cNational Supercomputer Center in Guangzhou, School of Data and Computer Science, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou 510006, China
dDepartment of Chemical Physics, University of Science and Technology of China, Jinzhai Road 96, Hefei 230026, China
First published on 29th January 2019
That similar structures have similar activities is a dogma for identifying new functional molecules. However, it is not rare for a minor structural change to cause a significant change in activity. Methods for measuring molecular similarity fall into two categories: overall three-dimensional shape-based methods and local substructure-based methods. The former relate overall similarity to activity and are represented by conventional similarity algorithms. The latter relate local substructure to activity and are represented by conventional substructure match algorithms. In practice, the similarity of two molecules with similar activity depends on contributions from both overall similarity and local substructure match. We report a new tool, termed local-weighted structural alignment (LSA), for pharmaceutical virtual screening, which computes the similarity of two molecular structures by considering the contributions of both overall similarity and local substructure match. LSA consists of three steps: (1) mapping a common substructure between two molecular topological structures; (2) superimposing two three-dimensional molecular structures with substructure focus; (3) computing the similarity score based on the superimposition. LSA has been validated with 102 test compound libraries from the DUD-E collection, giving an average AUC (the area under the receiver-operating characteristic curve) of 0.82 and an average EF1% (the enrichment factor at the top 1%) of 27.0, a consistently better performance than conventional approaches. LSA is implemented in C++ and runs on Linux and Windows systems.
In medicinal chemistry, the functional groups (substructures) in a molecule do not contribute to the activity equally. One substructure13 can be significantly more important than the others, and is termed a privileged substructure (or fragment).14 Fig. 1 shows an HDAC (histone deacetylase)15 inhibitor and its privileged substructure (highlighted with a red circle). This substructure is the core substructure because a pan-HDAC inhibitor must have a chelator “warhead” that binds the Zn2+ ion. Without this core substructure, the agent will not be active, regardless of how similar the rest of the molecule is to an HDAC inhibitor. A substructure match algorithm (such as GMA (ref. 8)) can be employed to determine whether a molecule qualifies as a potential HDAC inhibitor by checking for the chelator16 “warhead” in the molecule.
However, the rest of an HDAC inhibitor is still important and is responsible for selective binding to the HDAC target (molecular recognition). The molecular recognition part of the HDAC inhibitor is associated with the overall molecular structure similarity,17 which can be calculated through global shape comparison (three-dimensional structure superimposition). A molecular shape comparison algorithm can be used to predict the potency of a molecule as an HDAC inhibitor by calculating its overall similarity to a known HDAC inhibitor.
LSA is therefore designed to compute the similarity of two molecular structures by considering the contributions of both overall similarity and local substructure match.
LSA consists of the following main steps:
(1) Mapping a common substructure between two molecular topological structures.
(2) Superimposing two three-dimensional molecular structures with substructure focus. When superimposing, LSA assigns weights to the atoms in the substructure mappings obtained in step (1).
(3) Computing the similarity score from the superimposition using the Tanimoto protocol.
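Step (3) can be illustrated with a minimal sketch of a Tanimoto score over first-order Gaussian overlap volumes. This is not the LSA or rWEGA implementation: the constants `ALPHA` and `P`, the function names, and the way the per-atom weights multiply each pairwise overlap are all assumptions made for illustration, and the sketch does not reproduce eqn (1)–(11).

```python
import math

# Illustrative constants (assumptions): one Gaussian width and amplitude for all atoms.
ALPHA = 0.8
P = 2.0 * math.sqrt(2.0)

def gaussian_overlap(ra, rb, alpha=ALPHA, p=P):
    """Overlap integral of two spherical Gaussians of equal width centred at ra and rb."""
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return p * p * (math.pi / (2.0 * alpha)) ** 1.5 * math.exp(-alpha * d2 / 2.0)

def overlap_volume(coords_a, coords_b, w_a, w_b):
    """Sum of pairwise overlaps; each pair is scaled by the product of its atom weights."""
    return sum(wa * wb * gaussian_overlap(ra, rb)
               for ra, wa in zip(coords_a, w_a)
               for rb, wb in zip(coords_b, w_b))

def tanimoto(coords_a, coords_b, w_a=None, w_b=None):
    """Tanimoto score V_AB / (V_AA + V_BB - V_AB) for two superimposed molecules."""
    w_a = w_a or [1.0] * len(coords_a)
    w_b = w_b or [1.0] * len(coords_b)
    v_ab = overlap_volume(coords_a, coords_b, w_a, w_b)
    v_aa = overlap_volume(coords_a, coords_a, w_a, w_a)
    v_bb = overlap_volume(coords_b, coords_b, w_b, w_b)
    return v_ab / (v_aa + v_bb - v_ab)

# Toy coordinates (hypothetical): the first two atoms play the role of the mapped core.
mol_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
mol_b = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.2, 0.0)]
plain = tanimoto(mol_a, mol_b)
core_weighted = tanimoto(mol_a, mol_b, w_a=[2.0, 2.0, 1.0], w_b=[2.0, 2.0, 1.0])
print(plain, core_weighted)
```

Raising the weights of the mapped atoms makes their overlap dominate the score, which is the general idea behind the substructure focus; the actual weights in LSA come from eqn (1) and (2).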
With this restriction, rWEGA no longer treats every overlaid atom pair equally when calculating steric structure similarity. Each atom in a mapped atom pair is assigned a weight, wa if the atom is in molecule A or wb if the atom is in molecule B, to reflect that these atoms contribute more to the activity than the other atoms. The weights are computed with eqn (1) and (2).
(1) |
(2) |
The LSA similarity scoring calculation in rWEGA is described in Algorithm 1.
Let vij be the intersection weighted-Gaussian volume4 of atom i and atom j. VAC, VBC, VAE, VBE, VC, VE are computed in eqn (3)–(8):
(3) |
(4) |
(5) |
(6) |
(7) |
(8) |
Sc(Mj) is the core substructure similarity of A and B based on the jth mapping. Sec(Mj) is the structural similarity of A and B based on the jth mapping, excluding the core substructures. S(Mj) is the similarity of A and B based on the jth mapping. If M has multiple substructure mappings, the maximum S(Mj) is taken as the similarity S(M) of A and B. Sc(Mj), Sec(Mj), S(Mj) and S(M) are computed in eqn (9)–(12):
(9) |
(10) |
(11) |
$$S(M) = \max_{j \in \{1,\dots,n_m\}} S(M_j) \tag{12}$$
If molecule B has multiple conformations, the maximum over the conformations is taken as the final similarity of A and B. Let Si(M) be the similarity of A and B for the ith conformation of B. The final similarity score S(A, B) of molecules A and B is calculated from the values of Si(M) as shown in eqn (13).
$$S(A, B) = \max_{i \in \{1,\dots,n_c\}} S_i(M) \tag{13}$$
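The two maximisations in eqn (12) and (13) can be sketched as a nested maximum. The function names and the layout of the score table below are assumptions for illustration; the per-mapping scores S(Mj) are taken as given and the numbers are made up.

```python
def similarity_over_mappings(mapping_scores):
    """Eqn (12): S(M) is the best score over the substructure mappings M_j."""
    return max(mapping_scores)

def final_similarity(scores_by_conformation):
    """Eqn (13): S(A, B) is the best S_i(M) over the conformations of B.

    scores_by_conformation[i][j] holds S(M_j) for the i-th conformation of B.
    """
    return max(similarity_over_mappings(row) for row in scores_by_conformation)

# Hypothetical scores: three conformations of B, two substructure mappings each.
scores = [
    [0.41, 0.63],
    [0.58, 0.52],
    [0.70, 0.35],
]
best = final_similarity(scores)
print(best)
```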
To validate LSA, three-dimensional conformations of the compound structures in the libraries were generated with the CAESAR (ref. 20) module in Discovery Studio (version 3.5) with an energy interval of 20 kcal mol−1. The core query substructures (CQS) were specified by reference to the common structure of the “active” molecules derived from DUD-E.¶
AUC (the area under the receiver-operating characteristic curve) values and enrichment factors (EF) at the top x% (x = 1, 5, 10) are used to measure the performance of LSA in pharmaceutical virtual screening experiments. EFx% is calculated as:
$$\mathrm{EF}_{x\%} = \frac{TP_{x\%} / N_{\%\mathrm{selected}}}{N_{\mathrm{actives}} / N_{\mathrm{total}}} \tag{14}$$
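Both metrics can be computed from a ranked list of scores, as in the minimal sketch below. It assumes that TPx% is the number of actives among the top x% of the ranked library and that the denominator of the first ratio is the number of compounds in that top x%; the rounding of the cut-off, the function names and the toy data are likewise assumptions, not taken from the LSA code.

```python
def enrichment_factor(scores, labels, fraction):
    """EF at the given top fraction: hit rate in the selection over the overall hit rate."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_total = len(ranked)
    n_selected = max(1, round(n_total * fraction))  # rounding rule is an assumption
    tp = sum(label for _, label in ranked[:n_selected])
    n_actives = sum(labels)
    return (tp / n_selected) / (n_actives / n_total)

def roc_auc(scores, labels):
    """AUC as the probability that a random active outscores a random decoy (ties count 0.5)."""
    actives = [s for s, l in zip(scores, labels) if l]
    decoys = [s for s, l in zip(scores, labels) if not l]
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))

# Toy library (hypothetical): 20 compounds, 4 actives ranked 1st, 2nd, 5th and 12th.
scores = [20 - i for i in range(20)]
labels = [1 if i in {0, 1, 4, 11} else 0 for i in range(20)]
ef10 = enrichment_factor(scores, labels, 0.10)
auc = roc_auc(scores, labels)
print(ef10, auc)
```

In this toy case the top 10% holds 2 compounds, both active, so the selection hit rate is 1.0 against an overall hit rate of 0.2.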
Method | AUC | EF1% | EF5% | EF10% |
---|---|---|---|---|
WEGA | 0.74 | 20.7 | 7.5 | 4.4 |
Rigid-LS-align | — | 20.1 | 6.9 | 4.3 |
Flexi-LS-align | 0.75 | 22.0 | 7.2 | 4.5 |
SPOT-ligand2 | — | 24.1 | 8.6 | 5.2 |
LSA | 0.82 | 27.0 | 10.3 | 6.1 |
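The relative EF1% gains of LSA over the other methods follow directly from the EF1% column of the table above. The short check below only re-derives those percentages from the tabulated values; the helper name is an assumption.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

# Average EF1% values copied from the table above.
ef1 = {"WEGA": 20.7, "Flexi-LS-align": 22.0, "SPOT-ligand2": 24.1, "LSA": 27.0}

gains = {
    method: round(relative_gain(ef1["LSA"], value), 1)
    for method, value in ef1.items()
    if method != "LSA"
}
print(gains)
```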
Compared to WEGA, the screening performance of LSA was significantly improved. The mean AUC over the DUD-E collection is 0.82 with LSA and 0.74 with WEGA. LSA achieves an average EF1% of 27.0, which is about 30.4% higher than that of WEGA. We also calculated the median AUC, which is 0.84 with LSA and 0.72 with WEGA. The LSA and WEGA results were compared with the Wilcoxon signed rank test (p < 0.001). The virtual screening performance improved for 89.2% (91/102) of the libraries with LSA, indicating that LSA performed consistently better than WEGA. The detailed AUC results are provided in the ESI (Table S1†). The ROC (receiver-operating characteristic) curves for the 12 targeted libraries with the largest performance improvement (targeting UROK, CAH2, HDAC2, ADRB1, PYGM, ADRB2, MK10, PLK1, HIVINT, ACE, ROCK1 and THRB) are depicted in Fig. 3, in which the red curves are for LSA and the black curves are for WEGA. The turning points usually occur in the early part of the ROC curves, indicating that screening less than 20% of the compounds in a library can capture more than 80% of the intrinsic hits with LSA.
Fig. 3 ROC curves for virtual screening of the 12 targeted libraries with the largest performance improvement, using LSA and WEGA. The curves in red are for LSA and the curves in black are for WEGA.
We further compared LSA with LS-align21 and SPOT-ligand2 (ref. 22), which were reported recently. LSA again performed consistently better. The EF1% value of LSA is 22.7% and 12.0% higher than those of Flexi-LS-align and SPOT-ligand2, respectively. To further investigate the performance within DUD-E, we split the DUD-E collection into four categories,21 namely kinases, proteases, nuclear receptors and GPCRs. The EF results of WEGA, Rigid-LS-align and LSA are listed in Table 2.
Categories (#proteins) | Method | EF1% | EF5% | EF10% |
---|---|---|---|---|
Kinases (26) | WEGA | 17.7 | 6.4 | 3.8 |
Kinases (26) | Rigid-LS-align | 19 | 6.5 | 4.2 |
Kinases (26) | LSA | 26.4 | 10.2 | 5.9 |
Proteases (15) | WEGA | 14.4 | 6.2 | 4.0 |
Proteases (15) | Rigid-LS-align | 15.4 | 6.3 | 4.3 |
Proteases (15) | LSA | 24.9 | 11.3 | 6.5 |
Nuclear receptors (11) | WEGA | 27.8 | 9.0 | 5.4 |
Nuclear receptors (11) | Rigid-LS-align | 22.2 | 7.2 | 4.6 |
Nuclear receptors (11) | LSA | 22.3 | 8.9 | 5.7 |
GPCRs (5) | WEGA | 9.6 | 3.8 | 2.7 |
GPCRs (5) | Rigid-LS-align | 16.6 | 5.5 | 3.6 |
GPCRs (5) | LSA | 18.0 | 7.0 | 5.9 |
Fig. 4 The superimposed structures. The core substructures are superimposed in the magnifier. The molecule in green is CHEMBL343068 and the other molecule is CHEMBL275089 as in Fig. 1. |
However, a query structure may contain multiple privileged core substructures. LSA cannot handle such cases, although they are rare.
Successful application of LSA also depends on correctly specifying a core query substructure in the template structure. An overly large core query substructure may result in no hits, so the user has to balance the size of the core query substructure against the number of hits it can retrieve. Our experience indicates that LSA is more suitable for screening bioactive compounds with a “warhead”, or a covalent binding group.23
Footnotes
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c8ra08915a
‡ Equal contributors.
§ The Directory of Useful Decoys (DUD-E) collection is available at http://dude.docking.org/.
¶ LSA software and user guide can be downloaded for academic use at https://github.com/MingCPU/LSA.git
This journal is © The Royal Society of Chemistry 2019