Taras Voitsitskyi*ac,
Roman Stratiichukad,
Ihor Koleieva,
Leonid Popryhoa,
Zakhar Ostrovskya,
Pavlo Henitsoia,
Ivan Khropachova,
Volodymyr Vozniaka,
Roman Zhytara,
Diana Nechepurenkoa,
Semen Yesylevskyyabc,
Alan Nafiieva and
Serhii Starosylaa
aReceptor.AI Inc., 20-22 Wenlock Road, London N1 7GU, UK. E-mail: taras270698@gmail.com
bInstitute of Organic Chemistry and Biochemistry, Czech Academy of Sciences, CZ-166 10 Prague 6, Czech Republic
cDepartment of Physics of Biological Systems, Institute of Physics of The National Academy of Sciences of Ukraine, Nauky Ave. 46, 03038, Kyiv, Ukraine
dDepartment of Biophysics and Medical Informatics, Educational and Scientific Centre “Institute of Biology and Medicine”, Taras Shevchenko National University of Kyiv, 64 Volodymyrska Str., 01601 Kyiv, Ukraine
First published on 31st March 2023
Accurate prediction of the drug-target affinity (DTA) in silico is of critical importance for modern drug discovery. Computational methods of DTA prediction, applied in the early stages of drug development, are able to speed it up and cut its cost significantly. A wide range of approaches based on machine learning were recently proposed for DTA assessment. The most promising of them are based on deep learning techniques and graph neural networks to encode molecular structures. The recent breakthrough in protein structure prediction made by AlphaFold made an unprecedented amount of proteins without experimentally defined structures accessible for computational DTA prediction. In this work, we propose a new deep learning DTA model 3DProtDTA, which utilises AlphaFold structure predictions in conjunction with the graph representation of proteins. The model is superior to its rivals on common benchmarking datasets and has potential for further improvement.
The most accurate computational estimate of DTA could be obtained from atomistic molecular dynamics simulations (either classical, quantum or hybrid) combined with one of the modern techniques of computing the free energy of ligand binding.5 However, accuracy comes at the cost of very high computational demands, which makes these methods generally impractical for large-scale virtual screening.
That is why the common method of choice for estimating DTA in modern drug discovery is molecular docking, which provides a reasonable compromise between accuracy and computational efficiency.6 However, it is generally believed that empirical scoring functions used in molecular docking have already approached the practical limit of accuracy, which is unlikely to be improved without introducing an additional computational burden.
In order to address these drawbacks the classical machine learning (ML) methods for determining DTA were developed. These methods do not depend on computing physical interactions between the target protein and the ligand. They are purely knowledge-based and rely on the idea that similar ligands tend to interact with similar protein targets, which are encoded into the pre-trained neural networks. Applying such models to any given ligand is blazingly fast, which allows using them for screening the numbers of compounds, which are unreachable for molecular docking or MD simulations. However, even the most successful methods from this category, such as KronRLS7 and SimBoost,8 were recently outperformed by more modern deep learning (DL) techniques.
Seminal DL methods of assessing DTA used string representation of the target's amino acid sequence and a simplified linear representation of the ligands using SMILES (molecular-input line-entry system), which were subsequently encoded by 1D convolutional neural networks (CNNs) and/or long-short-term memory (LSTM) blocks.9,10 The same linear representations were shown to be efficient for DTA prediction in combination with generative adversarial networks (GANs).11 However, it is obvious that linear representations lead to a huge loss of information and keeping the knowledge about connectivity and 3D arrangements of both the target protein and the ligands could improve the results. Indeed, the introduction of graph neural networks (GNNs), which preserve information about connectivity and the 3D arrangement of atoms, improved the scores of the models on most of the benchmarking DTA datasets.12
In contrast to the sequence-based techniques, the performance of the GNNs depends strongly on the availability and accuracy of 3D structures of target proteins. The scarcity of such structures limited the size of the training datasets and hampered the progress of GNN-based DTA models. The huge success of AlphaFold13 in protein structure prediction opens new opportunities for developing better DTA prediction models. Accurately predicted 3D structure of druggable protein domains that don't have experimentally resolved structures, adds unprecedented amounts of data for training ML models and improving their performance.
In this article, we proposed a new method of constructing efficient residue-level protein graphs based on the target's 3D structure predicted by AlphaFold and selected the best GNN architectures for this kind of data. This resulted in a new deep-learning model for predicting drug-target affinities: 3DProtDTA. When applied to common benchmark datasets our model is superior to its rivals on all evaluated metrics. The perspectives of applications and further improvements of our model are discussed.
The Davis dataset contains the pairs of kinase proteins and their respective inhibitors with experimentally determined dissociation constant (Kd) values. Kd values were transformed by eqn (1) and transformed scores were used as labels for benchmarking in the same way as in the baseline approaches. There are 442 proteins and 68 ligands in this dataset.
(1) |
The KIBA dataset comprises scores originating from an approach called KIBA, in which inhibitor bioactivities from different sources such as Ki, Kd and IC50 are combined. The KIBA scores were pre-processed by the SimBoost algorithm8 and the final values were used as labels for model training. Initially, the KIBA dataset contained 467 proteins and 52498 ligands. For benchmarking purposes, the same authors8 filtered the dataset to keep only the drugs and targets with at least 10 samples resulting in 229 unique proteins and 2111 unique ligands.
The numbers of affinity scores and unique entries in the datasets are summarised in Table 1.
Dataset | Proteins | Ligands | Samples |
---|---|---|---|
Davis | 442 | 68 | 30056 |
KIBA | 229 | 2111 | 118254 |
We used isomorphic SMILES strings for both datasets and UniProt16 accession codes for the KIBA dataset, provided by DeepDTA,10 as initial entries representation. For the Davis dataset, proteins with the same UniProt accession codes may represent different entries. Therefore, unique identifiers were assigned to each target entry annotated by UniProt accession code, mutation type, phosphorylation status, proteins in a complex, and protein domains. Fig. 1 shows distributions of labels for Davis (a) and KIBA (b) datasets.
Fig. 1 Distribution of Davis (top) and KIBA (bottom) labels used directly in benchmarking (note that the y-axis is log-scale). |
Python API of an open-source cheminformatics package RDKit v. 2021.03 was used to generate both ligand representations based on isomorphic SMILES. We calculated 1024 bit vectors of classical Morgan fingerprints with radius 2 and 1024-bit vectors of feature-based fingerprint invariants19 with the same radius. Both vectors were concatenated into a single 2048-bit vector.
The graphs for ligands were generated on the atomistic level (one node in the graph is one heavy atom in a ligand). Fig. 2 shows distributions of the number of heavy atoms in the used ligands, while Table 2 shows the features for a molecular graph representation of the ligands.
Fig. 2 Distribution of the number of heavy atoms in the ligands in Davis (top) and KIBA (bottom) datasets. The Y axis is log-transformed for the KIBA dataset. |
Name | Size |
---|---|
Node features | |
One-hot encoded atom type (C, N, O, F, S, Cl, Br, P, I) | 9 |
Atom mass (scaled by min–max) | 1 |
Number of directly bonded atom neighbours (scaled by min–max) | 1 |
Total number of bonded hydrogens (scaled by min–max) | 1 |
One-hot encoded atom hybridization (sp2, sp3) | 2 |
Is atom in a ring (1 – yes, 0 – no) | 1 |
Is atom aromatic (1 – yes, 0 – no) | 1 |
Is atom hydrophobic (1 – yes, 0 – no) | 1 |
Is atom metal (1 – yes, 0 – no) | 1 |
Is atom halogen (1 – yes, 0 – no) | 1 |
Is atom donor (1 – yes, 0 – no) | 1 |
Is atom acceptor (1 – yes, 0 – no) | 1 |
Is atom positively charged (1 – yes, 0 – no) | 1 |
Is atom negatively charged (1 – yes, 0 – no) | 1 |
Edge features | |
One-hot encoded bond type (single, double, triple, aromatic) | 4 |
Bond is conjugated (1 – yes, 0 – no) | 1 |
Bond is in the ring (1 – yes, 0 – no) | 1 |
The Morgan fingerprints and ligand graphs are available in the GitHub repository. The order of the features in the graphs is the same as in Table 2.
For the KIBA dataset, all the structures were obtained from the AlphaFold protein structure database (https://alphafold.ebi.ac.uk/). For the Davis dataset only single protein entries without mutations were downloaded from the AlphaFold database. For the rest of the entries, 3D structures were generated manually using an accelerated implementation of the AlphaFold algorithm in Google Colaboratory: ColabFold.20
To avoid undesirable noise from the parts of proteins, which have weak or no relation to the ligand binding, we have parsed domain annotations from UniProt16 to determine the ligand binding sites. Both datasets contain only the kinase enzymes and the ligands with kinase inhibitor activity. Consequently, only the domains with known kinase activity or related to the kinase activity (annotated by UniProt as a protein kinase, histidine kinase, PI3K/PI4K, PIPK, AGC-kinase, or CBS) were kept in the protein structures.
This preprocessing step not only decreased the noise in the data but also eliminated most residues with a low per-residue confidence score (pLDDT). In AlphaFold a pLDDT is a continuous scale from 0–100 (higher is better), which shows the quality of structure prediction. pLDDT lower than 70 emerges in predicted 3D structures if they are unstructured in physiological conditions or the amino acid sequence has low alignment depth13. The regions with such low scores should be treated with caution. Since the domains related to kinase activity are mostly well studied and available in databases of experimental protein structures, keeping only them and removing other regions improves the average pLDDT score. Fig. 3 shows the number of amino acids in processed PDB files (A) as well as the distribution of pLDDT scores (B).
Filtered 3D structures were converted into the residue-level graphs using Biopython v. 1.79 (ref. 21) and Pteros.22,23 Inspired by the Open Drug Discovery Toolkit,24 we have developed an approach for encoding protein properties in the graph edge features. An edge was created if two amino acids form an either covalent bond or a non-covalent contact within a particular distance cutoff. The edge features define the type of this connectivity (Table 3). This technique allowed reducing the size of the protein graph in terms of the number of edges compared to the conventional protein graph generation approaches that define the same distance threshold for all types of residue–residue interactions.
Bond type | Distance cutoff (Å) |
---|---|
Covalent | — |
Hydrophobic contacts | 4 |
Hydrogen bond | 3.5 |
Salt bridge | 4 |
Cation–pi interaction | 5 |
Perpendicular pi-stacking | 5 |
Parallel pi-stacking | 5 |
Such a graph is directed because of the hydrogen bonds, salt bridges, and cation–pi interactions, which are non-commutative and require specification of the roles for each of involved residues. For example, in the edge created by the hydrogen bond between nodes a and b, it is important to identify which node is a donor and which one is an acceptor. Thus, edge a–b is assigned edge features that are different from those of edge b–a. Similarly, the salt bridges require distinguishing between cationic residue and anionic residue nodes and the cation–pi interactions – between cationic and aromatic residues. In contrast, covalent bonds, pi-stacking and hydrophobic contacts are symmetric.
For each of the 20 standard amino acid types, we assigned seven characteristics AAPHY7 (ref. 25) and 23 BLOSUM62 values according to alignments of homologous protein sequences26 provided by GraphSol.27 In addition, the structure-dependent and sequence-dependent node and edge features were used (Table 4).
Name | Size |
---|---|
Node features | |
Solvent-accessible surface area (scaled by mean and standard deviation) | 1 |
Phi angle (in degrees, divided by 180) | 1 |
Psi angle (in degrees, divided by 180) | 1 |
One-hot encoded belonging to secondary structure (alpha helix, isolated beta-bridge residue, strand, 3–10 helix, turn, bend) | 6 |
AAPHY7 descriptors of a residue | 7 |
BLOSUM62 descriptors of a residue | 23 |
Phosphorylated (1 – yes, 0 – no, same for all nodes) | 1 |
Mutated (1 – yes, 0 – no, same for all nodes) | 1 |
Edge features | |
Covalent bond (1 – yes, 0 – no) | 1 |
Hydrophobic contact (1 – yes, 0 – no) | 1 |
Hydrogen bond (from donor to acceptor; 1 – yes, 0 – no) | 1 |
Hydrogen bond (from acceptor to donor; 1 – yes, 0 – no) | 1 |
Salt bridge (from cation to anion; 1 – yes, 0 – no) | 1 |
Salt bridge (from anion to cation; 1 – yes, 0 – no) | 1 |
Pi-cation interaction (from aromatic ring to cation; 1 – yes, 0 – no) | 1 |
Pi-cation interaction (from cation to aromatic ring; 1 – yes, 0 – no) | 1 |
Parallel pi-stacking (1 – yes, 0 – no) | 1 |
Perpendicular pi-stacking (1 – yes, 0 – no) | 1 |
As the two last node features for the protein graphs we added phosphorylation status and mutation status. Each of the two features is either 1 (phosphorylated/mutated) or 0 (non-phosphorylated/non-mutated) and is the same for each node in a protein graph. It is essential in the case of the Davis dataset due to the presence of protein entries with the same sequence but annotated as phosphorylated/non-phosphorylated or wild-type and mutant entries with internal tandem duplication (ITD) mutation that is hard to translate into sequence unambiguously. Ignoring this data would cause the situation when proteins with identical graph representation have different binding affinities to the same ligand.
Generated protein graphs are available in the GitHub repository. The order of features in the graph is the same as in Table 4.
We tuned 3DProtDTA for the single GNN type or the combination of several GNN types that provides the best cross-validation results on the benchmark dataset. The following GNN types were considered: Graph Attention Network that fixes the static attention problem (GAT),28 Graph Convolutional Network (GCN),29 Graph Isomorphism Network (GIN),30 Graph Isomorphism Network with incorporated edge features (GINE),31 and GCN for learning molecular fingerprints (GMF).32
Besides GNN types, the subjects to tuning included: the configuration of FC layers after 1D input data, GNN pooling output and final dense neural network; dropout rates; activation function for GNN layers; usage of batch normalisation; graph pooling type or combination of types.
The 3DProtDTA model was built with an open-source machine learning framework PyTorch33 and the GNNs were implemented using PyTorch Geometric.34
Similarity-based approaches KronRLS7 and SimBoost8 used a similarity matrix computed using Pubchem structure clustering server (Pubchem Sim, https://pubchem.ncbi.nlm.nih.gov) to represent ligands and the protein similarity matrix constructed with help of Smith–Waterman algorithm to represent targets.35 KronRLS uses the regularised least-square model while SimBoost is the gradient boosting machine-based method.
The DeepDTA method10 used 1D CNNs to process protein sequences and SMILES of the ligands. The GANsDTA11 proposed a semi-supervised GANs-based method to predict binding affinity using target sequences and ligand SMILES. The same initial protein and ligand representations were used in the DeepCDA9 method, where authors applied encoding by CNN and LSTM blocks. The GraphDTA12 authors proposed GNNs to process ligand graphs, while proteins were still encoded by CNN applied to sequences.
The mean squared error (MSE):
(2) |
The concordance index (CI):36
(3) |
(4) |
The rm2 index:37
(5) |
To assess our approach properly, we used the same train-test and cross-validation data split as in the baseline approaches. Specifically, KIBA15 and Davis14 datasets were divided into six equal parts in which one part is selected as the independent test set. The remaining parts of the dataset were used to determine the hyper-parameters via 5-fold cross-validation. All 5-fold training sets were used for model training. Subsequently, 5 trained models were applied to predict test set affinity. Finally, an average for each metric was calculated and compared to the baseline approaches. Train and test folds of the datasets were obtained from DeepDTA10 GitHub repository.
We tuned 3DProtDTA to choose the best GNN type or GNN types combination; the number of multi-head-attentions/size of output sample (number of output node features) in GNNs; usage of activation function after a GNN layer and type of the function (ReLU, Leaky ReLU, sigmoid); the configuration of FC layers; dropout rates; usage of batch normalisation; graph pooling type/types.
The tuning was performed with help of hyperparameter optimization software Optuna v. 3.0.3 (ref. 38) using the tree-structured Parzen estimator algorithm. We trained the model for 700 epochs and used a batch size of 32, the Adam optimiser with a learning rate of 0.0001, and the mean squared error loss function. The objective of the tuning was to minimise the MSE metric. The other metrics were used for testing the model performance but were not optimised during the training.
After the tuning by the tree-structured Parzen estimator algorithm, we ran a range of manual cross-validation experiments (after obtaining benchmarking results) keeping all the components in the best tuned architecture fixed except:
- The type of GNN architecture for a protein;
- The type of GNN architecture for a ligand;
- The type of graph pooling for both protein and ligand;
- Usage of ligand graph or Morgan fingerprint as the only ligand features.
These experiments were performed in order to assess in depth the impact of GNN architecture, graph pooling and ligand features. While changing the type of GNN model, we kept the size of output (or the number of attention heads in the case of GAT) equal or as close as possible to the tuned parameters.
The last set of experiments was conducted to assess how the 3DProtDTA performs on cold protein targets and, at the same time, to generalise over different kinase types. For that purpose, we annotated the proteins in both datasets using InterPro database entries.39,40 The number of proteins annotated by the top 20 entries can be found in Tables S6 and S7.† As expected, two major kinase groups in both datasets were tyrosine-protein kinases (InterPro entry IPR008266) and serine/threonine-protein kinases (InterPro entry IPR008271). Subsequently, we randomly selected 10 tyrosine kinases and 10 serine/threonine kinases from each benchmarking dataset and separated all the samples containing those proteins into cold target test datasets. The rest of the samples were filtered to create 3 training datasets (separately for Davis and KIBA benchmarks): (1) all the remaining samples without filtering; (2) the samples with all tyrosine kinases excluded; (3) the samples with all serine/threonine kinases excluded. Consequently, we obtained 3 models per benchmarking dataset: trained on all data, trained on the data without tyrosine kinases, and trained on the data without serine/threonine kinases. We collected the metrics from all cold target test splits.
Tables 5 and 6 compare obtained model performance metrics to the baseline approaches for Davis and KIBA datasets respectively. The best scores are shown in bold.
Approach | MSE | CI | rm2 |
---|---|---|---|
KronRLS | 0.379 | 0.871 | 0.407 |
SimBoost | 0.282 | 0.872 | 0.644 |
DeepDTA | 0.261 | 0.878 | 0.63 |
GANsDTA | 0.276 | 0.881 | 0.653 |
DeepCDA | 0.248 | 0.891 | 0.649 |
GraphDTA | 0.229 | 0.893 | 0.63 |
3DProtDTA | 0.184 | 0.917 | 0.722 |
Approach | MSE | CI | rm2 |
---|---|---|---|
KronRLS | 0.411 | 0.782 | 0.342 |
SimBoost | 0.222 | 0.836 | 0.629 |
DeepDTA | 0.194 | 0.863 | 0.673 |
GANsDTA | 0.224 | 0.866 | 0.675 |
DeepCDA | 0.176 | 0.889 | 0.682 |
GraphDTA | 0.139 | 0.891 | 0.673 |
3DProtDTA | 0.138 | 0.893 | 0.784 |
According to obtained results, 3DProtDTA considerably outperforms other approaches in terms of all 3 metrics on the Davis dataset. In the case of the KIBA dataset, the only metric that demonstrated significant improvement over competitors was rm2, while the two other metrics were comparable (but still superior) to the rivals.
The results demonstrate the nearly equal performance of molecular graph only and combined representation of ligands for the Davis dataset. Nevertheless, the combined representation is clearly superior to others in terms of the KIBA dataset.
We identified the molecular diversity in both datasets defined as 1 – Tanimoto similarity scores based on Morgan fingerprints with radius 3 and averaged for each pair of ligands in the dataset. They are equal to 0.882 and 0.892 for the Davis and KIBA datasets respectively. Despite comparable molecular diversity, the model performance on Davis and KIBA datasets is different when comparing graph only and combined representation. We attribute this to two factors: a different number of ligands (68 in Davis and 2111 in KIBA, Table 1) and a much wider distribution of the number of atoms in the molecules from the KIBA dataset (Fig. 2). The graph only representation is most likely insufficient to generalise over all the molecules in the KIBA dataset, especially, for the small cohort of ligands with more than 50 heavy atoms.
The model trained on the molecular graph and Morgan fingerprint together provided the best average MSE, CI, and rm2 index of both benchmark datasets (Tables S1–S5†). Therefore, we can suggest that these two ligand feature types contain some non-overlapping information useful for DTA prediction.
Fig. 6 demonstrates that add pooling is the best choice for the Davis dataset, while add pooling is comparable to the combined approach (mean + add + max poolings concatenated) in the KIBA dataset cross-validation results. The add pooling provides superior average MSE, CI, and rm2 index of both benchmark datasets (Tables S1–S5†).
Train dataset | Test, cold tyrosine kinases | Test, cold serine/threonine kinases | ||||
---|---|---|---|---|---|---|
MSE | CI | rm2 | MSE | CI | rm2 | |
All kinases | 0.305 | 0.880 | 0.536 | 0.311 | 0.889 | 0.509 |
No tyrosine kinases | 0.719 | 0.746 | 0.254 | 0.284 | 0.909 | 0.572 |
No serine/threonine kinases | 0.239 | 0.909 | 0.629 | 0.683 | 0.682 | 0.136 |
Train dataset | Test, cold tyrosine kinases | Test, cold serine/threonine kinases | ||||
---|---|---|---|---|---|---|
MSE | CI | rm2 | MSE | CI | rm2 | |
All kinases | 0.383 | 0.668 | 0.599 | 0.236 | 0.857 | 0.700 |
No tyrosine kinases | 0.904 | 0.634 | 0.155 | 0.238 | 0.858 | 0.693 |
No serine/threonine kinases | 0.359 | 0.729 | 0.625 | 0.589 | 0.713 | 0.277 |
The results show that the model trained on all kinase types performs quite similarly for both tyrosine and serine/threonine kinases in the Davis dataset. In contrast, the model trained on all kinase types of the KIBA dataset demonstrates considerably better performance for serine/threonine kinases. We expected the opposite result because the rate of serine/threonine kinases to tyrosine kinases is roughly 2:1 in the KIBA dataset and 3:1 in the Davis dataset (Tables S6 and S7†).
The models trained without tyrosine kinases demonstrate the best or nearly the best performance on the cold serine/threonine kinases test and vice versa. Their performance is better than for the models trained on all kinase types. This means that adding new kinase types to the dataset may have a negative impact on the model inference for other kinase types. A potential solution to this issue is training a multi-task model, which will contain common layers for all kinase types in the beginning and separate layers for each kinase type at the end. Development and testing of such a model are out of the scope of this work, however.
The models trained without a particular type of kinases perform significantly worse on the test with omitted types of kinases. This is expectable behaviour, which emphasises that the models perform the best for protein types, which were present in the training set.
Another straightforward improvement is the usage of atomic-level protein graphs instead of residue-level ones, and the usage of more comprehensive node and edge features. Particularly, the B-factors and other measures of protein flexibility, such as cross-correlation matrices of motions, could be used.
It is necessary to note that there are other ML-based approaches to predict drug-target affinity, which were not included in this study, such as CSatDTA,41 FusionDTA,42 NerLTR-DTA,43 Masashi's method,44 WGNN-DTA,45 etc. In these techniques, the experimental setups and data split into training and testing datasets are either different from what we use in the current work or not specified in enough detail. Some of them also utilise specific evaluation metrics, which prevent their direct comparison with other methods.
Our attempts to re-train some of these models using our data split had failed. For example, there is no source code available for NerLTR-DTA, while provided binary crashes. Since the bug reports on GitHub have not been answered for several years, we decided that further attempts to use this technique are futile. The authors of the FusionDTA do not report the amount of computational resources required for the model training. According to our internal estimates, this model is much heavier than 3DProtDTA and its proper re-training is prohibitively long and expensive on our computational facilities. In addition, FusionDTA utilises different hyperparameters for each dataset, which diverges from our concept of the universal model architecture and hyperparameters for all datasets.
Therefore, we decided to limit the scope of our work to the methods, which possess identical setups, datasets and performance metrics and thus allow the apple-to-apple comparison without re-training the third-party ML models.
Footnote |
† Electronic supplementary information (ESI) available: The best model architectures, tuned hyperparameters for the benchmark, detailed data on manual cross-validation experiments, InterPro entries statistics in the datasets, and protein residue-level graphs similarity between AlphaFold and experimental structures. See DOI: https://doi.org/10.1039/d3ra00281k |
This journal is © The Royal Society of Chemistry 2023 |