Natchanon Suviriyapaisala and
Duangdao Wichadakul*ab
aDepartment of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, 10330, Thailand. E-mail: duangdao.w@chula.ac.th
bCenter of Excellence in Systems Biology, Faculty of Medicine, Chulalongkorn University, Bangkok, 10330, Thailand
First published on 24th August 2023
Artificial intelligence has become prevalent across broad fields, including drug discovery, where the process is costly and time-consuming when conducted through wet-lab experiments. As a result, drug repurposing, which seeks new uses for approved, low-risk drugs, has become more attractive. However, screening candidates from many drugs for specific protein targets is still expensive and tedious. This study leverages computational resources to aid drug discovery by utilizing drug-protein interaction data and estimating their interaction strength, the so-called binding affinity. Our estimation approach addresses multiple challenges encountered in the field. First, we employed a graph-based deep learning technique to overcome the limitations of drug compounds represented in string format, incorporating background knowledge of node and edge information as separate multi-dimensional features. Second, we tackled the complexities of extracting protein representation and structure by utilizing a pre-trained model for feature extraction, and we applied graph operations over the 1D representation of a protein sequence to overcome the fixed-length problem typically encountered in language-model tasks. In addition, we conducted a comparative analysis against a baseline model that builds a protein graph from a contact map prediction model, giving valuable insight into the performance and effectiveness of our proposed method. We evaluated our model on the same benchmark datasets and with the same variety of metrics as previous work, and the results show that our model achieved the best prediction results while requiring no contact map information, in contrast to other graph-based methods.
With the evolution of machine learning, several computational techniques and strategies have demonstrated their effectiveness for various applications, including predicting drug-target binding affinity (DTA),4,5 the task of predicting a score that indicates the strength of a drug-target pair interaction and can be used to estimate how well a candidate drug binds a target protein. As a result, binding affinity has become a criterion for selecting candidate compounds, speeding up the entire drug development process. Furthermore, the compounds predicted by DTA can be carried into further experiments to obtain a final result, rather than developing a new drug from scratch, which incurs extremely high time and financial costs.6,7
Although early computational models for predicting drug-protein interactions were built on manually extracted drug and target properties, they required extensive biological and chemical knowledge. Because manual extraction represents drugs and proteins as text sequences, some information, such as structural information, was lost. While proteins with known structures carry crucial information, such as docking sites, that computational techniques can exploit, proteins with unknown structures, for which little is available beyond the sequence, remain challenging.
In earlier research, DTA prediction was often treated as a binary classification task8–10 that predicts whether a drug and a target protein interact. However, to make the problem more realistic, most recent works predict an affinity score that quantifies the strength of the interaction, turning it into a regression task.
Recently, several machine learning models have shown remarkable results in DTA prediction. For example, the early work DeepDTA11 applied a convolutional neural network (CNN) to DTA prediction, using 1-dimensional (1D) convolutions to capture patterns in drug and protein sequences, then passing those representations through several hidden layers and regressing to obtain the DTA scores. On the other hand, MT-DTI12 took advantage of the natural language processing (NLP) technique known as the Transformer to extract molecular representations by treating a molecule sequence as text, following the same scheme as DeepDTA to obtain a prediction.
To acquire a better data representation, GraphDTA13 proposed graph-based representation learning of drug molecules, capable of extracting structural information and enhancing predictive performance, while still representing a protein with a 1D-CNN network. DGraphDTA14 extended GraphDTA by defining both drug molecules and protein sequences as 2D graphs, employing a protein structure prediction approach called contact map prediction to infer protein structure. The authors of GraphDTA also proposed GEFA,15 an early fusion method that combines drug and protein features at an early stage before extracting their representations; they contrasted GEFA with GraphDTA, a late fusion method that learns the representations of drug and protein separately. In addition to DGraphDTA, the same authors also released WGNN,16 which improved the representation using weighted graphs whose weights were determined from the contact strength of the contact map prediction.
Another graph-CNN method, MGraphDTA,17 used a multi-scale technique that helps a model capture the local and global structures of a compound or a protein simultaneously by learning a variety of scales, from shallow to deep, and aggregating features from all scales into one multi-scale feature for each drug and protein. Apart from graph-based methods, SMT-DTA18 achieved promising results on this task using a transformer-based model trained with a semi-supervised technique, setting one task as a masked language model and another as the DTA task, with parameters shared between the two.
However, these models frequently describe drug and protein structures while ignoring the relationships and information between nodes, which is not a natural way to represent compounds or proteins formed by atoms and bonds. In this paper, we propose iEdgeDTA, an approach that predicts binding affinity utilizing a variant of the graph neural network (GNN) known as the graph convolutional network (GCN)19 and improves it by including multi-dimensional edge features in the encoding space of the compound. We also used a pre-trained model to extract node embeddings as protein node (amino acid) representations fed to the GNN, constructing a 1D graph from the sequence and incorporating the global feature of the protein. Notably, the whole process uses only SMILES and protein sequences.
The main contributions of this study can be summarized as follows.
• We enhanced the drug molecular representation by adding edge information between the nodes of a drug molecule into the graph operation.
• We introduced a 1D sequential graph and pseudo-structure to represent the protein sequence and utilized a GNN to overcome the fixed-length problem.
• We extracted protein features from a sizeable pre-trained model and used them to obtain a global feature, so our method uses both node-level and sequence-level features to learn a variety of scales and obtain a better representation.
The Davis dataset contains 30056 drug-target interactions of 68 inhibitors and 442 kinase proteins, covering more than 80% of the human catalytic protein kinome. It uses the dissociation constant (Kd) to describe the strength of the interaction. As in DeepDTA and other works, we transformed the Kd value into a log-scale value called pKd (see DeepDTA11), as sketched below.
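The transformation is simple enough to show inline. The following is a minimal sketch (the function name is ours; 10000 nM is the placeholder Kd commonly reported for non-binding pairs in Davis):

```python
import numpy as np

def kd_to_pkd(kd_nm):
    """Convert Kd in nanomolar to the log-scale pKd regression target:
    pKd = -log10(Kd / 1e9)."""
    return -np.log10(np.asarray(kd_nm) / 1e9)

# The 10000 nM placeholder used in Davis for non-binding pairs maps to 5.0:
print(kd_to_pkd([10000.0]))  # [5.]
```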
The KIBA dataset also comprises kinase inhibitor interactions but was constructed from different bioactivity sources, including Ki, Kd, and IC50. The KIBA score was derived by optimizing these values, and we used it as the binding affinity for this task. The original KIBA database contains a bioactivity matrix of 52498 compounds and 467 targets with 246088 observations; He et al.22 removed all drugs and proteins with fewer than ten interactions to obtain the filtered dataset used in this work, which comprises 118254 drug-target interactions between 2111 unique drugs and 229 unique proteins.
Feature | Size
---|---
Node features | (66)
Atomic symbol | 18
Atomic chiral tag | 4
Degree of atom | 8
Formal charge | 11
Number of explicit and implicit Hs | 7
Number of radical electrons | 8
Atom hybridization | 8
Is aromatic | 1
Is in ring | 1
Edge features | (18)
Bond type | 4
Bond direction | 7
Bond stereo configuration | 6
Bond is conjugated | 1
We employed multi-dimensional edge features in this study to pass additional information to each convolution layer. The edge characteristics comprise bond type, bond stereo, and stereo atoms, and the drug compound graph is symmetric, representing a bidirectional graph in our configuration. To capture comprehensive information, including both the neighborhood and the individual node itself, we added self-loops to all nodes in the graph. These self-loops ensure that the model learns a state embedding containing both the surrounding nodes' context and the node's own features.
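A minimal sketch of this graph construction follows, assuming hypothetical `atom_featurizer` and `bond_featurizer` helpers that produce the 66- and 18-dimensional vectors of the feature table above (not the authors' exact code):

```python
import torch
from rdkit import Chem
from torch_geometric.data import Data

def smiles_to_graph(smiles, atom_featurizer, bond_featurizer):
    """Build a symmetric drug graph from SMILES; each bond is stored in
    both directions with the same edge feature vector."""
    mol = Chem.MolFromSmiles(smiles)
    x = torch.tensor([atom_featurizer(a) for a in mol.GetAtoms()],
                     dtype=torch.float)
    src, dst, feats = [], [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        f = bond_featurizer(bond)
        src += [i, j]; dst += [j, i]   # both directions
        feats += [f, f]                # identical feature on both
    return Data(x=x,
                edge_index=torch.tensor([src, dst], dtype=torch.long),
                edge_attr=torch.tensor(feats, dtype=torch.float))
```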
Many proteins have unknown structures, and extracting structural information requires a very complex model, such as AlphaFold,27 which is resource-intensive if a precise result is needed. In addition, prior work learned patterns from protein sequences using CNNs or NLP, whereas structural models relied on protein structures predicted by other trained models. This paper proposes a novel way to solve the fixed-length problem of CNN and NLP approaches using a GNN. First, we establish a pseudo-structure in which each amino acid is a node, with edges to its adjacent nodes in the sequence. This pseudo-structure allows us to apply graph message-passing operations to a sequence string. We illustrate the protein graph construction in Fig. 2.
Fig. 2 Construction of the protein 1D graph representation, where the amino acid-level features (node features) and the sequence-level feature (global feature) are extracted from the pre-trained model.
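As a sketch, the connectivity of this pseudo-structure can be built directly from the sequence length (an illustrative helper of ours, using the PyTorch Geometric `edge_index` convention):

```python
import torch

def protein_chain_graph(seq_len: int) -> torch.Tensor:
    """Build the bidirectional edge_index of the 1D pseudo-structure:
    each residue i is connected to its sequence neighbors i-1 and i+1."""
    src = torch.arange(seq_len - 1)
    dst = src + 1
    # Store both directions so message passing is effectively undirected.
    return torch.stack([torch.cat([src, dst]), torch.cat([dst, src])])
```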
Since the ESM constrains sequence length to fewer than 1024 amino acids, while about 20–25% of our protein sequences comprise more than 1024 amino acids (ESI Fig. S1†), we must approximate the features of long-sequence proteins. Our approach divides a protein sequence into overlapping sub-sequences with a fixed length of 500 (window) and a step of 5 (stride), so each sub-sequence shares 495 amino acids with the next for a better approximation; the final sequence feature is the average over all sub-sequences, as described in Algorithm 1 and depicted in Fig. 3.
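The following sketch shows one plausible reading of Algorithm 1, where `embed_fn` is a stand-in for the pre-trained encoder (mapping a sub-sequence to a per-residue feature array) and overlapping windows are averaged position-wise:

```python
import numpy as np

def long_sequence_feature(seq, embed_fn, window=500, stride=5):
    """Approximate per-residue features of a protein longer than the
    encoder's 1024-token limit by averaging overlapping windows."""
    dim = embed_fn(seq[: min(len(seq), window)]).shape[1]
    total = np.zeros((len(seq), dim))
    counts = np.zeros((len(seq), 1))
    for start in range(0, max(len(seq) - window, 0) + 1, stride):
        piece = embed_fn(seq[start:start + window])
        total[start:start + len(piece)] += piece
        counts[start:start + len(piece)] += 1
    # Guard against tail positions that no window happened to cover.
    return total / np.maximum(counts, 1)
```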
In our study, we conducted experiments with two pre-trained models, namely ProtTrans28 and Evolutionary Scale Modeling (ESM).26 The findings of these experiments are presented in the Module analysis section. Both are self-supervised transformer-based models with masking, in which the input is corrupted by randomly substituting amino acids with a special token; the models are then trained to predict the missing tokens from the corrupted sequence as the pre-training task.
ProtTrans28 is a protein language model adopted from NLP. It has many variants; the one we experimented with was the T5-XL model, an encoder-decoder model trained on the BFD dataset29,30 and fine-tuned on the UniRef50 dataset,31 which contain 2122 million and 45 million protein sequences, respectively. The T5-XL model extracts protein features by employing a transformer model trained on token-level and protein-level classification tasks. We extracted the embedding features, with 1024 dimensions per residue, from the last hidden layer of the transformer model before its task-specific heads.
ESM26 is also a transformer-based language model, developed by Facebook research and trained solely on the UniRef90 dataset,32 which comprises 98 million diverse protein sequences; it produces 1280-dimensional embedded features per residue. However, ESM restricts sequence length, as it was trained with a fixed context size of 1024 tokens for positional embedding. Therefore, for protein sequences longer than 1024 amino acids, the authors randomly cropped the sequence to 1024 tokens at every training epoch to obtain sample sequences.
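For illustration, per-residue features can be extracted with the fair-esm package roughly as follows (a sketch, not the authors' pipeline; the checkpoint name assumes the 650M ESM-1v model, whose final layer is layer 33):

```python
import torch
import esm  # fair-esm package

# Load ESM-1v (650M) and its tokenizer.
model, alphabet = esm.pretrained.esm1v_t33_650M_UR90S_1()
batch_converter = alphabet.get_batch_converter()
model.eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
_, _, tokens = batch_converter([("query", seq)])
with torch.no_grad():
    out = model(tokens, repr_layers=[33])
# Drop the BOS/EOS tokens to obtain a (len(seq), 1280) per-residue matrix.
node_features = out["representations"][33][0, 1 : len(seq) + 1]
```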
In this study, our focus is on predicting affinity scores, which are continuous values. To accomplish this, we employed GCN layers for feature extraction to produce node-level representations of both drug and protein, used sequence-level features to create a whole-protein-sequence representation, and applied fully connected layers to gather the information and produce the predicted affinity scores.
Following the standard GCN propagation rule,19 each layer updates the node features as

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right) \quad (1)$$

where $\tilde{A} = A + I$ is the adjacency matrix with self-loops, $\tilde{D}$ is its degree matrix, $H^{(l)}$ is the node feature matrix at layer $l$, $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is a non-linear activation. Our edge-aware variant is

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}\left(H^{(l)} + EW_e^{(l)}\right)W^{(l)}\right) \quad (2)$$
in which $H^{(l)}$ in eqn (1) is replaced with $H^{(l)} + EW_e^{(l)}$, where $E$ is the multi-dimensional edge feature matrix and $W_e^{(l)}$ is a learnable parameter that transforms the edge feature dimension into the node feature dimension. Note that edges in the drug graph are undirected, which means connected nodes utilize the same edge feature for message propagation.
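A minimal PyTorch Geometric sketch of eqn (2), written as a custom message-passing layer (our own illustrative implementation, not the authors' code):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, degree

class EdgeGCNConv(MessagePassing):
    """Sketch of eqn (2): shift each neighbor's state by a learned
    projection of the incoming edge feature, then apply the usual
    symmetrically normalized GCN propagation."""

    def __init__(self, in_channels, edge_channels, out_channels):
        super().__init__(aggr="add")
        self.lin = nn.Linear(in_channels, out_channels, bias=False)        # W^(l)
        self.edge_lin = nn.Linear(edge_channels, in_channels, bias=False)  # W_e^(l)

    def forward(self, x, edge_index, edge_attr):
        # Self-loops carry no bond, so their edge features are zero-filled.
        edge_index, edge_attr = add_self_loops(
            edge_index, edge_attr, fill_value=0.0, num_nodes=x.size(0))
        row, col = edge_index
        deg = degree(col, x.size(0), dtype=x.dtype)
        deg_inv_sqrt = deg.pow(-0.5)
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]  # D^-1/2 A D^-1/2, per edge
        return self.propagate(edge_index, x=x, edge_attr=edge_attr, norm=norm)

    def message(self, x_j, edge_attr, norm):
        # (H^(l) + E W_e^(l)) W^(l), normalized per edge.
        return norm.view(-1, 1) * self.lin(x_j + self.edge_lin(edge_attr))
```

Because each bond is stored in both directions with the same feature vector, both endpoints receive identical edge information, matching the undirected-graph note above.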
In our architecture, we transformed drug SMILES and protein sequences into graphs before feeding them to two GCNs, each a three-layer convolutional network that extracts a representation from its graph. A global pooling layer is placed after each GCN so that graph-level representations of the same dimension are obtained. The pooling operation (sum, mean, or max) may be regarded as a hyperparameter; in this work, we used global mean pooling, eqn (3), where $x$ is the feature matrix of the nodes in the GCN output, $N$ is the number of nodes in the sequence, and $r$ is the pooled representation of size $(1, F)$, with $F$ the number of output channels of the last layer.
$$r = \frac{1}{N}\sum_{n=1}^{N} x_n \quad (3)$$
We added batch normalization (BatchNorm)33 after every output layer in the fully connected layers and graph normalization (GraphNorm)34 after the GCN layers, each activated by the Rectified Linear Unit (ReLU) function. We then concatenated the drug's latent feature, the protein's latent feature, and the protein's global feature and passed them through two fully connected layers to predict the affinity score, eqn (4).
$$\hat{y} = \mathrm{FC}_2\!\left(\mathrm{ReLU}\!\left(\mathrm{FC}_1\!\left(\left[r_{\mathrm{drug}} \,\Vert\, r_{\mathrm{protein}} \,\Vert\, g_{\mathrm{protein}}\right]\right)\right)\right) \quad (4)$$

where $\Vert$ denotes concatenation, $r_{\mathrm{drug}}$ and $r_{\mathrm{protein}}$ are the pooled latent features, and $g_{\mathrm{protein}}$ is the protein's global feature.
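A sketch of this prediction head under assumed dimensions (the global feature size of 1280 matches the ESM embedding; the hidden size is hypothetical):

```python
import torch
import torch.nn as nn

class AffinityHead(nn.Module):
    """Concatenate drug latent, protein latent, and protein global features,
    then regress the affinity with two fully connected layers + BatchNorm."""

    def __init__(self, drug_dim=128, prot_dim=128, global_dim=1280, hidden=1024):
        super().__init__()
        self.fc1 = nn.Linear(drug_dim + prot_dim + global_dim, hidden)
        self.bn1 = nn.BatchNorm1d(hidden)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, r_drug, r_prot, g_prot):
        z = torch.cat([r_drug, r_prot, g_prot], dim=-1)
        z = torch.relu(self.bn1(self.fc1(z)))
        return self.fc2(z).squeeze(-1)
```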
We optimized our model with the AdamW35 algorithm, which improves the regularization of Adam by decoupling weight decay from the gradient update. We set the weight decay to 0.01 for both datasets to prevent overfitting. In addition, we used learning rate decay, starting from 0.001 and scaling down to 80% every 100 epochs, to improve the learning of complex patterns in later iterations.
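This setup maps directly onto standard PyTorch components. A sketch of the training loop (`model` and `train_loader` are assumed to exist, with affinity labels in `batch.y`):

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.8)

for epoch in range(1000):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = F.mse_loss(model(batch), batch.y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # lr <- 0.8 * lr every 100 epochs
```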
We tuned most of the hyperparameters manually and adopted the others from prior work, as shown in Table 2, because we employed cross-validation and the training process took a significant amount of time to complete.
Hyperparameter | Setting
---|---
Epochs | 1000
Optimizer | AdamW
Batch size | 128
Learning rate | 0.001
Weight decay | 0.01
Dropout rate | 0 (see batch normalization)
Latent size after GCN | 128
We evaluated the models with the mean squared error (MSE), eqn (5), which measures the average squared difference between the predicted values $p$ and the ground-truth values $y$ over $n$ samples, and the concordance index (CI), eqn (6), which measures whether the predicted ordering of each comparable pair matches the ground-truth ordering.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(p_i - y_i\right)^2 \quad (5)$$

$$\mathrm{CI} = \frac{1}{Z}\sum_{y_i > y_j} h\!\left(p_i - p_j\right) \quad (6)$$

In eqn (6), $Z$ is the number of comparable pairs, and $h(x)$ is a step function that returns 1 if $x > 0$, 0.5 if $x = 0$, and 0 otherwise.
Another metric used by some works is the Pearson correlation coefficient, eqn (7), which measures the strength of the linear relationship between two variables, the predicted and ground-truth values in our case, where cov is the covariance between the predicted value $p$ and the actual value $y$, and $\sigma(p)$ and $\sigma(y)$ denote their standard deviations.
$$\mathrm{Pearson} = \frac{\mathrm{cov}(p, y)}{\sigma(p)\,\sigma(y)} \quad (7)$$
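All three metrics are straightforward to compute. A reference sketch (the CI is written as an O(n²) double loop for clarity, not efficiency):

```python
import numpy as np
from scipy import stats

def mse(p, y):
    p, y = np.asarray(p), np.asarray(y)
    return float(np.mean((p - y) ** 2))

def concordance_index(p, y):
    """Fraction of comparable pairs (y_i > y_j) whose predictions are
    correctly ordered; prediction ties count as 0.5."""
    p, y = np.asarray(p), np.asarray(y)
    num = den = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                den += 1
                num += 1.0 if p[i] > p[j] else (0.5 if p[i] == p[j] else 0.0)
    return num / den

def pearson(p, y):
    return float(stats.pearsonr(p, y)[0])
```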
Tables 3 and 4 show the benchmarking results of our model against other methods that represent drug compounds and proteins differently, including CNN-based, graph-based, and transformer-based models. Our technique combining Edge-GCN and 1D-GCN achieved considerable improvements in MSE and CI on the Davis dataset, reaching an MSE of 0.216, a CI of 0.897, and a Pearson correlation of 0.855 on the independent test set. On the KIBA dataset, our model outperformed the other methods with the lowest MSE and the highest Pearson correlation. It also had the highest CI score among the graph-based methods (tied with MGraphDTA), while SMT-DTA, a transformer-based model, had the highest overall CI score at 0.894. Our model is highly effective compared to the graph-based methods, e.g., GEFA, which previously achieved the lowest MSE of 0.228 on Davis. Notably, our model uses no protein structure, unlike existing graph-based methods that extract protein graphs from contact map predictions, and still obtains better results. This may imply that the pre-trained model, i.e., ESM, has learned how proteins are formed or has captured some structural information.
Model | Compound | Protein | MSE (std) | CI (std) | Pearson (std)
---|---|---|---|---|---
DeepDTA11 | CNN | CNN | 0.261 | 0.878(0.004) | —
MT-DTI12 | Transformer | CNN | 0.245 | 0.887(0.003) | —
GraphDTA13,a | GIN | CNN | 0.251(0.003) | 0.882(0.003) | —
DGraphDTA14,a | GCN | GCN(contact) | 0.238(0.005) | 0.888(0.004) | 0.840(0.003)
GEFA15 | GCN | GCN(contact) | 0.228 | 0.893 | 0.846
WGNN16,a | GCN | WGCN(contact) | 0.244(0.004) | 0.888(0.002) | 0.837(0.002)
MGraphDTA17,a | MGNN | MCNN | 0.233(0.005) | 0.885(0.004) | 0.843(0.004)
SMT-DTA18 | Transformer | Transformer | 0.219 | 0.890 | —
Our | Edge-GCN | 1D-GCN | 0.216(0.004) | 0.897(0.001) | 0.855(0.002)

a These results were taken from our reproduction.
Model | Compound | Protein | MSE (std) | CI (std) | Pearson (std)
---|---|---|---|---|---
DeepDTA11 | CNN | CNN | 0.194 | 0.863(0.002) | —
MT-DTI12 | Transformer | CNN | 0.152 | 0.882(0.001) | —
GraphDTA13,a | GAT&GCN | CNN | 0.186(0.009) | 0.871(0.001) | —
DGraphDTA14,a | GCN | GCN(contact) | 0.148(0.002) | 0.889(0.002) | 0.885(0.002)
MGraphDTA17,a | MGNN | MCNN | 0.150(0.004) | 0.890(0.002) | 0.883(0.003)
SMT-DTA18 | Transformer | Transformer | 0.154 | 0.894 | —
Our | Edge-GCN | 1D-GCN | 0.139(0.001) | 0.890(0.001) | 0.892(0.001)

a These results were taken from our reproduction.
We illustrate the impact of batch normalization in Fig. 6. During the early epochs, before the red dashed line, the model incorporating batch normalization converges faster than the model without it. In the later epochs, however, the mean squared error (MSE) of the model without batch normalization decreases further, indicating a better fit to the training set. Nevertheless, the validation results depicted in Fig. 7 reveal that the model without batch normalization performs worse than the other models, suggesting an overfitting issue and showing how batch normalization helps the model generalize. In summary, by incorporating batch normalization into our model, we achieved additional regularization without a dropout layer, expedited the training process, and improved overall performance (see ESI Fig. S3† for the performance of the dropout layer).
Fig. 6 Training performance of the model with different batch normalization settings; the model without batch normalization (blue) achieves the best performance on the training set (lower is better).
Model | MSE (std) | CI (std)
---|---|---
Baseline (DGraphDTA) | 0.212(0.006) | 0.885(0.004)
Pre-trained choice | |
ProtT5 | 0.206(0.008) | 0.895(0.005)
ESM-1v | 0.202(0.007) | 0.897(0.005)
Edge feature | |
ESM-1v w/o edge feature | 0.202(0.007) | 0.897(0.005)
ESM-1v w/ edge feature | 0.198(0.006) | 0.899(0.002)
Protein global feature | |
ESM-1v w/ edge feature | 0.198(0.006) | 0.899(0.002)
iEdgeDTA (ESM + edge + global) | 0.195(0.005) | 0.901(0.002)
The findings indicated that both pre-trained models performed better than the baseline model (DGraphDTA), which relied on manually crafted features for protein representation. However, ESM-1v provided superior performance to ProtT5, with an MSE of 0.201 and a CI of 0.898. As a result, we selected ESM-1v as our pre-trained model for the final evaluation.
The results indicated that adding edge features to only the drug branch improved the performance to an MSE of 0.198 and a CI of 0.899. We then assessed the protein's global feature and found that it also helped improve the prediction, lowering the MSE to 0.195 and raising the CI to 0.901 in the best configuration.
Besides, we compared our edge feature integration method to an existing method, RGCN, which employs independent weights for each relation type rather than sharing weights across all nodes and edges. In this task, the bond type is passed as the relation type, consisting of single bond, double bond, triple bond, aromatic bond, and self-loop, resulting in five distinct weight matrices. As shown in Table 6, utilizing RGCN does not increase model performance on the five-fold cross-validation set. This may be because our approach leverages multi-dimensional edge features, whereas RGCN accepts only a single categorical feature as the relation type.
Model | Edge feature | MSE (std)
---|---|---
RGCN | Relation | 0.201(0.006)
Our (GCN) | None | 0.201(0.006)
Our (Edge-GCN) | Multi-dimensional | 0.198(0.006)
Furthermore, we incorporated our Edge-GCN approach into our baseline model, DGraphDTA, to assess the efficacy of integrating edge features into drug representations. In this experiment, we replaced the original GCN layers in the drug branch with our Edge-GCN layers, using the same edge features introduced in this paper. The results, shown in Table 7, reveal that introducing edge features to the graph neural network improves both the MSE and CI scores.
Model | MSE (std) | CI (std)
---|---|---
DGraphDTA | 0.212(0.007) | 0.888(0.004)
w/Edge-GCN | 0.207(0.006) | 0.895(0.002)
To assess the individual contributions of node-level and sequence-level features to protein feature extraction, we selected a CNN architecture as our baseline because of its prevalence in prior studies and its suitability for assessing how well our 1D-GCN addresses the fixed-length problem. We set the hyperparameters of the CNN following the GraphDTA13 work; specifically, we fixed the protein sequence length at 1000 amino acids, padding shorter sequences and truncating longer ones. Table 8 presents the results of this experiment, showing the performance of the node-level and sequence-level features as standalone versions. These results demonstrate the improvement achieved by each feature independently and highlight their respective contributions. Moreover, integrating both node-level and sequence-level features within the 1D-GCN model yielded significant enhancements, surpassing the baseline model. This outcome shows the benefit of combining these features, further improving prediction accuracy.
Model | Method | MSE (std) | CI (std)
---|---|---|---
CNN baseline | CNN | 0.210(0.007) | 0.891(0.003)
Node feature | 1D-GCN | 0.198(0.005) | 0.899(0.002)
Global feature | Linear | 0.202(0.005) | 0.896(0.003)
iEdgeDTA | 1D-GCN + Linear | 0.195(0.005) | 0.901(0.002)
Contact map model | MSE (std) | CI (std)
---|---|---
DGraphDTA | |
pconsc4 | 0.212(0.006) | 0.885(0.004)
1D-GCN | 0.211(0.007) | 0.893(0.001)
iEdgeDTA | |
pconsc4 | 0.197(0.005) | 0.901(0.003)
RaptorX | 0.196(0.006) | 0.899(0.004)
ESM | 0.198(0.004) | 0.900(0.004)
1D-GCN | 0.195(0.005) | 0.901(0.002)
Our findings revealed that the performance of the 1D-GCN setting is on par with that of a model utilizing a contact map. Furthermore, the 1D-GCN approach exhibited slightly superior performance, in both our work and the DGraphDTA experiment, compared to contact maps derived from prediction models. These results indicate that the structural information obtained from contact map predictions may contain noise that degrades model performance, suggesting that enhanced performance requires either a genuine contact map or highly accurate predictions. Nonetheless, our comparative evaluation highlights that the 1D-GCN approach delivers similar performance without using contact maps at all. Moreover, experimentally determined contact maps are a limited resource: generating them is difficult and time-consuming, requiring specific equipment, specialized knowledge, and substantial resources, which makes them less accessible and confines them to specific research facilities. Many strategies employ machine learning algorithms or statistical methods to infer residue–residue contacts from the protein sequence alone; however, due to the complexity of protein folding, accurately predicting contact maps remains a challenging problem that requires substantial computational resources. Our model therefore holds promise for real-world applications, such as high-throughput virtual screening, using only protein sequences.
However, iEdgeDTA does have limitations. First, the 1D representation poses a drawback: the protein's folding determines the binding pocket, and the 1D model does not guarantee that the binding pocket corresponds to a series of adjacent amino acids in the sequence string. Therefore, visualizing or predicting the binding pocket solely from the 1D representation remains a challenging problem. Second, the pre-trained model may introduce noise when extracting features of long protein sequences approximated by Algorithm 1. A more capable pre-trained model that can handle arbitrary sequence lengths would reduce this sensitivity to noise and may improve performance.
Footnote
† Electronic supplementary information (ESI) available: detailed drug features, performance of iEdgeDTA evaluated in the same setting as the reproduced previous work, dataset distribution, graphical explanation of cross-validation and testing, and performance of dropout layers. See DOI: https://doi.org/10.1039/d3ra03796g