Hengwei Chen, Martin Vogt and Jürgen Bajorath*
Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany. E-mail: bajorath@bit.uni-bonn.de; Fax: +49-228-7369-100; Tel: +49-228-7369-100
First published on 28th October 2022
Activity cliffs (ACs) are formed by pairs of structurally similar or analogous active small molecules with large differences in potency. In medicinal chemistry, ACs are of high interest because they often reveal structure–activity relationship (SAR) determinants for compound optimization. In molecular machine learning, ACs provide test cases for predictive modeling of discontinuous (non-linear) SARs at the level of compound pairs. Recently, deep neural networks have been used to predict ACs from molecular images or graphs via representation learning. Herein, we report the development and evaluation of chemical language models for AC prediction. It is shown that chemical language models learn structural relationships and associated potency differences to reproduce ACs. A conditional transformer termed DeepAC is introduced that accurately predicts ACs on the basis of small amounts of training data compared to other machine learning methods. DeepAC bridges between predictive modeling and compound design and should thus be of interest for practical applications.
For a non-ambiguous and systematic assessment of ACs, similarity and potency difference criteria must be clearly defined.2,3 Originally, molecular fingerprints (that is, bit string representations of chemical structure) were used as molecular representations to calculate the Tanimoto coefficient,4 a whole-molecule similarity metric, for identifying similar compounds forming ACs.2 Alternatively, substructure-based similarity measures have been adapted for defining ACs; these measures have become increasingly popular in medicinal chemistry because they are often chemically more intuitive than calculated whole-molecule similarity.3 For example, a widely used substructure-based similarity criterion for AC analysis is the formation of a matched molecular pair (MMP), defined as a pair of compounds that are only distinguished by a chemical modification at a single site.5 Thus, MMPs can be used to represent pairs of structural analogues, which explains their popularity in medicinal chemistry. Moreover, MMPs can also be identified algorithmically in an efficient manner.5 Although statistically significant potency differences for ACs can be determined for individual compound activity classes,6 a potency difference threshold of at least two orders of magnitude (100-fold) has mostly been applied for the systematic assessment of ACs and computational modeling.2,3
While medicinal chemistry campaigns encounter ACs on a case-by-case basis, systematic compound database analysis has identified ACs across different compound activity classes, providing a wealth of SAR information.2,7 Here, computational and medicinal chemistry meet. With rapidly increasing numbers of publicly available bioactive compounds, AC populations have also grown over time.3 However, the rate at which ACs are formed across different activity classes has essentially remained constant. Only ∼5% of pairs of structural analogues sharing the same activity form ACs across different activity classes.3,7 Thus, as expected for compounds representing the pinnacle of SAR discontinuity, structural analogues rarely form ACs.
Systematic identification of ACs across activity classes has also provided the basis for computational predictions of ACs. For machine learning, AC predictions generally present a challenge, for three reasons. First, as discussed, the underlying SARs that need to be accounted for are highly discontinuous; second, data sets of ACs and non-ACs are unbalanced; third, predictions need to be made at the level of compound pairs rather than individual compounds, as is usually the case in compound classification or molecular property prediction. Initial attempts to predict ACs were reported a decade ago.8,9 ACs were first accurately predicted using support vector machine (SVM) modeling on the basis of special kernel functions enabling compound pair predictions.9 These findings have also catalyzed further AC predictions using SVM and support vector regression (SVR) variants10–12 and other methods,13–18 as discussed below. Recently, various deep neural network architectures have been used to predict ACs from images14,15 and molecular graphs using representation learning16 or to derive regression models for potency prediction of AC compounds.17,18
In this work, we further extend this methodological spectrum by introducing chemical language models for combined AC prediction and generative compound design. Compared to earlier studies predicting ACs using classification models, the approach presented herein was designed to extend AC predictions with the capacity to produce new AC compounds, thus integrating predictive and generative modeling in the context of AC analysis and AC-based compound design.
MMPs comprising the general data set were represented as triples:
(CompoundA, CompoundB, PotencyB − PotencyA).
CompoundA represented the source compound, which was concatenated with the potency difference (PotencyB − PotencyA), while CompoundB represented the target compound. Each MMP yielded two triples, in which each MMP compound was used once as the source and once as the target compound. The source and target compounds were then used as the input and associated output for model training, respectively. Furthermore, for MMP-triples, data ambiguities could arise if an MMP was associated with multiple potency values for different targets or if a given source compound and potency difference were associated with multiple target compounds from different activity classes. Such MMPs were eliminated. Finally, for the general data set, a total of 338748 qualifying MMP-triples were obtained.
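A minimal sketch of this triple construction is shown below, assuming that MMP records are available as Python dictionaries with hypothetical field names; the ambiguity filter is a simplified interpretation of the criteria described above.

```python
from collections import defaultdict

def build_mmp_triples(mmp_records):
    """Build directed (source, target, potency difference) triples from MMPs.

    Each record is assumed to be a dict with 'smiles_a', 'smiles_b', 'pki_a',
    'pki_b', and 'activity_class' keys (hypothetical field names). Each MMP
    yields two triples, using each compound once as source and once as target.
    """
    triples = []
    for rec in mmp_records:
        delta = round(rec["pki_b"] - rec["pki_a"], 2)
        triples.append((rec["smiles_a"], rec["smiles_b"], delta, rec["activity_class"]))
        triples.append((rec["smiles_b"], rec["smiles_a"], -delta, rec["activity_class"]))

    # Simplified ambiguity filter: discard triples whose (source compound,
    # potency difference) input maps to targets from more than one class.
    classes_per_input = defaultdict(set)
    for src, _tgt, delta, cls in triples:
        classes_per_input[(src, delta)].add(cls)
    return [(src, tgt, delta)
            for src, tgt, delta, cls in triples
            if len(classes_per_input[(src, delta)]) == 1]
```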
For modeling, MMP-triples were randomly divided into training (80%), validation (10%), and test (10%) sets. Source and target compounds from MMP-triples displayed nearly indistinguishable potency value distributions.
For the initial evaluation of chemical language models, three different test (sub)set versions were designed:
(i) Test-general: complete test set of 33875 MMP-triples excluded from model training.
(ii) Test-core: subset of 2576 test set MMP-triples with core structures not present in training compounds.
(iii) Test-sub: subset of 14193 MMP-triples with substituents (R-groups) not contained in training compounds.
For the generation of these test subsets, compounds were decomposed into core structures and substituents via MMP fragmentation.5
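For illustration, single-cut fragmentation of this kind can be performed with RDKit's MMPA module as sketched below; the example SMILES is arbitrary, and the exact output format of FragmentMol may vary between RDKit versions.

```python
from rdkit import Chem
from rdkit.Chem import rdMMPA

# Arbitrary example compound.
mol = Chem.MolFromSmiles("c1ccccc1CC(=O)NC1CC1")

# Single-cut fragmentation along acyclic single bonds, corresponding to the
# MMP-style decomposition of a compound into a core and a substituent.
fragmentations = rdMMPA.FragmentMol(mol, maxCuts=1, resultsAsMols=False)

for core, chains in fragmentations:
    # For single cuts, both fragments are typically returned in 'chains',
    # separated by '.', with attachment points marked by dummy atoms ([*:1]).
    print(core, "|", chains)
```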
MMP-Cliffs and MMP-nonCliffs were extracted from four large activity classes including inhibitors of thrombin (ChEMBL ID 204) and tyrosine kinase Abl (1862) as well as antagonists of the Mu opioid receptor (233) and corticotropin releasing factor receptor 1 (1800). For MMP-Cliffs and MMP-nonCliffs, triples were ordered such that CompoundA had lower potency than (or equal potency to) CompoundB. These activity classes were excluded from the general data set and their MMP-Cliffs and MMP-nonCliffs thus formed an external/independent test set for AC prediction (Table 1).
| Target name | ChEMBL ID | Total MMPs | MMP-Cliffs | MMP-nonCliffs |
|---|---|---|---|---|
| Thrombin | 204 | 4249 | 438 | 2976 |
| Mu opioid receptor | 233 | 5875 | 329 | 4319 |
| Tyrosine kinase Abl | 1862 | 5403 | 564 | 3093 |
| Corticotropin releasing factor receptor 1 | 1800 | 3068 | 317 | 1889 |
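Relating to the MMP-Cliff and MMP-nonCliff counts in Table 1, cliff assignment follows the 100-fold potency difference criterion discussed above; the sketch below labels MMPs accordingly, where the nonCliff margin of ΔpKi ≤ 1 is an assumption introduced here for illustration (consistent with the two categories in Table 1 not summing to the total number of MMPs).

```python
def label_mmp(pki_a: float, pki_b: float,
              cliff_cutoff: float = 2.0,
              noncliff_cutoff: float = 1.0) -> str:
    """Label an MMP as 'MMP-Cliff', 'MMP-nonCliff', or 'undefined'.

    The 100-fold criterion (delta pKi >= 2) marks cliffs; the nonCliff
    margin (delta pKi <= 1, an assumption) excludes borderline pairs.
    """
    delta = abs(pki_b - pki_a)
    if delta >= cliff_cutoff:
        return "MMP-Cliff"
    if delta <= noncliff_cutoff:
        return "MMP-nonCliff"
    return "undefined"

print(label_mmp(6.2, 8.9))  # MMP-Cliff
print(label_mmp(6.2, 6.8))  # MMP-nonCliff
```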
(Source compound, Potency difference) → (Target compound).
Then, given a new (Source compound, Potency difference) test instance, trained models were supposed to generate a set of target candidate compounds meeting the potency difference condition.
Sequence-to-sequence (Seq2Seq) models represent an encoder–decoder architecture designed to convert an input sequence (such as a character string) into an output sequence.22 These models can be adapted for a variety of applications, especially for neural machine translation.22 The encoder reads an input sequence and compresses it into a context vector as its last hidden state. The context vector serves as the input for the decoder network component that interprets the vector to predict an output sequence. Because long input sequences often present challenges for generating context vectors,23 an attention mechanism24 was introduced that utilizes hidden states from each time step of the encoder. As a further advance, a transformer neural network architecture was introduced that only relies on the attention mechanism.25 The transformer architecture comprises multiple encoder–decoder modules (Fig. 1). An encoder module consists of a stack of encoding layers, each composed of two sub-layers: a multi-head self-attention sub-layer and a fully connected feed-forward network (FFN) sub-layer. Multi-head attention combines multiple single-head attention functions acting in parallel such that different positions in the input sequence can be processed simultaneously. The attention mechanism is based upon the following function:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_k}}\right)V \qquad (1)$$
The input for the attention layer is received in the form of three parameters: queries (Q), keys (K), and values (V). In addition, a scaling factor d_k (the dimension of the key vectors) prevents excessively large dot products.25 More details concerning the attention function are provided in the original literature on the transformer model.25 The FFN sub-layer employs rectified linear unit (ReLU) activation.26 The multi-head self-attention and FFN sub-layers are linked via layer normalization27 and a residual skip-connection.28 Each decoder layer contains three sub-layers: an FFN sub-layer and two multi-head attention sub-layers. The first attention sub-layer is masked such that the decoder cannot attend to subsequent output positions during generation.
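As a minimal illustration of eqn (1), the following NumPy sketch computes single-head scaled dot-product attention; shapes and values are arbitrary examples.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention, eqn (1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Toy example: 4 query positions, 6 key/value positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```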
In this work, all source and target molecules were represented as canonical SMILES strings generated using RDKit29 and further tokenized to construct a chemical vocabulary containing all the possible chemical tokens. The start and end of a sequence were represented by two special “start” and “end” tokens, respectively. For AC prediction, models must be guided towards the generation of compounds meeting potency difference constraints. Therefore, potency differences captured by MMPs were tokenized by binning.23 The potency difference, ranging from −8.02 to 9.53, was partitioned into 1755 bins of width 0.01 that were also added to the chemical vocabulary. Each bin was encoded by a single token and each potency difference was assigned to the token of the corresponding bin (Fig. 1), e.g., a potency difference of 2.134 was encoded as ‘pKi_change_(2.13, 2.14)’. Accordingly, the tokenization preserved the quantitative relationship between bins. The SMILES representation of a source compound combined with its potency difference token then represented the input sequence for the transformer encoder and was converted into a latent representation. Based on this representation, the transformer decoder iteratively generated output SMILES sequences until the end token was obtained. During training, the transformer model minimized the cross-entropy loss between the ground-truth target and output sequence.
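A small sketch of this potency-difference tokenization is given below; the bin boundaries and token format follow the example above, while the handling of boundary values is an assumption.

```python
def potency_diff_token(delta_pki: float, low: float = -8.02,
                       high: float = 9.53, width: float = 0.01) -> str:
    """Map a potency difference to its bin token, e.g. 2.134 -> 'pKi_change_(2.13, 2.14)'."""
    if not low <= delta_pki <= high:
        raise ValueError("potency difference outside the tokenized range")
    idx = int((delta_pki - low) // width)   # index of the 0.01-wide bin
    lower = round(low + idx * width, 2)
    upper = round(lower + width, 2)
    return f"pKi_change_({lower}, {upper})"

print(potency_diff_token(2.134))  # pKi_change_(2.13, 2.14)
```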
| Model | Hyperparameters | Value space for optimization |
|---|---|---|
| SVM | Kernel function | 'linear', 'sigmoid', 'poly', 'rbf', 'tanimoto' |
| | C | 1, 10, 100, 1000, 10000 |
| | Gamma | 10⁻⁶, 10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹ |
| RF | Max_depth | 3, 4, 5, 6, 7, 8, 9, 10 |
| | Max_features | 32, 64, 128, 256, 512, 1024 |
| | n_estimators | 1, 2, 4, 8, 16, 32, 64, 100, 200 |
| XGBoost | Max_depth | 3, 4, 5, 6, 7, 8, 9, 10 |
| | n_estimators | 1, 2, 4, 8, 16, 32, 64, 100, 200 |
| | Learning_rate | 0.0001, 0.001, 0.01, 0.1, 0.2, 0.3 |
| | Subsample | 0.5, 0.6, 0.7, 0.8, 0.9, 1 |
| | Min_child_weight | 0, 1, 2, 3, 4, 5 |
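The sketch below illustrates how such a grid search can be set up, assuming scikit-learn with fingerprint-based feature matrices; X_train and y_train are placeholders, and only the RF grid from the table is shown.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid mirroring the RF rows of the hyperparameter table above.
param_grid = {
    "max_depth": [3, 4, 5, 6, 7, 8, 9, 10],
    "max_features": [32, 64, 128, 256, 512, 1024],
    "n_estimators": [1, 2, 4, 8, 16, 32, 64, 100, 200],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="balanced_accuracy",
    cv=5,
    n_jobs=-1,
)
# X_train: fingerprint feature matrix (>= 1024 bits assumed), y_train: class labels.
# search.fit(X_train, y_train)
# print(search.best_params_)
```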
$$\mathrm{Reproducibility} = \frac{\mathrm{MMP_{repro}}}{\mathrm{MMP_{test}}} \qquad (2)$$
MMP_test and MMP_repro denote the number of MMP-triples that were tested and reproduced by a model, respectively. Notably, this definition of reproducibility directly corresponds to the recall of labeled instances for classification models.
AC predictions were also evaluated by determining the true positive rate (TPR), true negative rate (TNR), and balanced accuracy (BA),37 defined as:
$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \qquad (3)$$

$$\mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} \qquad (4)$$

$$\mathrm{BA} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2} \qquad (5)$$
TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
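These metrics can be computed directly from binary predictions, as in the following sketch, a plain implementation of eqn (3)–(5); labels use 1 for MMP-Cliffs and 0 for MMP-nonCliffs.

```python
def classification_metrics(y_true, y_pred):
    """Compute TPR, TNR, and balanced accuracy from binary labels (1 = MMP-Cliff)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr, (tpr + tnr) / 2

print(classification_metrics([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))  # (0.5, 0.667, 0.583)
```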
(Source compound, Potency difference) → (Target compound).
Then, given a new (Source compound, Potency difference) test instance, the pre-trained models should generate target compounds with appropriate potency. For deriving pairs of source and target compounds, the MMP formalism was applied. For AC prediction, pre-trained models were subjected to fine-tuning on MMP-Cliffs and MMP-nonCliffs from given activity classes, corresponding to the derivation of other supervised machine learning models.
| Model | Test-general | Test-core | Test-sub |
|---|---|---|---|
| Seq2Seq | 0.719 | 0.370 | 0.759 |
| Transformer | 0.818 | 0.528 | 0.850 |
For the entire test set, the Seq2Seq and transformer models achieved reproducibility of 0.719 and 0.818, respectively. Hence, the models were able to regenerate more than 70% and 80% of the target compounds from MMP-triples not used for training, respectively. Notably, reproducibility was consistently higher for the transformer than for the Seq2Seq model across all test set versions (Table 3). Hence, preference for AC prediction was given to the transformer. The test-general reproducibility of more than 80% was considered high. Attempting to further increase this reproducibility might compromise the ability of the model to generate novel compounds by strongly focusing on chemical space encountered during training. As expected, the test-core reproducibility was generally lowest because in this case, the core structures of MMPs were not available during training (limiting reproducibility much more than in the case of test-sub, i.e., evaluating novel substituents).
| Reproducibility | ChEMBL 204 | ChEMBL 1862 | ChEMBL 233 | ChEMBL 1800 |
|---|---|---|---|---|
| MMP-Cliffs | 0.050 | 0.007 | 0.049 | 0.006 |
| MMP-nonCliffs | 0.185 | 0.081 | 0.188 | 0.035 |
Therefore, a transfer learning approach was applied by fine-tuning the pre-trained transformer on these activity classes. For fine-tuning, 5%, 25%, and 50% of MMP-Cliffs and MMP-nonCliffs of each class were randomly selected. The resulting models were then tested on the remaining 50% of the MMP-Cliffs and MMP-nonCliffs.
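A sketch of this per-class data partitioning is shown below; the exact protocol (a fixed 50% held-out test set, with the fine-tuning fraction drawn from the remaining half) is our interpretation of the description above.

```python
import random

def finetune_split(triples, train_fraction, seed=0):
    """Split one activity class into a fine-tuning set and a fixed 50% test set."""
    rng = random.Random(seed)
    shuffled = triples[:]
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 2
    test, pool = shuffled[:n_test], shuffled[n_test:]
    n_train = int(round(train_fraction * len(shuffled)))  # 5%, 25%, or 50% of the class
    return pool[:n_train], test

# Example: 5% of a class with 438 MMP-Cliffs (cf. thrombin in Table 1).
cliffs = [f"cliff_{i}" for i in range(438)]
train, test = finetune_split(cliffs, 0.05)
print(len(train), len(test))  # 22 219
```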
Only 5% of the training data were required for fine-tuning to achieve reproducibility rates of 70% to greater than 80% for MMP-Cliffs from the different activity classes (Fig. 2A, solid lines). For MMP-nonCliffs, 25% of the training data were required to achieve reproducibility between 60% and 80% for the different classes (Fig. 2B, solid lines). For practical applications, these findings were encouraging because for any given target, there were many more MMP-nonCliffs available than MMP-Cliffs.
Fig. 2 Reproducibility of MMP-Cliffs and MMP-nonCliffs after fine-tuning. For (A) MMP-Cliffs and (B) MMP-nonCliffs from different activity classes (identified by ChEMBL target IDs according to Table 1), reproducibility is reported as a function of the transfer ratio, i.e., the percentage of training data used for fine-tuning. Solid lines represent results for true MMP-Cliffs and MMP-nonCliffs, and dashed lines represent control data obtained by inverting potency differences for MMP-Cliffs and MMP-nonCliffs.
Furthermore, to directly test whether the high reproducibility achieved through fine-tuning only depended on learning structural relationships encoded by MMPs or whether potency differences were also learned, a prerequisite for meaningful AC prediction, control calculations with inverted potency differences were carried out. To this end, potency differences were set to ΔpKi = 0.1 for all MMP-Cliffs and to ΔpKi = 2.0 for all MMP-nonCliffs. Using these hypothetical (SAR-nonsensical) data as test instances, reproducibility rates were determined again. In this case, reproducibility rates remained well below 50% for both MMP-Cliffs (Fig. 2A, dashed lines) and MMP-nonCliffs (Fig. 2B, dashed lines) and further decreased with increasing amounts of training data used for fine-tuning. These findings conclusively showed that the conditional transformer associated structural relationships with corresponding potency differences, thereby learning to reproduce and differentiate between MMP-Cliffs and MMP-nonCliffs.
In the following, the conditional transformer for AC prediction is referred to as DeepAC.
We also evaluated the capability of the model to reconstruct both MMP-Cliffs and MMP-nonCliffs originating from the same source compound. For each activity class, we compiled a set of source compounds from the original test data. Then, models were fine-tuned with varying amounts of data and applied to reproduce MMP-Cliff and MMP-nonCliff target compounds from the same source compound. As shown in Fig. 3, DeepAC reproduced more than 80% of the target compounds using 5%, 25%, or 50% of fine-tuning data, depending on the activity class.
The predictions using different methods were generally stable, yielding only low standard deviations over independent trials (Fig. 6). Using 5% of training data for fine-tuning or model derivation, the recall (TPR) of MMP-Cliffs was consistently higher for DeepAC than the reference methods, which failed on two activity classes (Fig. 6). For increasing amounts of training data, recall performance of the reference methods further increased and SVM reached the 80% or 90% recall level of DeepAC in two cases when 50% of available data were used for training (Fig. 6).
Fig. 6 Recall of MMP-Cliffs. For four different methods, recall/reproducibility of MMP-Cliffs is reported for (A) thrombin inhibitors, (B) Mu opioid receptor ligands, (C) corticotropin releasing factor receptor 1 ligands, and (D) tyrosine kinase Abl inhibitors. Average recall over five independent trials is reported for increasing amounts of training data randomly selected from the complete data set (error bars indicate standard deviations). Statistical tests are shown according to Fig. 4. |
For MMP-nonCliffs, representing the majority class for the predictions, a different picture was obtained. Here, the recall of the reference methods for increasing amounts of training data was mostly greater than 90% and significantly higher than that of DeepAC (Fig. 7). For DeepAC, recall/reproducibility increased with increasing amounts of training data and, for two activity classes, reached levels very similar to those of the reference methods when 50% of the training data were used.
Fig. 7 Reproducibility of MMP-nonCliffs. In (A)–(D), reproducibility of MMP-nonCliffs is reported using four different methods. Statistical tests are shown according to Fig. 4. |
Calculation of BA for the prediction of MMP-Cliffs and MMP-nonCliffs gave similar results for all methods (Fig. 8). The level of 80% BA was generally reached with 25% or 50% of the training data. For the largest training sets, all methods were comparably accurate for two activity classes, SVM reached the highest accuracy for one class, and DeepAC for another (Fig. 8). Compared to the other methods, DeepAC produced higher TPR and lower TNR values, resulting in overall comparable BA. Clearly, a major strength of DeepAC was its ability to accurately predict MMP-Cliffs on the basis of small training data sets.
Fig. 8 Prediction accuracy. Reported are mean BA values and standard deviation (error bars) for the prediction of MMP-Cliffs and MMP-nonCliffs. In (A)–(D), results are reported using four different methods and statistical tests according to Fig. 4. |
| Study | AC criteria, similarity/potency difference | Prediction task | Methods | Prediction accuracy |
|---|---|---|---|---|
| Heikamp et al.9 | MMP/100-fold | ACs for 9 activity classes | Fingerprint-based SVM with MMP kernels | F1: 0.70–0.99 |
| Husby et al.13 | Binding mode similarity (80%)/100-fold | 3D-ACs for 9 activity classes | Docking/VLS | AUC: 0.75–0.97 |
| Horvath et al.10 | MMP/100-fold | ACs for 7 activity classes | CGR and descriptor recombination-based SVM/SVR | F1: 0.61–0.92 |
| Tamura et al.12 | MMP/100-fold | ACs for 9 activity classes | Fingerprint-based SVM with Tanimoto kernel | MCC: ∼0.20–0.80 |
| Iqbal et al.14 | MMP/100-fold | ACs from MMP images and R-groups (5 activity classes) | Image-based CNN with transfer learning | F1: 0.28–0.76; MCC: 0.24–0.73 |
| Iqbal et al.15 | MMP/100-fold | ACs from MMP images (3 activity classes) | Image-based CNN | F1: 0.36–0.85; AUC: 0.92–0.97; MCC: 0.39–0.83 |
| Tamura et al.11 | MMP/100-fold | ACs for 2 activity classes | Fingerprint-based SVM with MMP kernel | AUC: 0.46–0.69; MCC: 0.69–0.89 |
| Park et al.16 | MMP/100-fold | ACs for 3 activity classes | GCN | F1: 0.34–0.49; AUC: 0.91–0.94; MCC: 0.40–0.49 |
| Jiménez-Luna et al.17 | MCS/10-fold | — | RF/DNN/GRAPHNET/GCN/MPNN/GAT | RMSE: 0.698–1.029 |
| Tilborg et al.18 | Scaffold SMILES similarity (90%)/10-fold | ACs for 30 activity classes | KNN/RF/GBM/SVM/MPNN/GAT/GCN/AFP/LSTM/CNN/Transformer | RMSE: 0.62–1.60 |

a Abbreviations: SVM/R (support vector machine/regression); F1 (mean F1 score); AUC (area under the ROC curve); MCC (Matthews' correlation coefficient); 3D-ACs (three-dimensional activity cliffs); VLS (virtual ligand screening); CGR (condensed graphs of reaction); CNN (convolutional neural network); MCS (maximum common substructure); RF (random forest); DNN (deep neural network); GCN (graph convolutional network); MPNN (message passing neural network); GAT (graph attention network); RMSE (root mean square error); KNN (K-nearest neighbor); GBM (gradient boosting machine); AFP (attentive fingerprint); LSTM (long short-term memory network).
With DeepAC, we have introduced the use of conditional chemical language models for AC prediction. Given that most studies in Table 5 reported F1 (ref. 38) and Matthews' correlation coefficient (MCC)39 scores for evaluating prediction accuracy, we also calculated these scores for the DeepAC predictions reported herein. With F1 values of 0.50–0.78 and MCC values of 0.43–0.75, DeepAC also yielded state-of-the-art prediction accuracy (and higher accuracy than recent AC predictions using graph neural networks16). However, DeepAC is principally distinguished from other AC prediction approaches by its ability to generate new compounds meeting AC criteria, which partly motivated its development.