Gökçe Geylan *ab, Leonardo De Maria c, Ola Engkvist ad, Florian David b and Ulf Norinder efg
a Molecular AI, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden. E-mail: gokce.geylan@astrazeneca.com
b Division of Systems and Synthetic Biology, Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
c Medicinal Chemistry, Research and Early Development, Respiratory & Immunology, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
d Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, Gothenburg, Sweden
e Department of Computer and Systems Sciences, Stockholm University, Kista, Sweden
f MTM Research Centre, School of Science and Technology, Örebro University, Örebro, Sweden
g Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden
First published on 30th July 2024
Being able to predict the cell permeability of cyclic peptides is essential for unlocking their potential as a drug modality for intracellular targets. With a wide range of studies of cell permeability but a limited number of data points, the reliability of machine learning (ML) models in predicting previously unexplored chemical spaces becomes a challenge. In this work, we systematically investigate the predictive capability of ML models from the perspective of their extrapolation to never-before-seen applicability domains, with a particular focus on the permeability task. Four predictive algorithms, namely Support-Vector Machine, Random Forest, LightGBM and XGBoost, were employed jointly with a conformal prediction framework to characterize and evaluate the applicability domain through uncertainty quantification. The efficiency and validity of the models' predictions under multiple calibration strategies were assessed with respect to several external datasets from different parts of the chemical space through a set of experiments. The experiments showed that predictors generalizing well to the applicability domain defined by the training data can fail to achieve similar performance on other parts of the chemical space. Our study proposes an approach to overcome such limitations by improving the efficiency of the models without sacrificing their validity. The trade-off between reliability and informativeness was balanced when the models were calibrated with a subset of the data from the new targeted domain. This study outlines an approach to extrapolate predictive power and restore model reliability via a recalibration strategy, without the need to retrain the underlying model.
Integrating ML into drug discovery and development allows more complex challenges to be tackled and has the potential to accelerate drug projects. ML approaches could uncover patterns and help predict the impact of a set of design ideas on permeability across a diverse set of peptide sequences. Training predictive models for cell permeability has previously been investigated with peptide sequence data that often includes peptides conjugated to other molecules, such as small molecules and antisense oligonucleotides. This has led to models that are either specific to peptides with a certain conjugation partner or indifferent to the influence of any conjugate, posing a challenge for accurate profiling of the cell permeability of individual peptides.4–6 There is therefore still a need for a predictor of the membrane permeability of individual peptides.
The peptidic chemical space has not been explored as thoroughly with in silico methods as the small molecule space, and the availability of comprehensive peptide data sources is limited, even more so for cyclic peptides. However, a considerable number of studies have been published on traversing biological barriers with individual cyclic peptides. These studies generally present design strategies or methods applied to target peptide sequences and experimentally validate the permeability achieved upon modification.7–10 Collecting data from different publications gives rise to two noteworthy considerations. First, the reported permeability values predominantly concern design strategies that improve permeability; compiling datasets from such studies therefore leads to a class imbalance towards positive instances for ML purposes. Secondly, combining data from multiple sources requires further assessments, such as the risk of data leakage. Data leakage occurs when the model is presented with additional information in the training data about what it will predict.11 This leads to overoptimistic training performance, while the models suffer from poor generalizability during application. Data leakage has been shown to be a widespread problem when training models and one of the main causes of the “reproducibility crisis”.11,12 Good preprocessing practices include avoiding duplicated entries across or within the training, validation and test sets, collecting the same assay type, and splitting off the test set so that it represents never-before-seen external data from separate data sources.11 These preprocessing practices are preferred for the reproducibility and utilization of the models.11 Nevertheless, the advantage of training models on a compiled dataset is a robust predictive performance that can generalize across different parts of the chemical space.
The applicability domain of a predictive model is defined as the chemical space learned during training, which determines the extent to which reliable predictions can be obtained. As test samples move further away from this part of the chemical space and become less similar to the training data, the reliability of the model's predictions decreases, ultimately defining its applicability domain.13 Being informed of the model's boundaries allows researchers to interpret predictions as reliable or not and enables efficient navigation of the explored chemical space. Acquiring insight into the applicability domain is therefore necessary for valid and reliable decision-making based on model predictions. One way to characterize the reliability of model predictions is uncertainty quantification.14
Conformal prediction (CP) is a mathematical framework used in conjunction with a previously trained model to provide uncertainty quantification on the model's predictions by utilizing calibration examples.15 The inductive CP methodology turns point predictions into prediction intervals for regression or prediction sets for classification.15 The CP methodology assumes the training and test data to be exchangeable, implying that the data points are not related to each other, in other words, independent.15 The CP application begins with the proper training set to train the predictor and the calibration set to generate the nonconformity scores. The nonconformity scores describe how different a data point is from the previously observed examples. After the model is trained, the calibration set is used to establish the mapping between these nonconformity scores and how confident the model is in its predictions.16 Upon training and calibrating, the model can be tested for its predictive performance on never-before-seen test data. If the predicted values are found to have high nonconformity scores, this translates to high dissimilarity to the calibration set examples and is associated with a lower confidence, expressed as p-values. The predictions produced by the model depend on the user-specified significance level (α), which limits the error rate of the predictions. Intuitively, the significance level sets the model's confidence by providing prediction intervals for regression models or one of the prediction sets for classification models.15,17 In a binary classification case, the conformal prediction framework outputs one of four label sets, as opposed to the two class labels, positive and negative. The labels are determined by comparing the class-specific p-values of the predicted data point against the calibration set examples. If the model either cannot make a reliable prediction at the defined significance level or assigns similar confidences to both classes, the assigned label is “Empty” (a no-label prediction) or “Both” (a two-label prediction), respectively. Otherwise, the single class whose p-value exceeds the set significance level is assigned, giving a positive or a negative label (a single-label prediction). Owing to its mathematical guarantees, the CP framework aims to yield valid and meaningful predictions at a user-determined significance level.17
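To make the mechanics above concrete, the following is a minimal sketch of a Mondrian (class-conditional) inductive conformal classifier in Python with scikit-learn. The toy data, the variable names and the nonconformity measure (one minus the predicted class probability) are illustrative assumptions, not the exact setup used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy binary data standing in for peptide descriptors and permeability labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 16))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

# Split into proper training, calibration and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Mondrian nonconformity scores: 1 - P(class) on the calibration set, kept per class.
cal_proba = model.predict_proba(X_cal)
cal_scores = {c: 1.0 - cal_proba[y_cal == c, c] for c in (0, 1)}

def p_values(x):
    """Class-conditional p-values for a single test example."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    pvals = {}
    for c in (0, 1):
        score = 1.0 - proba[c]
        # Fraction of calibration scores at least as nonconforming (+1 smoothing).
        pvals[c] = (np.sum(cal_scores[c] >= score) + 1) / (len(cal_scores[c]) + 1)
    return pvals

def prediction_set(x, alpha=0.2):
    """Label set at significance level alpha: single-label, 'Both' or 'Empty'."""
    pvals = p_values(x)
    labels = [c for c in (0, 1) if pvals[c] > alpha]
    if len(labels) == 2:
        return "Both"
    if not labels:
        return "Empty"
    return "Permeable" if labels[0] == 1 else "Non-permeable"

print(prediction_set(X_test[0]), p_values(X_test[0]))
```

At α = 0.2, the significance level used throughout this work, a class is included in the prediction set whenever its p-value exceeds 0.2, which is what produces the single-label, “Both” and “Empty” outcomes described above.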
Traditionally, the training set is used to generate both the proper training and the calibration sets. When the trained model is used to make predictions on new data that is dissimilar to the training data, the exchangeability assumption between the calibration and test data may be compromised, leading to lower validity than expected.18 One approach to overcome this is adding a fraction of the external test set examples to the calibration set to make the two sets more exchangeable. The incorporation of some of the test set samples into the calibration set has been argued to restore the confidence of models on never-before-seen data. To demonstrate this, models trained on public toxicity data were shown to have low validity on both more recent time-split and proprietary external test data until this recalibration strategy was applied.18,19 The models can make valid predictions on a space outside of the applicability domain by recalibrating with an “updated” calibration set.18 In another study on toxicity prediction, this strategy was shown to be more effective than retraining the model with the external test data.19 Besides producing more reliable predictions, it was also more efficient, as it requires fewer computational resources than training from scratch.19 Heyndrickx et al. used conformal prediction in a federated learning setting across pharmaceutical companies; in that work, efficiency was established as a metric to assess the extent of the applicability domain and the reliability of predictions on unseen molecule datasets.20 Inspired by these small molecule-focused methods, we aim to explore various recalibration strategies on beyond-the-rule-of-5 molecules and to extend the proposed methodology to peptides.
In this study, we conduct a series of experiments to extend these recalibration strategies and show their influence on model performance and on the confidence of the predictions. The study systematically investigates building predictors with various calibration strategies and then evaluates the impact of the calibration on the models' predictions on external test sets, with a particular focus on peptide permeability. The external test sets were selected as never-before-seen data from different data sources to assess the limitations of the predictive models on diverse beyond-the-rule-of-5 data. The study aims to present a modelling approach for building predictors that provide reliable predictions with a predefined expected error on new applicability domains, without the need to retrain the model.
The data processing yielded 6876 data points, with cycle sizes ranging from 12 to 46 atoms, from 35 different sources, four of which comprise approximately 90% of the data (Table 1). The processed data was imbalanced, with 67.6% permeable entries. The sparsity of the data sources varied: some sources contained diverse peptides, while others were composed of a set of cyclic peptides in which stereochemical modifications were introduced to the wild-type amino acid sequence. The principal component analysis and visualizations of the chemical space covered were conducted with ChemPlot.26
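A projection of this kind can be reproduced in spirit with a short script; the sketch below uses RDKit Morgan fingerprints and scikit-learn's PCA rather than ChemPlot's own API, and the SMILES strings and source labels are placeholders for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.decomposition import PCA

# Placeholder SMILES and data-source labels standing in for the processed dataset.
smiles = ["O=C1CCCN1", "O=C1CCCCN1", "O=C1CCCCC1"]
sources = ["2021 Kelly", "2020 Townsend", "other sources"]

def morgan_fp(smi, n_bits=1024):
    """Morgan fingerprint (radius 2) as a numpy array."""
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=float)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

X = np.array([morgan_fp(s) for s in smiles])

# Project onto the first two principal components, colouring by data source.
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
for src in sorted(set(sources)):
    mask = np.array(sources) == src
    plt.scatter(coords[mask, 0], coords[mask, 1], label=src)
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} explained variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} explained variance)")
plt.legend()
plt.show()
```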
In Experiment 4, we aimed to refine the calibration strategy by also representing the chemical space of the external test set in the calibration set. This was achieved by integrating a portion of the external test set into the calibration process in each case. To facilitate this, we divided the external test set into five parts, or folds, using stratified cross-validation, ensuring that each fold preserves the overall distribution of the permeability labels. In each of the five iterations, one fold, or 20% of the external test set, was added to the calibration set, updating it. Employing the models built in Experiment 1, we recalibrated them with the updated calibration set to test the models' adaptability to the unseen chemical space. The remaining four folds were kept as the external test set, allowing the assessment of any changes in the efficiency and validity of the predictions of the recalibrated models. This procedure was repeated for each fold, and the results were aggregated by averaging these metrics to obtain a more stable estimate of the models' performance; a sketch of this procedure is given below.
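The following is a minimal sketch of this fold-wise augmented recalibration, assuming numpy arrays for the original calibration set (`X_cal`, `y_cal`) and the external test set (`X_ext`, `y_ext`), and the same simple nonconformity measure as in the earlier conformal prediction sketch; the names and the measure are illustrative, not the exact published setup.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def mondrian_pvalues(model, X_cal, y_cal, X_test):
    """Class-conditional p-values with a 1 - P(class) nonconformity measure."""
    cal_proba = model.predict_proba(X_cal)
    test_proba = model.predict_proba(X_test)
    pvals = np.zeros_like(test_proba)
    for c in range(test_proba.shape[1]):
        cal_scores = 1.0 - cal_proba[y_cal == c, c]
        test_scores = 1.0 - test_proba[:, c]
        pvals[:, c] = [(np.sum(cal_scores >= s) + 1) / (len(cal_scores) + 1)
                       for s in test_scores]
    return pvals

def efficiency_validity(pvals, y_true, cls, alpha=0.2):
    """Per-class efficiency (single-label fraction) and validity (true label in set)."""
    sets = pvals > alpha
    mask = y_true == cls
    eff = np.mean(sets[mask].sum(axis=1) == 1)
    val = np.mean(sets[mask, cls])
    return eff, val

def recalibrate_augmented(model, X_cal, y_cal, X_ext, y_ext, alpha=0.2):
    """Experiment 4-style recalibration: one external fold (~20%) augments the
    original calibration set; the remaining four folds are predicted and scored.
    The underlying model is not retrained."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    for keep_idx, fold_idx in skf.split(X_ext, y_ext):
        X_aug = np.vstack([X_cal, X_ext[fold_idx]])
        y_aug = np.concatenate([y_cal, y_ext[fold_idx]])
        pvals = mondrian_pvalues(model, X_aug, y_aug, X_ext[keep_idx])
        scores.append([efficiency_validity(pvals, y_ext[keep_idx], c, alpha)
                       for c in (0, 1)])
    return np.mean(scores, axis=0)  # mean (efficiency, validity) per class
```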
In Experiment 5, the same strategy as in Experiment 4 was followed to assess another recalibration strategy. Instead of concatenating a fold to the pre-existing calibration set, we utilized each fold as a standalone calibration set for five iterations. During the recalibration, the models were not retrained, as the proper training set was kept unchanged. The model performance on the remaining four folds was assessed with the conformal prediction metrics to detect any changes in the models' reliability. The impact of the strategies in Experiments 4 and 5 was reported as the change in efficiency and validity from the respective values in Experiment 3. A difference in efficiency greater than 0 indicated an improvement of the model's efficiency due to the recalibration strategy. Any validity score above 0.8 was levelled to 0.8 before the change was computed; this adjustment mitigates potential bias from inflated validity, as the models' confidence was defined to be 80%. Consequently, any validity change above 0 signified an improvement towards the specified confidence level, while any difference below 0 indicated a deterioration of validity.
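Experiment 5's standalone recalibration and the capped comparison against the Experiment 3 baseline can be sketched in the same vein, reusing `mondrian_pvalues`, `efficiency_validity` and `StratifiedKFold` from the sketch above (again with hypothetical names):

```python
import numpy as np

def recalibrate_standalone(model, X_ext, y_ext, alpha=0.2):
    """Experiment 5-style recalibration: each external fold (~20%) serves on its
    own as the calibration set; the remaining four folds are predicted and scored."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    for keep_idx, fold_idx in skf.split(X_ext, y_ext):
        pvals = mondrian_pvalues(model, X_ext[fold_idx], y_ext[fold_idx], X_ext[keep_idx])
        scores.append([efficiency_validity(pvals, y_ext[keep_idx], c, alpha)
                       for c in (0, 1)])
    return np.mean(scores, axis=0)  # mean (efficiency, validity) per class

def capped_validity_change(validity_new, validity_baseline, cap=0.8):
    """Change in validity versus the Experiment 3 baseline, with any score above
    the pre-defined confidence level (0.8) levelled to 0.8 before differencing."""
    return min(validity_new, cap) - min(validity_baseline, cap)
```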
Lastly, Experiment 6 consisted of an evaluation of the outcomes of Experiments 4 and 5, where the two recalibration strategies were compared based on their potential improvement of the efficiency and validity scores over the traditional conformal prediction established as the baseline in Experiment 3.
In the next scenario, the data splitting and the construction of cross-validation folds were achieved with stratification on both the data sources and the permeability label. During the stratification, the data from each source was placed in distinct sets or folds, such that data from a particular source can only be found in a particular fold. This case will be referred to as “Case 2: split on data sources” for the remainder of this study.
In the third and final scenario, the splitting strategy was implemented with stratification on canonical group labels assigned to the peptides. These labels were determined by removing the stereochemistry, canonicalizing the SMILES and checking whether this yields identical representations. The groups with identical representations were distributed to folds, making sure that instances belonging to the same group were kept together. The singleton data entries were later added to the respective datasets as 20% calibration and 80% proper training set after stratification on the permeability label. This model will be referred to as “Case 3: split on canonical groups”. The experiments are summarized in Fig. 1.
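The grouping used in Cases 2 and 3 can be expressed compactly with RDKit and scikit-learn, as sketched below. The stereo-agnostic key is obtained by stripping stereochemistry and writing the canonical SMILES, and StratifiedGroupKFold (scikit-learn ≥ 1.0) is one way to keep groups together while stratifying on the permeability label; the dataframe contents and column names are placeholders, and the exact fold construction used in this work may differ.

```python
import pandas as pd
from rdkit import Chem
from sklearn.model_selection import StratifiedGroupKFold

def stereo_agnostic_key(smiles: str) -> str:
    """Canonical SMILES with stereochemistry removed, used to group
    peptides that are stereochemical variants of each other (Case 3)."""
    mol = Chem.MolFromSmiles(smiles)
    Chem.RemoveStereochemistry(mol)
    return Chem.MolToSmiles(mol, canonical=True)

# Placeholder dataframe with SMILES, permeability labels and data sources.
df = pd.DataFrame({
    "smiles": ["C[C@H](N)C(=O)O", "C[C@@H](N)C(=O)O", "NCC(=O)O",
               "C[C@H](O)C(=O)O", "C[C@@H](O)C(=O)O", "OCC(=O)O"],
    "permeable": [1, 1, 0, 0, 1, 0],
    "source": ["2021 Kelly", "2016 Furukawa", "2013 Chugai",
               "2021 Kelly", "2020 Townsend", "other"],
})
df["group"] = df["smiles"].apply(stereo_agnostic_key)

# Case 2 would pass df["source"] as the group key; Case 3 uses df["group"].
sgkf = StratifiedGroupKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(
        sgkf.split(df, y=df["permeable"], groups=df["group"])):
    print(f"fold {fold}: train={list(train_idx)}, test={list(test_idx)}")
```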
Fig. 1 The workflow of Experiments 1–7. Experiments 1 and 2 evaluate the trained model on the internal validation set and the external test set, respectively. Experiments 3, 4 and 5, in order, apply conformal prediction with the calibration set split from the training set, with the calibration set augmented by a subset of the external test set, and with a part of the external test set used alone. Experiment 6 compares the three calibration strategies based on the prediction efficiency and validity on the external test set. Lastly, Experiment 7 explores various cross-validation strategies during model building. A summary table can be found in ESI Table 1.†
The v.2.1.0 package was used to build the conformal prediction framework.33 Conformal prediction adapted to classification tasks was employed. An inductive conformal predictor (ICP) with Mondrian classification, where the nonconformity scores are generated for each class independently, was utilized to prevent error-rate shifts due to class imbalance. Next, 10 ICPs were used as an ensemble and aggregated, with the aim of enhancing the calibration of the prediction intervals by pooling the predictions of the ensemble members; one possible aggregation is sketched below. After the conformal prediction structure was set up, the calibration set was used together with the trained model to calculate the nonconformity scores. The fitted conformal predictor was then deployed to obtain predictions on the external test set. The predictions of new data points were carried out without specifying a significance level, and the p-values were obtained as the output. The model performance was analyzed by investigating both conformal prediction metrics calculated using the p-values and classical model evaluation metrics from scikit-learn.
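One way to aggregate an ensemble of Mondrian ICPs is sketched below: the members share the same trained model but use resampled calibration sets, and their per-class p-values are averaged. The bootstrap resampling and the mean aggregation are assumptions made for illustration, not necessarily the scheme implemented by the package used in this work.

```python
import numpy as np

def ensemble_icp_pvalues(model, X_cal, y_cal, X_test, n_icps=10, seed=0):
    """Aggregated Mondrian ICP ensemble (X_cal, y_cal, X_test as numpy arrays):
    each member bootstraps the calibration set, and the class-conditional
    p-values are averaged across the n_icps members."""
    rng = np.random.default_rng(seed)
    cal_proba = model.predict_proba(X_cal)
    test_proba = model.predict_proba(X_test)
    n_classes = test_proba.shape[1]
    agg = np.zeros_like(test_proba)
    for _ in range(n_icps):
        boot = rng.integers(0, len(y_cal), size=len(y_cal))
        for c in range(n_classes):
            sel = boot[y_cal[boot] == c]
            cal_scores = 1.0 - cal_proba[sel, c]
            test_scores = 1.0 - test_proba[:, c]
            agg[:, c] += [(np.sum(cal_scores >= s) + 1) / (len(cal_scores) + 1)
                          for s in test_scores]
    return agg / n_icps
```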
In Experiment 7, the model evaluation metrics were additionally calculated for the single-label predictions of the test sets at a significance level of 0.20. The formulae for the conformal prediction metrics are provided below.
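Assuming the standard class-conditional definitions (the notation here is introduced for convenience), for a class $c$ with $N_c$ external test examples, true labels $y_i$ and prediction sets $\Gamma_i^{\alpha}$ at significance level $\alpha$:

$$\mathrm{Efficiency}_c = \frac{1}{N_c}\sum_{i:\,y_i=c} \mathbf{1}\!\left[\lvert\Gamma_i^{\alpha}\rvert = 1\right], \qquad \mathrm{Validity}_c = \frac{1}{N_c}\sum_{i:\,y_i=c} \mathbf{1}\!\left[y_i \in \Gamma_i^{\alpha}\right]$$

That is, efficiency is the fraction of single-label predictions and validity is the fraction of prediction sets containing the true label, computed per class.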
Fig. 2 The chemical space visualization of the processed data obtained from CycPeptMPDB,21 coloured by the sources adapted as external test sets and the remaining data serving as the common training data instances in all experiments. The axes indicate the principal components and the percentage of explained variance.
To provide context for the challenges of building predictive models from different data sources, Experiments 1 and 2 aimed to showcase the outcome of traditional model building practices and to establish the model performance on the validation set and the external test sets, respectively. In Experiment 3, conformal prediction was applied to the models built in Experiment 1 and the model performance on the external test sets was re-evaluated through uncertainty quantification. The applicability domain of the models was investigated through two calibration strategies for recalibrating the underlying models: expanding the calibration set with a portion of the external test set (Experiment 4) and using only a part of the external test set as the calibration set (Experiment 5). The chemical spaces spanned by the training, calibration and external test sets utilized in the first five experiments are visualized in ESI Fig. 1.† The influence of the calibration strategies on the efficiency and on restoring the validity of the models was compared in Experiment 6. Lastly, training models on the entire dataset with different cross-validation strategies was tackled to analyze the reliability of the models in the studied chemical space in its entirety.
Fig. 3 The model performance metrics on the internal validation data. The bar plots show (A) balanced accuracy, (B) ROC-AUC, (C) MCC, (D) sensitivity, (E) specificity, and (F) precision scores for the models, coloured as described in the legend. The models and the performance metrics are labelled with the external data set they will be evaluated on in the next experiment, Experiment 2. The purpose of this labelling is that the models designated by the same labels are built on the same training data, and the training data contains all the data sources except for the holdout case label. Scores below or equal to 0 are not visualized in the plot and the tabular data for all the metrics can be found in ESI Table 2.†
The models, except for the “2021 Kelly24” case, generally produced higher sensitivity, around 0.90, and lower specificity, around 0.60. This reflects the influence of data imbalance on model performance, even though the validation data was selected from the same data sources as the training data, where the model would be expected to be more confident. All the models showed a Matthews correlation coefficient above 0.50, implying that the predictions correlate with the actual values and that the binary predictors could separate the two classes efficiently. However, considering the sensitivity and specificity, the models were more prone to predict the “Positive”, or “Permeable”, class. Overall, this experiment shows how a model trained on a specific chemical space performs well on that same space. The models produced predictions with high balanced accuracy and an effective decision boundary separating the two classes, while still being affected by the data distribution. Since the internal validation data was later used as the calibration set in the conformal prediction framework, this experiment also highlights the similarity of the calibration set to the data the models were built on.
Fig. 4 The model performance metrics on the external test sets. The bar plots show (A) balanced accuracy, (B) ROC-AUC, (C) MCC, (D) sensitivity, (E) specificity, and (F) precision scores for the models, coloured as described in the legend. The scores are reported on each model, with distinct colour, and labelled with the name of the external test set. Scores below or equal to 0 are not shown on the plot and the tabular data for the bar plots can be found in ESI Table 3.†
The increase in the difference between sensitivity and specificity demonstrates that the models' predictions are heavily biased towards the “Permeable” class. Thus, when predictive models are applied to extrapolated applicability domains, their inherent biases become more pronounced. The change in predictive power translates into a reduction of the models' ability to generalize to data outside of their applicability domain.
Fig. 5 Bar plots show the conformal prediction metrics, efficiency and validity, calculated on the external test sets predicted by the calibrated models. The metrics were computed at significance level = 0.2, mandating the model to produce predictions with 80% confidence. The efficiency of the instances with true labels of (A) “Non-permeable” and (B) “Permeable”, and the validity scores of the instances with true labels of (C) “Non-permeable” and (D) “Permeable”, are displayed. The tabular data of the plots can be found in ESI Table 4.†
The “2021 Kelly24” case stands out with much higher efficiency and validity for the “Non-permeable” class compared to the “Permeable” class. From the previous experiments, we observed a predictive bias of the models for this case label towards overpredicting the majority “Permeable” class. Using conformal prediction, we can further characterize these models and conclude that the minority class examples in the external test set were comparably closer to the calibration set examples than the majority class. Thus, the models can provide comparably more accurate single-label predictions for the “Non-permeable” class.
For the rest of the holdout cases, the low efficiency values imply that assigning either one of the classes to the predictions was often not possible at the 80% confidence level. Nevertheless, some of the models showed high validity, above 0.75, for both binary classes, while the majority of the models did not achieve this, with one of the binary classes scoring lower than its counterpart. As the class with higher validity was, in general, not accompanied by higher efficiency compared to the other class, this indicates that the models are producing two-label predictions for these classes (Fig. 5). The valid but inefficient predictions display the models' inability to make accurate single-label predictions for the independent test sets, strengthening the generalizability argument about the training and external sets from Experiment 2.
The validity scores showed either no change or a minor increase for the holdout cases “2021 Kelly24” and “2020 Townsend25”. However, the validity of the “Non-permeable” class increases while the score for the “Permeable” class decreases for all but the SVM model for “2016 Furukawa9”, and vice versa for “2013 Chugai23”. These external test sets contain 56.4% and 89.9% “Permeable” class instances, respectively (Table 1). The increase or decrease of the imbalance of the external test sets influences the imbalance of the augmented calibration set. As the class imbalance of the calibration set comes to resemble that of the external test set, the validity of the classes shifts accordingly. While the validity scores exhibited mixed and distribution-dependent changes across model–data pairs, the mean absolute change in validity was 0.01 (±0.08). Additionally, the changes in validity scores, averaged over both classes, did not show a significant reduction with this recalibration strategy. The exact conformal prediction metrics with standard deviations for the recalibration experiment can be found in ESI Table 5.†
The boost in efficiency without drastic changes in validity indicates that the models achieve better reliability, as the nonconformity scores were now calculated partly on the external data set during calibration. The augmented calibration does not provide more valid predictions; however, under the pre-set error rate, it yields more efficient predictions without the need to retrain the underlying model. This signifies that the models have shifted from the two-label, “Both”, predictions of Experiment 3 to providing more single-label predictions of “Permeable” and “Non-permeable”.
The recalibration in this experiment was done solely with subsets of the external test set on the models from Experiments 1 and 2, without the original calibration set samples from Experiment 3. Similar to the previous recalibration experiment, the calibration was conducted five times with the individual folds from the 5-fold stratified split of the external test set. The remaining four folds in each case were kept as the external test set, and the means of the conformal prediction metrics were calculated on the predictions for these remaining instances. The conformal prediction metrics with standard deviations for this experiment can be found in ESI Table 6.†
Almost all the models again showed improvements in efficiency for both the “Permeable” and “Non-permeable” classes, except for the “2016 Furukawa9” case (Fig. 6B). For the rest of the models, we see a clear increase in efficiency, with the models providing more single-label predictions. Moreover, the validity scores were either preserved or, in some cases, improved for both classes. The validities of the “Permeable” class generally increased, as did the efficiency, indicating that the models provided both more valid and more efficient, or simply more accurate single-label, predictions for the external test set samples. The “Non-permeable” class generally showed no change in validity but a comparably greater increase in efficiency relative to the traditional calibration method in Experiment 3 (Fig. 6B). Therefore, the models' performance on both classes was strengthened compared to the prediction results from Experiment 3.
Improving the efficiency of the models' predictions without sacrificing validity was demonstrated in this experiment by using a subset of the external test set alone as the calibration set to recalibrate the models. This recalibration strategy was able to extend the reliability of the predictions to never-before-seen applicability domains, under the assumption that the training data and the external test data differ noticeably.
The span of the validities for both classes is more similar for both recalibration strategies compared to our baseline. The models' predictive performance was balanced between the binary classes when we diverged from the traditional calibration method, which uses only the calibration set parsed from the training set. Additionally, there was consistency between the medians of both recalibration strategies, while the spread of the validity scores was narrower for the predictions coming from models calibrated by the external test set alone.
In Experiments 4 and 5, the validity scores were compared with the values from Experiment 3 after all scores were capped at 0.8, the pre-defined confidence level. The validity scores for the first recalibration strategy spread over a range around 0.8 and fell below the baseline experiment for the “Non-permeable” class, whereas the validities for the second strategy were all above 0.8. This implies that the 0.8 threshold was exceeded for every model and external test set in each holdout case for the latter recalibration strategy. Therefore, the models calibrated with the external test sets alone were able to protect the validity from being compromised in exchange for better efficiency. The trade-off between efficiency and validity was better optimized when the calibration set was only representative of the extrapolated applicability domain. Considering dataset sizes, “2016 Furukawa9” and “2013 Chugai23” were smaller and, thus, the 20% subsets of these cases entailed smaller calibration sets. Since the major recovery of the validity scores between the two experiments occurred for these holdout cases, the influence of the original calibration set was clearly more prominent when the augmentation was performed with smaller datasets.
The predictive models built with these cross-validation strategies generally yielded high efficiency (>0.7) and high validity (>0.8) for both classes (Table 2). The high efficiency and validity show that the training, calibration and test sets were generated in an exchangeable manner. The training data now, to a large extent, come from the large data sources, and the external test sets utilized in the previous experiments as holdout cases are now placed in the training set. Therefore, the training and test sets are more representative of the entire chemical space covered, contributing to the generalizability of the model.
All metrics were computed at significance level = 0.2.

| Split cases | Model name | Efficiency 0 | Efficiency 1 | Validity 0 | Validity 1 | Balanced accuracy | Precision | Sensitivity | Specificity | MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | RF | 0.80 | 0.82 | 0.80 | 0.84 | 0.78 | 0.87 | 0.80 | 0.75 | 0.54 |
| Baseline | XGBoost | 0.86 | 0.85 | 0.83 | 0.83 | 0.80 | 0.89 | 0.80 | 0.80 | 0.57 |
| Baseline | LightGBM | 0.74 | 0.76 | 0.85 | 0.85 | 0.80 | 0.89 | 0.80 | 0.79 | 0.57 |
| Baseline | SVM | 0.77 | 0.77 | 0.88 | 0.84 | 0.82 | 0.91 | 0.79 | 0.85 | 0.60 |
| Split on data sources | RF | 0.77 | 0.79 | 0.89 | 0.86 | 0.84 | 0.92 | 0.83 | 0.86 | 0.66 |
| Split on data sources | XGBoost | 0.78 | 0.77 | 0.87 | 0.86 | 0.82 | 0.91 | 0.81 | 0.83 | 0.62 |
| Split on data sources | LightGBM | 0.86 | 0.80 | 0.87 | 0.82 | 0.81 | 0.90 | 0.77 | 0.84 | 0.59 |
| Split on data sources | SVM | 0.72 | 0.66 | 0.91 | 0.85 | 0.82 | 0.92 | 0.76 | 0.88 | 0.62 |
| Split on canonical groups | RF | 0.76 | 0.76 | 0.87 | 0.84 | 0.81 | 0.91 | 0.79 | 0.83 | 0.59 |
| Split on canonical groups | XGBoost | 0.80 | 0.82 | 0.80 | 0.84 | 0.78 | 0.87 | 0.80 | 0.75 | 0.54 |
| Split on canonical groups | LightGBM | 0.86 | 0.85 | 0.83 | 0.83 | 0.80 | 0.89 | 0.80 | 0.80 | 0.57 |
| Split on canonical groups | SVM | 0.74 | 0.76 | 0.85 | 0.85 | 0.80 | 0.89 | 0.80 | 0.79 | 0.57 |
Case 2: split on data sources shows similar efficiency scores and slightly higher validity scores than Case 1: baseline. This yields more “Both” class predictions, showing that the model learns the chemical space more comprehensively and can be leveraged for more single-label predictions by lowering the significance level. Moreover, the rest of the explored metrics show similar results between the baseline and the split methodologies employed in the earlier experiments. This raises the question of whether all the data sources are distinctly diverse from each other. To ensure that the chemical space is learned exhaustively during model building, we explored a third cross-validation scenario in which the data points were grouped according to their stereochemical counterparts. Removing the stereochemical information from the structures of the cyclic peptides during canonicalization allowed a new grouping strategy: cyclic peptides that were stereochemical variants of each other were kept in the same folds. The models were built with cross-validation in which the folds contain diverse analogous peptide sequences, disregarding which data source they were part of. The peptides that were not grouped were later mixed in equal proportions into the generated folds. Models built with this setup, Case 3: split on canonical groups, were expected to be cross-validated on a harder task and thus to explore the chemical space more comprehensively during training. The efficiency on both the “Permeable” and “Non-permeable” classes shows similar scores, slightly higher for the “Permeable” class, for the models of this case compared to the models from Case 2. However, the validity of the binary classes was on par with or slightly lower than Case 2, although still above 0.8, the pre-defined confidence level. These results indicate that the Case 2 models would need a lower significance level to be utilized for efficient predictions, whereas the Case 3 models can be used for more single-label predictions with a slight drop in validity. Since both cases exhibit similar performance overall, considering the trade-off between efficiency and validity, the models learned the chemical space similarly and could assign accurate and reliable single-label predictions at the set uncertainty level regardless of the splitting strategy.
Across the cross-validation strategies, the comparison is not based on models trained on the same training set or predicting the same test set. Therefore, one might question the fairness of such a comparison. However, the aim is to examine how confident these models are with respect to the conditions they were trained and validated on, as well as to highlight the importance of considering the diversity of the descriptor space in addition to focusing only on data sources. The cross-validation split strategies were introduced to provide a starting point for different model building processes. In addition to the calibration set selection, the predictive models built on different data sources with the proposed methodologies can be used to produce reliable predictions for their respective applicability domains. Furthermore, these models can also be extrapolated to uncharted applicability domains with good performance.
When building a predictive model, the data is split into training and validation datasets spanning overlapping or similar domains. The model trained on the training set is then typically used to predict the validation set to assess the model's performance. Even though this shows that the model performs well on the applicability domain spanned by the training set, it does not establish the model's generalizability to its full potential. In our first experiment, we showed that the trained permeability predictors perform well, with high balanced accuracy and MCC, on a validation set spanning a chemical space similar to the training set. However, these models did not exhibit the same predictive power on unseen data from a new data source, different from the learned chemical space. The performance metrics in Experiment 2 indicate that the models' predictions become closer to random, with a drastic decline in all metrics, a balanced accuracy around 0.5 and an MCC around 0. Since the class imbalance in the training data and the external data differ, the model provides predictions following the distribution it learned, in which the “Permeable” class dominates, rather than the learned intricacies of the chemical space. This was reiterated when we explored the reliability of the models' predictions through conformal prediction in Experiment 3. The conformal prediction methodology allows the uncertainty to be investigated at a user-defined confidence, where the predictions are evaluated on how valid and efficient they are. We calibrated our models with the validation set and predicted the external test sets. At the 80% confidence level, the predictions had poor efficiency scores but not inflated validity. This suggests that the models were not able to distinguish between the classes, as a good portion of the peptides were predicted as the “Both” class. The first three experiments highlight the importance of understanding the applicability domain and the generalizability of a model before relying on any predictions on new peptides in the design-make-test-analyze cycle. If one wants to obtain efficient predictions, the only option for these models would be lowering the required confidence level, whereby the validity becomes compromised. This compromise puts the practical applicability of the model at risk. To mitigate this risk, we evaluated new calibration strategies to make the external data more exchangeable with the calibration set and thereby re-assessed the reliability of the models.
Experiments 4 and 5 focus on two calibration strategies in which the underlying trained models were kept constant and the confidence defined for the models was not sacrificed. The first recalibration strategy augmented the calibration set from Experiment 3 with a subset of the external test set, and the second employed a portion of the external test set alone as the calibration set. Both recalibration strategies extended the reliability of the models' predictions by improving the exchangeability of the calibration and test sets. This enhanced the models' efficiency without necessarily sacrificing the validity of the predictions compared to the original calibration strategy. We conclude, in Experiment 6, that using only a portion of the external test set to calibrate the model resulted in more reliable predictions, as the validity was preserved or improved across all the external test sets of different sizes and class imbalances. Additionally, the importance of the choice of calibration set was evident, as the influence of the original calibration instances decreased the validity and, in turn, the reliability for both classes. Consequently, we have established a proof-of-concept study for building uncertainty-aware predictive models for peptides and described the methodology through permeability prediction. In real-world use, predictors trained on public data can be used to predict proprietary or more recent public data by characterizing a small portion of the targeted chemical space. The newly characterized data can be leveraged as the calibration set to recalibrate the model, which can then provide predictions on the remaining part of the data. This methodology enables full use of the model with confidence in its applicability domain. Additionally, the recalibration is less resource intensive, as the underlying models do not need to be retrained. Exploring additional factors in subsequent research, such as the sizes of the calibration or training sets, calibration sets with distinct compositions, or the choice of nonconformity measure, could provide new insights into the impact of variations in the conformal prediction framework on efficiency and validity.
In the final experiment, fold-splitting strategies in cross-validation were examined. As the data used in this study is a compilation from diverse permeability studies of peptides, the previous experiments assessed the reliability of building a model on data from these studies and predicting data from a different source. In these experiments, the cyclic peptides contained stereochemical information impacting their permeability; thus, chirality was not removed during preprocessing. Grouping based on the canonical representations, without chirality, resulted in groups of peptides with the same stereo-agnostic representation. Using this grouping as the fold-splitting strategy in Experiment 7 kept highly similar peptides from various data sources together during cross-validation. Models built with this strategy had similar balanced accuracy and validity, but slightly higher efficiency, compared to constructing the splits on the permeability labels, and their performance was on par with grouping on the data sources. As these models provide more informative, single-label predictions, they generalize better to the provided chemical space. This result also showcases the importance of considering the data sources in such cases as much as the structural commonalities during model building. Building on the insights gained, future studies could investigate how to assess the applicability domain across different types of molecules, such as whether a model trained on small molecules can be applied to peptides. Furthermore, modelling approaches such as transfer learning combined with conformal prediction could be explored to potentially expand a model's applicability through domain adaptation. The three-way similarities among the source, target and test domains, and their impact on model performance, would have to be carefully assessed by quantifying the reliability of the predictions.
Balancing the trade-off between efficiency and validity is essential for the practical application of predictive models to produce correct and informative predictions. After training a model that learns the provided chemical space, the choice of calibration set was shown to be important for extending the model's reliability to new domains of interest. The use of various algorithms and test sets in this study, through a set of experiments, suggests a generic approach for probing applicability domains. In conclusion, this study offers a methodology to attain a flexible applicability profile for predictive models, whereby the models can provide reliable membrane permeability predictions on unseen or uncharted peptidic chemical spaces.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00056k