Cayque Monteiro Castro
Nascimento
,
Paloma Guimarães
Moura
and
Andre Silva
Pimentel
*
Departamento de Química, Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro, RJ 22453-900, Brazil. E-mail: a_pimentel@puc-rio.br
First published on 21st July 2023
The local interpretable model-agnostic explanations method was used to interpret a machine learning model of toxicology generated by a neural network multitask classifier method. The model was trained and validated using the Tox21 dataset and tested against the Clintox and Sider datasets, which are datasets of marketed drugs with adverse reactions and drugs approved by the Federal Drug Administration that have failed clinical trials for toxicity reasons. The stability of the explanations is proved here with a reasonable reproducibility of the sampling process, making very similar and trustful explanations. The explanation model was created to produce structural alerts with more than 6 heavy atoms that serve as toxic alerts for researchers in many fields of academics, regulatory agencies and industry such as organic synthesis, pharmaceuticals, toxicology, and so on.
With the advancement of data science and big data, the availability of information about chemical structures has increased considerably. Filter structures with undesired physical–chemical properties in virtual libraries can reduce the universe from millions to a few thousand drug candidates.25 In addition, it is highly desirable to have filters for functional groups or fragments commonly considered as inappropriate26 to increasingly develop virtual screenings.27–31 Web servers and expert systems in structural alerts are highly developed in the literature,2,21,22,31–39 but interpretable machine learning models and model explainers are still not widely investigated with the intent to obtain structural alerts and toxic alerts.21
While expert systems in structural alerts can highlight the potential dangers of chemical substructures, automated methods with machine learning techniques and neural networks can be superior in terms of their predictive performance.3,40,41 It is important to note that, despite all the progress made by these tools, the correlation between the interpretation of toxicological data and the mechanism of action of the compound in the human body is still a challenge.5,22,31,32,40
Molecular featurization may effectively extract structural and chemical insights from any given drug using fingerprints42 and molecular descriptors3,22,29,30,32,43,44 with a prediction-based approach applied to large datasets. After doing that and building a model, it is usual to perform the dataset validation.5,24,45,46 This validation delivers a comprehensive view of the model performance using an unknown data.34,47 It is feasible to identify which features are the most striking in a model prediction.21,35,36 To obtain a full comprehension of machine learning models, its interpretation is highly necessary. Although it is difficult to understand why some predictions are incorrect while others are correct, local interpretable model-agnostic explanations (LIME)48 may be used to explain the model in a way that is understandable to humans. It is important to note that SHapley Additive exPlanations (SHAP)48–51 is more commonly used compared with LIME, but it is extremely relevant to apply the latter to understand its interpretation capacity. LIME was proposed in 2016.52 In the LIME proposal, LIME has been applied to image recognition and text classification. Since then, LIME has not been applied to toxicology as well as the method SHAP. Some tabular classification on non-related toxicology subjects has been described, but not published at all. As far as we know, LIME has not been used in toxicology yet, and its application to toxicology has not been scrutinized, especially for structural alerts. The innovation of our study is the application of LIME to generate structural alerts which is very important to the pharmaceutical industry and toxicology research. It is interesting that many developers of model explainability algorithms do not exactly understand why these algorithms make such explanations.53 Currently, there are many methods to enlighten the explanations of a model such as feature importance, local explanations, rule extraction, layer-wise relevance propagation, attention mechanisms, model distillation, and counterfactual explanations. These approaches can be used individually or in combination, depending on the specific requirements and nature of the problem. It is important to note that model explainability is an active research area, and new techniques continue to emerge, offering further possibilities to enhance the interpretability of machine learning models.
Although SHAP is more used, considered more robust, and slower than LIME,21 it is important to explore a simpler and faster method to determine structural alerts with different databases and test its limitations. LIME can approximate any black box machine learning model to a local interpretable model to explain each individual prediction, in addition to being an extremely popular explainer.52–54 However, it is believed that the search for a faster method with lower computational cost can be relevant and add novelty to the area of toxicology, data science and chemoinformatics. SHAP and LIME are reasonable methods to interpret models. In principle, SHAP is better than LIME because the first mathematically guarantees the consistency and accuracy of explanations. However, the model agnostic implementation of SHAP (KernelExplainer) is much slower than that of LIME. For large datasets such as Clintox, Sider, and Tox21, SHAP is computationally more expensive than LIME to use the entire dataset. As we must rely on approximations, this reduces the accuracy of the explanation. For example, SHAP explains the prediction deviation from the expected value calculated using the training dataset. Depending on that, it is faster to calculate the expected value using a specific subset of the training set as compared to the use of an entire training set. With the subset of the training set, this considerably reduces the accuracy of the explanation. Therefore, we did not compare SHAP and LIME because LIME is much faster than SHAP for the purpose of our study. Although LIME has been widely explored recently in interpreting the severity of patients diagnosed with COVID-19,55 the consistency and instability of LIME, which is a target of criticism,56 must be analyzed in more studies for advancement in the subject.
Here, a comprehensive outline is given of the explainability of toxicity models of drugs created using machine learning. The purpose is to explain how these models may in fact be remarkably explainable. This study shows advances in the field because LIME51 is used to interpret toxicology models of drugs using the Tox21 dataset57 of qualitative toxicity measurements on biological targets, including nuclear receptors and stress response pathways. Then, the explained model is applied to the SIDER dataset of marketed drugs with adverse reactions58 and to the Clintox dataset of drugs approved by the Federal Drug Administration (FDA)57 that have failed clinical trials for toxicity reasons. From these interpretations, a list of unwanted substructures, i.e., a structural alert list, is built to filter possible toxic drugs from large libraries, which avoids waste of time and effort doing useless synthesis and research of compounds that may fail in clinical trials for toxicity or undesirable reasons. It is always good to remember that “alerts are just alerts” and should be seen as hypotheses for undesirable mechanisms and never as rules.24
Under the assumption of an explainer being trustful and interpretable, LIME minimizes the following explanation model equation:
The random uniform sampling for local exploration is performed from x to create a full training data set, i.e., create multiple z from a single row of x, which is weighted by πx to be focused on z data closer to x. The sparse linear explanation assumes that (1) g(z) = wz; (2) the locality-aware loss is a square loss; and (3) the proximity weighing for the samples is:
In the first step of the method, a column was created in the data frames dfclintox, dfsider, and dftox21 using the AddMoleculeColumnToFrame function of the RDKit library60,62 in which each value corresponded to the sketch of the molecular structure of the corresponding SMILES representation (ROMol structure).68 Thus, Morgan fingerprints71 were generated as a vector of bits for all molecules of the data frames using the rdFingerprintGenerator.GetFPs class to create a column. As this command works only when there is a structure drawn in the ROMol column, it was therefore necessary to previously remove the lines of all smiles that could not be transformed into ROMol structures.
The droppingIdenticals function was created to calculate the Tanimoto similarity of a given smile representation to compare with all molecules in the Tox21 dataset. All molecules that had similarity equal to 1.0 were eliminated from the Tox21 data frame. Finally, the droppingIdenticals function returned two data frames that contained all molecules in common between the Tox21 dataset and the Clintox and Sider datasets.
To ensure that there were no improper deviations, the receiver operating characteristics (ROC) curve (AUC) statistical method was used to calculate the accuracy score between the model and the test.73 The ROC curve showed how well the model could distinguish between two binary labels (0 for non-toxic or 1 for toxic). The best model can accurately distinguish the binomial. Thus, to simplify the ROC analysis, the area under the ROC curve is nothing more than a way of summarizing the ROC curve into a single value, aggregating all ROC thresholds, calculating the area under the curve. The ROC curve is generated by changing the threshold between classes; the AUC value is not dependent on the threshold. That is, above this threshold, the algorithm classifies in one class and below in the other class. The higher the AUC, the better. An AUC equal to 0.5 indicates a random prediction result, and AUC equal to 0 indicates a perfectly reversed prediction (the prediction values of all positive samples are below those of negative samples). The interesting thing about the AUC is that the metric is scale invariant, as it works with classification accuracy instead of their absolute values. In addition, it also measures the quality of the model predictions, regardless of the classification threshold.
At the end, the classification data of each active molecule for each task is exported. The number of times a fragment is found in molecules that are predicted to be toxic, and how much that fragment contributes to the toxicity of those molecules, are two important factors in understanding how toxicity is predicted. The data is grouped by task. For each of these groups, dictionaries are obtained with the counting frequency of each fragment and the total sum of associated weights for each fragment. Then, these dictionaries are broken up in such a way that the desired data frame is obtained. In addition, the new column with active molecules is created in this same data frame and the index is that of the fragment for a given group. Thus, a list is returned with all molecule ids belonging to the group. Finally, RDKit resources62 are used to highlight the fragment and visualize them in the associated molecule.
All molecules that have been tested against the epidermal growth factor receptor (EGFR) kinase are downloaded by using the ChEMBL web resource client.75 Then, the resource objects for API access are created to get the target data using the UniProt ID P00513.76 The target data is fetched and downloaded from ChEMBL and the result of the query is stored. After checking the entries, the CHEMBL203 entry is selected as the target of interest, which is a single protein of the human EGFR, also named erbB1. The bioactivity data for the target of interest is fetched and only human proteins, bioactivity type IC50, exact measurements, and binding data (assay type ‘B’) are considered. Finally, the query set is downloaded in the form of a Pandas data frame.
The bioactivity data is preprocessed and filtered by converting the datatype of standard value from object to float, deleting entries with missing values, keeping only entries with standard units (nanomolar), deleting duplicate molecules, resetting the data frame index, and renaming the columns. The molecular structures are linked to respective bioactivity ChEMBL IDs. Thus, the compound data are also preprocessed and filtered by removing entries with missing entries, deleting duplicate molecules, and getting molecules with canonical SMILES.68 Then, the values of interest found in bioactivities_df and compounds_df data frames are merged in the data frame output_df based on the ChEMBL IDs of compounds.
The Lipinski rule of five77,78 and PAINS filtering26 were applied to the EGFR dataset (in output_df data frame) for compliancy as already implemented in RDKit.60,62 The number of compounds with and without PAINS was obtained. Then, an external list provided by Brenk et al.25 was used to further filter the EGFR dataset to get the substructure matches and search the filtered data frame for these matches with the unwanted substructures. Then, the structural alert list found here was applied to filter the EGFR dataset. The underlying SMARTS patterns79 in these fragments were used to highlight the substructures within the filtered molecules using RDKit.60,62
From the multitask classifier function from the DeepChem library,60 12 tasks and 1024 features using the ECFP fingerprint were classified to create a multitask classification model that had the best loss function after 200 to 300 epochs as part of the hyperparameter optimization. The featurizers MACCS,80 MAT,60 PubChem,60 and TokenizerSmiles81 were tested as part of the hyperparameter optimization but all results were like those found by the ECFP featurizer so that only this last one is presented here. Table 1 presents the results regarding the metrics obtained in the model validation and testing. It is noted that the data are reasonably promising because the accuracies of the model validation and testing are in the range of 0.71–0.77 in both Clintox and Sider datasets. This means that the model is considerably accurate and coherent for the two test datasets, not showing significant overfitting or underfitting.
Dataset | Run | Validation set | Testing set |
---|---|---|---|
Clintox | #1 | 0.719 | 0.750 |
#2 | 0.746 | 0.765 | |
#3 | 0.735 | 0.761 | |
Sider | #1 | 0.731 | 0.721 |
#2 | 0.731 | 0.731 | |
#3 | 0.730 | 0.727 |
Using the LIME explainer module, it is important to mention that all the values of kernel width tested to explain the trained model did not yield unstable results. It was also verified that the fragments produced in the explanation are mostly similar for the different kernel values tested. Therefore, we decided to use the default value suggested by LIME developers.52 Another important attempt to use only meaningful results in the explanation is the elimination of unimportant fragments. For the sake of conciseness, we decided to eliminate some unimportant fragments with small or negative weights by using the average of all weights. We also attempted to use only the maximum weight or the maximum and minimum weights, but the fragments generated in both attempts were one third of those produced by using the average of weights. Mathematically, the elimination of fragments with negative or small weights seems to be incorrect because we are eliminating fragments in the model that may cause the toxic behavior in a molecule to vanish. So, it makes sense to only look at the maximum and minimum weights mathematically. But chemically speaking, the fragment with minimum weight (non-toxic fragment) does not make the behavior of a fragment with maximum weight (toxic fragment) vanish. At the end, what matters for the behavior in a molecule is the toxic fragment that justifies the use of this cutoff.
Ideally, the molecules to be explained should be only those predicted to be toxic by the model as explained in the Methodology section. However, this is not exactly what is shown in Tables S1 and S2,† which expose a certain number of non-toxic molecules. Despite being considered toxic compounds by the model with a probability of at least 0.8 for a given task, this is not correctly calculated when analyzed individually. This is because the final balance of the sum resulted in negative numbers when considering all the toxicity weight contributions of their respective fragments. For example, Table S1† shows that of the 61 Clintox molecules predicted to be toxic for this task, only 56 are toxic in the NR-AR-LBD task, which gives an accuracy of more than 91% in a separate run. The sum of all weights was less than zero for the other five, even with positive toxic contributions. However, it is noticeable that most molecules that entered the classification were correctly predicted. Thus, the appearance of non-toxic molecules in the classification table is acceptable. Therefore, the model does not have perfect accuracy as observed in the previous metrics (Table 1). Table S2† shows an example of one of the leaderboards produced in the classification of Clintox molecules for task SR-p53. As observed, there are seven molecules classified as toxic (78%) and two molecules as non-toxic (22%) in this separate run. This classification is made using the model results calculated in LIME. It is determined by the balance of toxicity and non-toxicity. It is important to note that the difference between toxicity and non-toxicity for some compounds is subtle. For these molecules, it is not possible to conclude that molecules 190, 216, and 252 are very toxic. It is also important to note that no fragment of any molecule contributed to the toxicity of task NR-PPAR-γ in this separate run. Table 2 presents the number of molecules classified as toxic or non-toxic for each task in Clintox and Sider datasets in three different runs. It also shows that none of the fragments of any molecule contributed to the toxicity of tasks NR-PPAR-γ, SR-MMP and SR-p53 for all runs presented in this study. However, it is important to reaffirm that the model predicts with reasonable accuracy with an acceptable reproducibility as it was repeated a certain number of times (more than shown here).
Run | #1 | #2 | #3 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Dataset | Clintox | Sider | Clintox | Sider | Clintox | Sider | ||||||
Task | Toxic | Non-toxic | Toxic | Non-toxic | Toxic | Non-toxic | Toxic | Non-toxic | Toxic | Non-toxic | Toxic | Non-toxic |
NR-AR | 67 | 0 | 57 | 0 | 73 | 0 | 62 | 0 | 73 | 0 | 52 | 0 |
NR-AR-LBD | 55 | 2 | 47 | 1 | 61 | 1 | 46 | 2 | 59 | 2 | 41 | 1 |
NR-AhR | 5 | 0 | 2 | 1 | 3 | 3 | 4 | 0 | 3 | 1 | 2 | 1 |
NR-Aromatase | 3 | 1 | 4 | 1 | 4 | 3 | 2 | 1 | 0 | 6 | 2 | 1 |
NR-ER | 30 | 4 | 30 | 6 | 35 | 4 | 31 | 6 | 34 | 2 | 26 | 6 |
NR-ER-LBD | 22 | 0 | 21 | 5 | 26 | 1 | 23 | 1 | 23 | 2 | 17 | 3 |
NR-PPAR-γ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
SR-ARE | 11 | 3 | 9 | 4 | 7 | 10 | 9 | 4 | 9 | 6 | 8 | 4 |
SR-ATAD5 | 2 | 0 | 2 | 0 | 2 | 1 | 2 | 1 | 2 | 0 | 2 | 0 |
SR-HSE | 1 | 1 | 1 | 1 | 3 | 0 | 1 | 1 | 3 | 0 | 1 | 0 |
SR-MMP | 0 | 0 | 0 | 0 | 12 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
SR-p53 | 0 | 0 | 0 | 0 | 7 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
Table S3† shows only the first two lines of one of the data frames generated from the Sider database. These two fragments that contributed to the toxicity in the biological target NR-AR (task 0) are presented according to the model. In this case, despite the CC(C)O fragment having a lower frequency in the task than the CC(C)(C)C fragment, the total weight of the toxic contribution is higher. In the last column of Table S3,† it is possible to visualize the highlight where such fragments are found in one of the molecules activated by them. For simplicity, the fragment was highlighted only on the first active molecule in the list presented in the previous column (active molecules). Table S3† may lead to a wrong interpretation that fragments that have the largest contribution are those with high frequency in the task. The very small and simple fragments are found in most molecules, so their frequencies are very high in the task. Fig. S1 and S2† present the results after running LIME. They show the small fragments (less than 7 heavy atoms) that influenced the increase in toxicity for the 12 tasks all together in the Clintox and Sider databases, respectively, in terms of counting the frequency of appearance in the analysis. It is interesting to note that most of the fragments found in Fig. S1 and S2† are very simple structures. Because of that, it was decided to pay more attention to the number of heavy atoms of each fragment. It was noticed that the fragments with less than 7 heavy atoms are mostly saturated and unsaturated hydrocarbon fragments, and to a lesser extent, oxygenated and nitrogenated groups. Most importantly, these fragments are also part of the larger and complex fragments with more than 6 heavy atoms that have a low frequency in the task. Thus, it was decided to filter the small fragments from the list, keeping only the fragments with more than 6 heavy atoms to avoid redundancy. The most solid evidence supporting this choice is that the cutoff is in the center of the range. Although this cutoff is easily changed by the user, we tested the cutoffs from 4 to 10. It seems the lower end of the range gives too many fragments, and the higher end gives only a few fragments. So, the middle of the range seems to be the best choice.
Structural alerts provide the basis for grouping compounds into categories that can allow comparisons. In addition, they must be progressively developed to allow the understanding of the mechanism of action of chemical compounds.82 Some scientific evidence can be shown that supports our results. Furans, phenols, nitroaromatics, and thiophenes are found to be toxic alerts in the literature,5 but only some alcohols, nitro compounds and sulfur compounds were found in our study. Li and collaborators83 investigated toxic fragments by frequency analysis and identified some substructures present in highly toxic and moderately toxic compounds. Some of these substructures were found by LIME, such as an alkylfluoride, sulfenic derivate, phosphonic trimester, cyanohydrin and nitrile. Yang and collaborators22 compared small toxic radicals obtained by different techniques (such as machine learning, graphs and expert systems). Many of these alerts were also found by LIME (such as NO2, NN and R–O–X). In addition, this work also shows figures with highlights obtained by different techniques. Most of these substructures were also found by LIME, such as alkyl radicals. Lei et al.84 presented some toxic fragments generated from a database of acute oral toxicity in rats such as alkylfluoride and amines. These fragments are also generated by LIME using the Tox21/Clintox/Sider datasets. Our methodology also found structural alerts with halogenated compounds that are found in studies of endocrine disruption with androgen and estrogen receptors.85 Our study also found nitrogen compounds that are mentioned as structural alerts in the literature.86
It is important to point out that some issues could not be interpreted by LIME. It does not recognize two identical structures with different SMILES representations. For example, CCC[C@H](C)C and CCCC(C)C are the same structures for the NR-ER and NR-ER-LBD tasks in Fig. S1 and S2,† respectively. LIME does not recognize the difference between these two SMILES because these representations are not canonical. Some other similar results appeared during the analysis, but they are not shown here for the sake of simplicity. Sometimes, new molecules are randomly sorted in the splitting of the original dataset that slightly changes the amount that a fragment appears and/or its toxicity contribution. This is probably because the RandomSplitter class was used to train the machine learning model and cross-validate the original database. Because of that, LIME was run several times (three repetitions are shown here, but it was done with many more repetitions for consistency), giving robust results as presented in Fig. 2 and 3. They show 109 and 113 fragments with more than 6 heavy atoms for the Clintox and Sider datasets, respectively. Although LIME interpreted the datasets finding some structural alerts that are present in other webservers or software, it also generated structural alerts that would be difficult for human beings to suggest by expert knowledge.23,48 It is important to mention that this number of heavy atoms is somewhat arbitrary. However, several runs were performed changing the number of heavy atoms from 4 to 8 heavy atoms. It was found that 6 heavy atoms are reasonable because the largest fragment usually has around 13 heavy atoms, e.g., right in the center of the range. Fig. S3 and S4, and S5 and S6† present the results of Runs #2 and #3, respectively, to confirm the reasonable LIME reproducibility. The number of fragments of each run for the Clintox and Sider datasets is shown in Table 3.
Fig. 2 Structural alerts that influenced the increase in toxicity for all the 12 tasks in the Clintox dataset (Run #1). The order does not represent the toxic influence of each fragment. The other two runs are presented in the ESI.† |
Fig. 3 Structural alerts that influenced the increase in toxicity for all the 12 tasks in the Sider dataset (Run #1). The order does not represent the toxic influence of each fragment. The other two runs are shown in the ESI.† |
Dataset | Run | Unwanted fragments | Number of found unwanted substructures | Number of compounds without unwanted substructures |
---|---|---|---|---|
Clintox | #1 | 109 | 147 | 4183 |
#2 | 121 | 163 | 4174 | |
#3 | 113 | 110 | 4218 | |
Sider | #1 | 113 | 237 | 4165 |
#2 | 133 | 311 | 4123 | |
#3 | 114 | 380 | 4043 |
Table 3 also shows the results of the filtering of likely unwanted fragments obtained using the ECFP fingerprints in the EGFR dataset with compounds that show high binding affinity to EGFR. The dataset has 4266 compounds with RO5 and PAINS compliance, e.g., after Lipinski rules of 5 (RO5) and PAINS filtering were applied to the original EGFR dataset. The number of unwanted substructures found by LIME and the number of compounds without unwanted substructures are presented to confirm the filtering of structural alerts in the EGFR dataset. Fig. 4 and 5 present the structural alerts generated by using LIME. These structural alerts are highlighted in the first active molecule of some tasks for the Clintox and Sider datasets (Run #1 in each dataset). Fig. S7 and S8, and S9 and S10† show the other two runs (Runs #2 and #3) for the Clintox and Sider datasets, respectively, which are presented in the ESI.† As observed, LIME interpreted the datasets and found some structural alerts. It would be difficult for human beings to propose them. However, unique knowledge-based expert web-based platforms or software may be used for suggesting structural alerts for toxic compounds that may cause adverse drug reactions.5,6,48,87,88 It is important to mention that most structural alerts are present in different runs, and some are like each other, showing the LIME robustness.
Fig. 4 Representation of several fragments highlighted in the first active molecule of some tasks for the Clintox dataset (Run #1). The other two runs are presented in the ESI.† The number in the first column is the index in the data frame. |
Fig. 5 Representation of several fragments highlighted in the first active molecule of some tasks for the Sider dataset (Run #1). The other two runs are shown in the ESI.† The number in the first column is the index in the data frame. |
The potential of the structural alerts found here in this study was assessed based on the frequency of fragments interpreted using the toxicological information of thousands of compounds found in Tox21, Clintox, and Sider datasets. From the point of view of statistics, it is necessary to perform the search in a large dataset with hundreds or thousands of compounds to make the frequency of fragments significant. Therefore, the structural alerts found in this investigation were used to filter 50 to 250 chemical compounds in the large dataset with 4266 EGFR inhibitors. Thus, the structural alerts proposed here are suitable to filter chemical compounds as toxic alerts. However, it is important to emphasize that the structural alerts are only toxic alerts to flag potential toxic compounds and serve to help organic synthetic and pharmaceutical researchers on toxicology issues of complex chemical compounds in both academics and industry.
The attachment in the end of the ESI† shows key structural alerts found here using datasets of marketed drugs with adverse drug reactions58 and qualitative datasets of drugs approved by the FDA57 that have failed clinical trials for toxicity reasons. Many identified fragments for toxic alert were similar and sometimes identical within the 12 tasks in both Clintox and Sider datasets, showing the robustness of the method. However, it is important to mention the LIME limitations as follows.
Meanwhile, the fragments CCC[C@H](C)C and CCCC(C)C were interpreted as different substructures. However, they both may be part of the same molecules, and thus it is possible to indicate they are redundant fragments. Unfortunately, the explainable model used here to get structural alerts does not consider or is not able to solve this issue. LIME can generate dozens of substructures that are probably related, but further work needs to be performed to eliminate the redundant substructures. The other methods below also present issues on redundancy.
Another issue in structural alerts is the existence of two substructures in the same molecule. LIME does apply to this issue. There are some methods that also consider the existence of two or more substructures in the same compound.40 The emerging pattern40 is a method widely used, where a molecular pattern is identified as a set of molecular fragments. The jumping emerging pattern89 is a mining algorithm to find the patterns assigning atom pairs as descriptors. Then, the emerging pattern mining method uses the contrast pattern tree algorithm to find toxic features.90 Another issue may probably come from a structural alert generated from a dataset with false positives. The explainable model to find structural alerts may not correctly interpret the effect of existing redundant structural alerts in a set of false positive compounds. Although some methods exist, avoiding false positives in datasets is still challenging. Therefore, the future development of explainable models to find structural alerts should consider the generalization of specific structural alerts to avoid redundancy.
It is important to mention about the stability and consistency of LIME. The canonical SMILES representation used here ideally guarantees exclusive codes to uniquely define molecules.68 Despite this, molecules with different SMILES representations can be the same in practice. This occurs especially in 2-hydroxy-2,3-dimethylpentane, 2,2-dimethyl-3-hydroxypentane and 2-hydroxy-3-ethane-2,3,4-trimethylpentane which appear in the results of Clintox and Sider datasets. It is important to point out that these structural alerts are fragments of molecules. Although they are the same, they can be found in different spatial dispositions, being understood by LIME as different fragments. Fig. 6A shows the repeated structures in the result for the Clintox dataset (Run #1). For Runs #2 and #3, the repeated structures are shown in Fig. S11 and S12 in the ESI,† respectively. For the Sider dataset (Run #1), the repeated structures are shown in Fig. 6B. Runs #2 and #3 have the same repeated structures as compared with Run #1.
Fig. 6 Representation of repeated structures in (A) Clintox and (B) Sider datasets (Run #1). The representations of the other two runs are shown in the ESI.† |
From the SMARTS patterns (available in the end of the ESI†), it is possible to verify the similarity between the three runs of Clintox dataset and the three runs of Sider dataset only to analyze the consistency of the LIME model applied in these datasets. Taking the 109 structural alerts generated by the Clintox dataset (Run #1), 98 appear in Run #2 and 89 are repeated in Run #3, corresponding to 90% and 82% similarities, respectively. Furthermore, Runs #2 and #3 have 96 similar structural alerts, corresponding to 79% and 85% of their structures, respectively.
Analyzing the 113 structural alerts generated by Run #1 in the Sider dataset, 96 structures are repeated in Run #2 (85% similarity) and 85 in Run #3 (75% similarity). In addition, Runs #2 and #3 have 95 similar structural alerts, corresponding to 71% and 83% of their structures, respectively. Finally, it is important to analyze all structures generated for the Clintox dataset (Runs #1, #2 and #3), excluding the repeated ones (total of 143 alerts), and for the Sider dataset (Runs #1, #2 and #3), excluding the repeated ones (total of 160 alerts). It is observed that there are only 38 non-repeated structures, showing high similarity and consistency between both results. For better visualization, Fig. S13–S15† present these data discussed here.
One example of the proof-of-concept experiments will be predicting if a molecule that contains mutagenic fragments such as an aromatic ring, nitro group, nitrosamine group or an epoxide group would be caught by LIME. Different from toxicity tasks, these experiments should have a very clear ground truth for explanations. LIME should highlight these groups and it is a perfect experiment to see if LIME could pick the correct structures in such tasks. To prove this concept, we performed a preliminary LIME analysis using the Bursi mutagenicity dataset. It is a reasonably large data set, containing 4337 molecular structures with the relative Ames test results. We found many mutagenic fragments, including the C–CC group found in the aromatic ring, nitrosamines, (poly)halogenated compounds, thiols, nitro compounds, epoxides, etc. It is important to mention that all these groups are very well recognized as mutagenic. The top-5 mutagenic compounds found in this preliminary experiment are presented in Fig. 7.
Fig. 7 The top-5 mutagenic compounds found in the LIME analysis applied to the Bursi mutagenicity dataset choosing fragments larger than one heavy atom. |
LIME | Local Interpretable Model-Agnostic Explanations |
SHAP | SHapley Additive exPlanations |
FDA | Federal Drug Administration |
ECFPs | Extended Connectivity FingerPrints |
EGFR | Epidermal Growth Factor Receptor |
RO5 | Lipinski Rule of Five |
PAINS | Pan-Assay Interference compounds |
NR | Nuclear Receptor |
SR | Stress Response |
NR-AR | Androgen Receptor using the MDA cell line |
NR-AR-LBD | Androgen Receptor Ligand Binding Domain |
NR-AhR | Aryl hydrocarbon Receptor |
NR-Aromatase | Aromatase enzyme |
NR-ER | Estrogen Receptor α using the BG1 cell line |
NR-ER-LBD | Estrogen Receptor α Ligand Binding Domain |
NR-PPAR-γ | Peroxisome Proliferator-Activated Receptor γ |
SR-ARE | Antioxidant Response Element |
SR-ATAD5 | Luciferase-tagged ATAD5 in human embryonic kidney cells |
SR-HSE | Heat Shock Response |
SR-MMP | Mitochondrial membrane potential p53 response |
SR-p53 | p53 response |
RO | Receiver Operating Characteristics |
AUC | Area Under Curve Statistical Method |
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2dd00136e |
This journal is © The Royal Society of Chemistry 2023 |