Tiago Rodrigues*
Instituto de Investigação do Medicamento (iMed), Faculdade de Farmácia, Universidade de Lisboa, Av. Prof. Gama Pinto, 1649-003 Lisbon, Portugal. E-mail: tiago.rodrigues@ff.ulisboa.pt
First published on 30th March 2022
Small molecules remain a centrepiece in molecular medicine. Specific drug target recognition is an unavoidable requirement for their successful translation to the clinic. While testing chemical matter for affinity and potency is the mainstay, only more recently have the chemical biology and medicinal chemistry communities become more profoundly aware of potential attrition and development pitfalls linked to artifactual readouts. Auto-fluorescence, assay interference and colloidal aggregation are the most common sources of false positive hits in screening assays and may divert drug development pipelines toward unfruitful research avenues. In this Perspective, I dissect how computational tools, in particular machine-learning pipelines, can accelerate the development of chemical probes and drug leads by expediting the identification of liable chemical matter. Further, I conceptualize anomaly detection for chemical biology and highlight limitations to a more productive deployment of machine learning. By solving pressing challenges, one might gradually mitigate the impact of nuisance compounds in drug discovery.
The identification of viable starting points for elaboration in hit-to-lead and lead development campaigns is a focal point in drug discovery. Toward that end, screening assays must be set up,7 wherein the binding affinity or functional activity is assessed – initially at a single concentration and then in a follow-up concentration–response curve. It is worth noting that hits in primary assays do not always reflect true target recognition (e.g. ref. 8). To mitigate this, control and orthogonal assays with disparate detection technologies are executed.9 These, however, only partly confirm the ligand–target interaction and are best suited to rule out interference with either the detection method (e.g. fluorescence) or assay components (e.g. proteins, reporter molecules).
In parallel, strict windows for lipophilicity and molecular weight values are routinely accepted to guide the design of ‘drug-like’ entities with decreased likelihood of promiscuity/attrition and improved oral bioavailability.10 One may appreciate, however, that those rules are not sufficiently generalizable. Natural products remain the biggest source of inspiration for molecular design and approved drugs, but occupy a vaster property space than is commonly accepted.11 Readily discarding such chemotypes would have had unpredictable consequences in modern medicine. Taken together, not all structural liabilities can be identified through rule-based intuition, which constitutes one of the grand and unsolved challenges in small molecule development.
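As a minimal sketch of such a rule-based property window (using RDKit; the molecular weight and clogP thresholds below are the widely cited Lipinski-style cut-offs, chosen purely for illustration rather than endorsed here):

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

def inside_property_window(smiles, max_mw=500.0, max_clogp=5.0):
    """Flag molecules that fall outside a simple weight/lipophilicity window.

    Thresholds are illustrative Lipinski-style cut-offs, not values
    prescribed in the text above.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False  # unparsable SMILES
    return Descriptors.MolWt(mol) <= max_mw and Crippen.MolLogP(mol) <= max_clogp

# Aspirin sits comfortably inside the window...
print(inside_property_window("CC(=O)Oc1ccccc1C(=O)O"))  # True
# ...whereas a C30 alkane violates the lipophilicity limit.
print(inside_property_window("C" * 30))                 # False
```

Natural products, as noted above, frequently fail such checks despite being perfectly viable starting points, which is precisely why these windows cannot be applied blindly.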
While there is awareness regarding the perils of promiscuity, the underlying molecular bases have remained somewhat abstract and associated with unwritten rules of expert knowledge. The so-called frequent-hitters12 tend to appear as promising prototypical structures and modulators of unrelated targets, but should indeed be examined with caution. The mechanisms of interference can be very diverse. For example, cationic molecules can induce vesicle-like structures in cells and foamy membranes – a process called phospholipidosis.13 On occasion, trace-level metal impurities can also result in apparent bioactivity. This is an important realization given that palladium-catalysed reactions are among the most employed in medicinal chemistry.14,15 The most common interference mechanism is, however, the formation of aggregates, which can occur even in focused libraries with apparently sound structure–activity relationships (Fig. 1).16 On a physical level, colloidal aggregation results from the poor aqueous solubility of a given molecule. The resulting nano- or microscale particles can inadvertently denature proteins.17–19 Protein denaturation is thus the crux of false target modulation. Since that realization, dynamic light scattering (DLS) and paired biochemical screens (with and without surfactant) have become the gold standard for detecting anomalous biological readouts.20,21 Other technologies, e.g. nuclear magnetic resonance,22 surface plasmon resonance23 and centrifugation,24 may also be employed, but rapidly become cumbersome from an experimental and data analysis vantage point. It is here that in silico tools can expedite the identification of liable matter. From rule-based methods (e.g. PAINS,25 REOS26,27) to machine learning (ML), a multitude of workflows have been developed to flag, accelerate the detection of, and deprioritize less promising target effectors for medicinal chemistry elaboration.
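By way of example, rule-based flagging of the PAINS type can be reproduced with the substructure catalogues shipped in RDKit; this is a generic illustration of the approach, not the exact implementation behind any of the tools cited above:

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build a catalogue of the PAINS substructure alerts distributed with RDKit.
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog(params)

def pains_alerts(smiles):
    """Return the names of any PAINS alerts matched by a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return ["unparsable SMILES"]
    return [entry.GetDescription() for entry in catalog.GetMatches(mol)]

# A benzylidene rhodanine, a chemotype classically associated with PAINS alerts.
print(pains_alerts("O=C1NC(=S)SC1=Cc1ccccc1"))
```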
In this Perspective, I not only critically discuss the impact of frequent hitters – in particular those forming aggregates in aqueous media – but also dissect the data sources, molecular representations and ML algorithms employed for triaging them. Through select literature examples, I wish to expose unsolved challenges in the area and speculate on potential solutions, such as anomaly detection algorithms, that can inform molecular design. Ultimately, I aim to rekindle an interest in the digital chemistry community toward the nuisance detection issue in molecular medicine. Albeit receiving less attention than retrosynthetic planning,28 de novo design29 and others, confidently flagging problematic entities is challenging and relevant. If properly addressed, ML models can improve the quality of the chemical matter at hand and enable more productive discovery chemistry and biology programs.
Frequent-hitters tend to show up as hits in myriad biochemical or cellular assays due to either promiscuous behaviour12 – which is usually linked to inflated lipophilicity or the so-called “molecular obesity”31 – or interference with the detection method, as in the case of auto-fluorescence, quenchers, chelators, reporter stabilization or non-specific reactivity.8,32,33 Redox cycling molecules can also provide false positive hits, and flagging them is not a trivial task even with experimental methods.34 Building a correct intuition for classifying individual small molecules as legitimate or potentially problematic drug prototypes requires many years of hands-on medicinal chemistry experience and familiarization with multiple and diverse cases of each type. Indeed, molecular scaffolds provide the blueprints for promiscuity, but their decoration can largely influence the behaviour of small molecules in biological assays.35 Harnessing this knowledge through ML models is, however, not always straightforward. For example, there is a fundamental limitation in the prediction of frequent-hitter behaviour when its source is a metal impurity. Different compound batches may contain different trace amounts of metal, which ultimately influences the outcome of the biological assay. Aggregating data on a sample level and analysing hit rates for specific samples can be helpful, but it is not an easily generalizable solution.36 In line with ML routines for other applications, data sources are critical. Correctly preparing them and assigning labels to compounds is a bottleneck of such automated workflows and processes.37 In the absence of sufficient data, extensive experimentation may be required, which is sometimes incompatible with project timelines and needs.
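The sample-level aggregation of hit rates mentioned above can be prototyped in a few lines of pandas; the table and column names are hypothetical placeholders for a real screening deck:

```python
import pandas as pd

# Hypothetical screening log: one row per (sample, assay) primary readout,
# where `hit` marks an apparent active. Column names are placeholders.
screens = pd.DataFrame({
    "sample_id": ["S1", "S1", "S1", "S2", "S2", "S3", "S3", "S3"],
    "assay_id":  ["A1", "A2", "A3", "A1", "A2", "A1", "A2", "A3"],
    "hit":       [1, 1, 1, 0, 1, 0, 0, 0],
})

# Per-sample hit rates: a batch that lights up across unrelated assays
# (e.g. because of a trace metal impurity) stands out immediately.
hit_rates = (screens.groupby("sample_id")["hit"]
                    .agg(n_assays="count", hit_rate="mean")
                    .sort_values("hit_rate", ascending=False))
print(hit_rates)
```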
Leveraging corporate data from multiple drug development programs, Schneider and colleagues12 employed a balanced dataset containing 902 structures (479 substructurally diverse frequent hitters and 423 non-frequent hitters), wherein human drugs were assumed to be non-frequent hitters. Although a well-motivated approach, more recent studies20,38 have shown that even approved drugs can lead to false positive readouts, especially at typical high-throughput screening concentrations (10–30 μM). This highlights that context is key32 and that a truly generalizable ML model for nuisance prediction may be out of reach if several important variables, such as assay medium, type of measured endpoint and test concentration, are not accounted for. Nonetheless, featurizing all molecules with 120 atom type descriptors led to self-organizing maps that could correctly cluster ∼90% of the molecules according to their labels (Matthews Correlation Coefficient (MCC) = 0.8), and in which hydroxylated aromatic moieties were highly discriminative in the decision process.12 Similar performance was independently obtained with random forests and substructural fingerprints (Morgan 2, 1048 bits) using highly curated corporate datasets.39 Indeed, Engkvist and colleagues built predictive models to flag compounds according to the screening assay technology of interest. For example, redox cyclers are likely to interfere with AlphaScreens but may not interfere with other detection technologies. This shows that the identification of nuisances is a problem in a high-dimensional space, whose genesis may be not only structural but also technological.
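A bare-bones version of such a fingerprint-based classifier can be sketched with RDKit and scikit-learn; dataset loading is deliberately left to the reader, and the hyperparameters are illustrative rather than those of the published models:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import cross_val_predict

def morgan_fp(smiles, radius=2, n_bits=2048):
    """Morgan (ECFP-like) substructural fingerprint as a dense numpy array."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits))

def frequent_hitter_mcc(smiles_list, labels, n_estimators=500):
    """Cross-validated MCC of a random forest trained on Morgan fingerprints.

    `smiles_list` and `labels` (1 = frequent hitter, 0 = non-frequent hitter)
    stand in for a curated dataset such as the corporate sets described above.
    """
    X = np.array([morgan_fp(s) for s in smiles_list])
    y = np.array(labels)
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    y_pred = cross_val_predict(clf, X, y, cv=5)
    return matthews_corrcoef(y, y_pred)
```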
Despite the inherent heterogeneity of publicly available datasets, they may also provide information of sufficient quality to allow flagging promiscuous entities. As evidenced in the HitDexter tool,40,41 >300000 entities in PubChem annotated to >50 targets were used to build two extremely randomized tree models that discriminate between non-promiscuous and promiscuous/highly promiscuous compounds (Fig. 2). Different molecular representations provided MCC values between 0.44 and 0.61, with Morgan fingerprints (radius 2) being arguably the most effective for discerning patterns. It is worth noting that while promiscuous small molecules are typically undesirable, in some instances promiscuity may be a beneficial trait,42–44 provided that target engagement is elicited through a true molecular recognition mechanism. This is the case for several approved drugs that were predicted as promiscuous by HitDexter40 or other data analysis pipelines.45 Those results show that flagging systems serve the purpose of cautioning drug development, but are less suited to motivating the exclusion of specific small molecules from advanced elaboration. A recent HitDexter version (v3)46 extends its domain of applicability by employing a multilayer perceptron. Most interestingly, the improved models consider different types of data (target- and cell-based) for the predictions, which realistically covers all screening scenarios. On a test set comprising dark chemical matter, i.e. molecules that have been extensively tested against unrelated targets without showing any promising activity, HitDexter 3 was able to correctly identify highly promiscuous molecules in cell-based assays (MCC = 0.611) while using Morgan 2 fingerprints.
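How promiscuity labels might be derived from chemogenomic data can be illustrated as follows; the table and thresholds are hypothetical and only mimic, in spirit, the label definitions behind tools such as HitDexter:

```python
import pandas as pd

# Hypothetical chemogenomic table: one row per (compound, target) active call.
actives = pd.DataFrame({
    "compound_id": ["C1", "C1", "C1", "C2", "C3", "C3"],
    "target_id":   ["T1", "T2", "T3", "T1", "T2", "T4"],
})

targets_per_compound = actives.groupby("compound_id")["target_id"].nunique()

def promiscuity_label(n_targets, promiscuous=3, highly_promiscuous=10):
    """Illustrative three-class label; HitDexter's own definitions differ in detail."""
    if n_targets >= highly_promiscuous:
        return "highly promiscuous"
    if n_targets >= promiscuous:
        return "promiscuous"
    return "non-promiscuous"

labels = targets_per_compound.map(promiscuity_label)
print(labels)
# Paired with fingerprints (as in the previous sketch), such labels would feed
# a multiclass learner, e.g. sklearn.ensemble.ExtraTreesClassifier.
```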
Fig. 2 Schematics of the HitDexter tool (https://nerdd.univie.ac.at/hitdexter3/) for the prediction of frequent hitters. HitDexter (v2.0)41 uses chemogenomic information as indexed in PubChem and extremely randomized tree models to distinguish non-promiscuous molecules from promiscuous or highly promiscuous matter – in biochemical assays – that would deserve additional investigation in medicinal chemistry programs. Individual models for different promiscuity-inducing mechanisms are available to the user. A more recent version (v3.0)46 extends the domain of applicability of the tool by leveraging a multilayer perceptron for the prediction of promiscuity in cell-based assays (MCC up to 0.648). It is argued that further model improvements will require consideration of the assay types and conditions, which poses an important roadblock with respect to data availability/scarcity and homogeneity.
The prediction of promiscuity may de facto illuminate drug design if new intuition is extracted. In a study by Swamidass and colleagues, a convolutional neural network was trained on PubChem data with the goal of predicting promiscuity and informing chemists about which moieties correlate with reactivity. However, a sensitivity of 24% suggests that a large portion of reactive molecules remain unnoticed by the deep learning method and that improvements are required to more reliably assist decision-making.47 In another case, using ECFP4 to describe molecules and screening a battery of learning algorithms, it was found that certain motifs could be associated with promiscuity.48 Still, one is also advised to practice caution in such analyses. A major challenge in modern ML is linking model-extracted knowledge from feature importance values to the physical phenomena they indirectly represent. Not infrequently, there is a disconnect or an experimentally non-verifiable hypothesis that can divert attention or induce falsely generalizable conclusions. With recent advances in the interpretation of ML pipelines (e.g. ref. 49), together with the realization that extracted intuition is biased by both algorithms and molecular representations,50 greater emphasis has been placed on experimentation as a means of verifying data patterns.51 That said, it would be critical to assemble a larger collection of screening compounds with homogeneously generated labels, e.g. accounting for test concentrations. Those datasets will ultimately enable the creation of more accurate and explainable/interpretable models that are currently less accessible.
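One way to at least ground feature importances chemically is to map the most influential fingerprint bits back to the atom environments that set them; the sketch below (RDKit bitInfo plus any per-bit importance vector, e.g. feature_importances_ from the random forest sketched earlier) is an illustrative aid, not the interpretation workflow of the cited studies:

```python
from collections import defaultdict
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def important_substructures(smiles_list, importances, radius=2, n_bits=2048, top_k=5):
    """Map the top-k most important Morgan bits back to SMILES fragments."""
    top_bits = set(int(b) for b in np.argsort(importances)[::-1][:top_k])
    fragments = defaultdict(set)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        bit_info = {}
        AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits, bitInfo=bit_info)
        for bit in top_bits & set(bit_info):
            for atom_idx, rad in bit_info[bit]:
                # Extract the atom environment that switched this bit on.
                env = Chem.FindAtomEnvironmentOfRadiusN(mol, rad, atom_idx)
                submol = Chem.PathToSubmol(mol, env)
                if submol.GetNumAtoms():
                    fragments[bit].add(Chem.MolToSmiles(submol))
    return dict(fragments)
```

Whether the recovered fragments actually drive the physical phenomenon is, as argued above, a question that only experimentation can settle.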
In 2005, Feng et al.53 screened 1030 ‘drug-like’ molecules at 30 and/or 5 μM using both DLS and a detergent-based assay. The collected data enabled the construction of a naïve Bayesian and a recursive partitioning model that on average presented a misclassification rate of 26% for a random set of molecules. A random forest, on the other hand, was more proficient in detecting patterns (11% misclassification rate),53 but with a relatively high class imbalance the results should be examined with caution. The data have nonetheless motivated the development of more sophisticated models. Using topological molecular descriptors and support vector machines (SVMs), a model was trained wherein 1319 aggregators and 128325 non-aggregators served as the knowledge base.54 At such high class imbalance (i.e., 1% aggregators:99% non-aggregators), a high retrieval of non-aggregators is expected at random. One may argue that the results of a PubChem and MDDR compound screen with said SVM model are aligned with the background class frequency (97.9% predicted as non-aggregators for PubChem and 1.14% predicted as aggregators for MDDR compounds). This underscores the importance of naïve predictions based on simple class statistics and of baseline models. The analysis does not refute the validity of the SVM, since an independent pseudo-prospective screen retrieved 71% of known aggregators (12 out of 17 aggregators, corresponding to 100% of the validation set). One may argue that model improvement is possible/desirable, and that the currently available training data is likely the Achilles heel in the prediction of small colloidally aggregating molecules (SCAMs).
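The weight of class imbalance on apparent performance is easy to demonstrate with a majority-class baseline; the labels below are synthetic and serve only to reproduce the ~1:99 imbalance discussed above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Synthetic labels with ~1% aggregators and uninformative placeholder features;
# no real assay data are involved.
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)  # 1 = aggregator
X = rng.random((10_000, 8))

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

# ~99% accuracy from class frequencies alone, yet MCC = 0: the naive yardstick
# any SCAM classifier should be compared against.
print("accuracy:", accuracy_score(y, y_pred))
print("MCC:", matthews_corrcoef(y, y_pred))
```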
Recognizing that caveat and leveraging a recent surge of high-quality SCAM data, ChemAGG55 was implemented as a tree-based model (XGBoost). ChemAGG utilizes molecular descriptors calculated from a training set with aggregators and non-aggregators at a 1:2 ratio. Irrespective of the employed features – physicochemical, fingerprint, pharmacophore or fusion representations – the method performed very well on both training and test sets (ROC AUC = 0.987 and 0.976, respectively), and was able to identify patterns widely accepted as predictive of colloidal aggregation, such as high clogP and the number of hydroxyl groups (cf. Fig. 2). In a similar fashion, the SCAM detective tool56 was developed with a particular focus on assay conditions and their role in confounding predictions. Further, the SCAM detective tackles an often-overlooked topic in ML – the quantification of the applicability domain and uncertainty. Ultimately, its goal is to provide a better balance between precision and recall relative to its predecessors, while using ECFP6-like fingerprints and data from AmpC β-lactamase and cruzain inhibition assays. Most interestingly, a web application is freely available to the community and will prompt alerts whenever query molecules fall outside said domain of applicability, i.e. when predictions are inherently less confident. With identical concerns in mind, we have recently contributed DeepSCAMs.57 It leverages DLS data at a fixed and typical high-throughput screening concentration (30 μM) to predict the aggregating behaviour of small molecules in a defined aqueous buffer. Further, DeepSCAMs gauges the prediction (un)certainty through the label probability. The method employs both Morgan fingerprints (radius 3, bit size 2048) and physicochemical descriptors calculated for 916 molecules in the training set, and a feed-forward neural network architecture with three hidden layers (Fig. 3). Its performance compared favourably against competing methods and against a short survey comprising 15 seasoned medicinal chemists. Overall, it evidenced the ability to detect hidden and nuanced data patterns that ought to be more broadly challenged through experimentation (balanced accuracy = 78%; MCC = 0.58). For a more realistic proof of concept, DeepSCAMs was also prospectively queried with 65 diverse small molecules. Experimental DLS evaluation confirmed an accuracy of 80% – higher than the average for the expert panel (ca. 61%) – suggesting that this and similar workflows can serve as auxiliaries in the identification of potential compound liabilities. One may expect that future developments will focus on augmenting training datasets to push the boundaries of what is predictable and unknown under an ML lens.
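In the same spirit, a feed-forward network over concatenated fingerprints and descriptors can be put together with RDKit and scikit-learn; the descriptor selection and layer sizes below are my own illustrative choices and do not reproduce the published DeepSCAMs architecture or hyperparameters:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, Crippen, Descriptors
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def featurize(smiles, radius=3, n_bits=2048):
    """Concatenate a Morgan fingerprint (radius 3) with a few physicochemical descriptors."""
    mol = Chem.MolFromSmiles(smiles)
    fp = np.array(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits))
    desc = np.array([
        Descriptors.MolWt(mol), Crippen.MolLogP(mol), Descriptors.TPSA(mol),
        Descriptors.NumHDonors(mol), Descriptors.NumHAcceptors(mol),
    ])
    return np.concatenate([fp, desc])

def train_scam_classifier(smiles_list, labels):
    """Feed-forward network with three hidden layers (illustrative sizes only)."""
    X = np.array([featurize(s) for s in smiles_list])
    y = np.array(labels)  # 1 = aggregator (SCAM), 0 = non-aggregator
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(512, 128, 32), max_iter=500, random_state=0),
    )
    return model.fit(X, y)

# predict_proba on the fitted pipeline exposes label probabilities that can be
# read as a crude (un)certainty estimate, in the spirit of DeepSCAMs.
```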
It is known that SCAMs are agnostic to protein targets and that a multitude of chemotypes have the potential to disrupt pharmacology networks.21 Interestingly, independent studies have reached similar conclusions with regard to the likely prevalence of SCAMs in the literature, either through predictive models or random experimentation.53,55,57 Estimated at one in every five molecules (∼20%), SCAMs are more common than desirable and likely to confound ML models for bioactivity prediction. Because ChEMBL data are routinely mined to build such models and ruling out entities with aggregation-based bioactivity profiles has historically not been practised, some bioactivity prediction tools might be flawed. Aggregators in training data might perpetuate the physical traits that lead to false positive hits in screening assays if they are not eliminated in a data curation step.58 While DeepSCAMs, ChemAGG and the SCAM detective have shown acceptable utility, we must nonetheless further expand our knowledge base. In doing so, one will be able to more reliably use in silico tools to flag problematic molecules.
Considering their estimated frequency,53,55,57 SCAMs can be considered ‘outliers’ or ‘anomalies’ from a data science perspective. In fact, seeing frequent hitters in general as ‘anomalies’ has been intuitively adopted in experimental practice, taking into account analyses of hit rates.36 One may thus speculate that SCAMs and other con artists present structural features that set them apart from true hits in a concentration- and assay-dependent manner. A toolkit comprising disparate algorithms is at our disposal to identify anomalies in semi-supervised or unsupervised fashion.59,60 Anomaly detection is an established concept in ML – e.g. for denoising images61 – but has been sparingly exploited in the chemical sciences. In rare examples, it was used to detect unexpected signal fluctuations in the context of analytical chemistry and process control.62,63 To the best of my knowledge, no proof-of-concept has been disclosed in discovery chemistry programs, wherein detecting anomalies/outliers/novelties may find wide applicability in experiment prioritization and as a driver for innovation.
Here, I argue that the anomaly detection concept may be extensible to flagging nuisances. An isolation forest was built as a baseline for unsupervised SCAM prediction. I employed the training data and descriptor set used in DeepSCAMs, while assuming the same concentration and buffer dependence for the predicted readout. In short, isolation forests work by recursively partitioning data in a series of trees and separating (isolating) anomalous examples in the training process. The average number of splits required to isolate a training example indicates how different it is from the others. In practice, anomalies require a lower number of splits and are thus more easily separated than normal observations.
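A baseline of this kind can be assembled in a few lines with scikit-learn; the feature matrix is assumed to hold DeepSCAMs-style fingerprints/descriptors, and the contamination value and other hyperparameters shown are illustrative:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_scam_anomaly_detector(X_train, contamination=0.29, n_components=None):
    """Unsupervised isolation forest over molecular features.

    `contamination` mirrors the expected fraction of aggregators in the
    training pool; setting `n_components` (e.g. 10) adds the PCA step
    discussed below. Hyperparameters are illustrative only.
    """
    steps = [StandardScaler()]
    if n_components is not None:
        steps.append(PCA(n_components=n_components))
    steps.append(IsolationForest(n_estimators=500,
                                 contamination=contamination,
                                 random_state=0))
    return make_pipeline(*steps).fit(X_train)

# predict() on the fitted pipeline returns -1 for anomalies (putative SCAMs)
# and +1 for inliers; decision_function() yields a continuous anomaly score.
```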
Considering a contamination of 29% (matching the class imbalance), an optimized model was able to identify only 27 ± 1% of confirmed SCAMs in the training data (repeated analyses: n = 20). Reducing dimensionality to 10 principal components – i.e., accounting for >99% of the data variance – did not significantly improve the algorithm's performance (29 ± 0.7%). This indicates that new ways of representing small molecules should be investigated to better discriminate SCAMs from non-SCAMs, or that the variables controlling (non-)aggregation currently elude our perception. Most interestingly, challenging the ML model with an evaluation set of 31 SCAMs and 34 non-SCAMs led to correct label identification in 52% and 79% of the cases, respectively. The result reinforces how challenging nuisance detection is. It also suggests that further exploration of the concept is a reasonable avenue, either by improving datasets and representations and/or by experimenting with different algorithms.
It is clear that ML has the potential to facilitate chemical discoveries at an unprecedented pace, yet some topics – such as the one focused on here – remain underexplored by the cheminformatics and data science communities relative to others. This does not imply a secondary role or lower importance, but highlights that there are unsolved challenges that must be addressed first. One such challenge is the need for superior datasets and benchmarks, with examples better linked to experimental context and measured endpoints. Reporting high-quality data, even if negative, will be key to the implementation of improved ML models. While supervised ML has been the go-to approach, I have argued that unsupervised approaches, and in particular anomaly detection algorithms, may provide a fresh and innovative vantage point on data and discovery chemistry. I envisage that the automated mapping of nuisances, together with uncertainty estimation and integration with experimental context, will ultimately enable the design of quality matter to interrogate biology. If efficient, such ML pipelines can lower the likelihood of attrition in translational studies.