Jan
Wollschläger
* and
Floriane
Montanari‡
Machine Learning Research, Bayer Pharmaceuticals, Müllerstr. 178, Berlin, 13353, Germany. E-mail: jan.wollschlaeger@bayer.com
First published on 30th July 2024
The solubility in a given organic solvent is a key parameter in the synthesis, analysis and chemical processing of an active pharmaceutical ingredient. In this work, we introduce a new tool for organic solvent recommendation that ranks possible solvent choices requiring only the SMILES representation of the solvents and solute involved. We report on three additional innovations: first, a differential/relative approach to solubility prediction is employed, in which solubility is modeled using pairs of measurements with the same solute but different solvents. We show that a relative framing of solubility as ranking solvents improves over a corresponding absolute solubility model across a diverse set of selected features. Second, a novel semiempirical featurization based on extended tight-binding (xtb) is applied to both the solvent and the solute, thereby providing physically meaningful representations of the problem at hand. Third, we provide an open-source implementation of this practical and convenient tool for organic solvent recommendation. Taken together, this work could be of benefit to those working in diverse areas, such as chemical engineering, material science, or synthesis planning.
Modelling of solubility can be roughly divided into physical, data-driven and hybrid approaches. Physical approaches aim to describe actual physical interactions involved in the observed solubility from first principles.8 Purely data-driven approaches lie on the other end of the spectrum and aim to employ statistical models without a priori knowledge of the physical effects involved. Due to its importance in drug discovery considerations, aqueous solubility prediction has been a major subject of focus. Hence, a plethora of approaches such as fragment-based,9 molecular-dynamics,10 general-solubility equation,11–13 Hansen solubility parameters and Hildebrandt solubility parameters,14,15 and first principle ab initio calculations have been reported.16–19 More recently, we see a switch of focus towards quantitative structure property relationships (QSPR) via statistical techniques.20–24 The Open Notebook Science challenge25 is one of the few data sources of organic solubility (i.e., solubility in solvents other than water), but has received little attention in comparison to the aqueous case.26,27
Vermeire et al.28 reported on a combined physical/ML approach that is applicable to a wide range of organic solvents and temperatures (RMSE ≈ 0.89, MAE ≈ 0.62). A recent communication29 reported on a hybrid approach towards organic solubility, that combines machine learning (ML) and ab initio calculations. In this study, a focused set of 14 physicochemical descriptors were chosen based on their significance in influencing the dissolution process. These descriptors encompass both properties derived from ab initio computations as well as the experimentally determined melting point. With this setup, an error of RMSE(logS) ≈ 0.7 was obtained, which is close to the experimental accuracy of 0.7–1.0 log units for drug-like molecules.5,30 While this work does illustrate the advantages of hybrid physical/ML modelling, it is still somewhat restricted from a practical point of view: it requires expensive ab initio calculations, supports a limited set of solvents (acetone, benzene, and ethanol), and requires the experimental melting point.
Vassileiou et al.31 further support the observation that hybrid physical/ML approaches lead to improved predictive performance. Here, the conductor-like screening model for real solvents (COSMO-RS) with RMSE(logS) ≈ 1.0 is critically compared against machine learning on 2D descriptors (RMSE(logS) ≈ 0.92), as well as hybrid models thereof, with hybrid models (RMSE(logS) ≈ 0.8) giving the best results. While the set of solvents was extended over ref. 29, it is still restrained to a limited set, and also shares the requirement of an experimental melting point as well as expensive electronic structure calculations.
In our quest to implement a more convenient solvent recommendation tool, we thus asked ourselves two questions:
Could the expensive ab initio calculations be replaced by faster approximate methods?
Are experimental melting points strictly required to produce accurate QSPR solubility predictions?
To replace the expensive ab initio models, a computationally efficient yet accurate solvation model based on the analytical linearized Poisson–Boltzmann (ALPB) parameterized for the extended tight binding (xtb) method32 was selected as a replacement for the more expensive COSMO-RS used in prior studies. The proposed method performs well over a broad range of systems and applications. For hydration free energies of small molecules, xtb(ALPB) is reaching the accuracy of sophisticated explicitly solvated approaches, with a mean absolute deviation of only 1.4 kcal mol−1 compared to experiment.32 While this compares favorably against COSMO (MAE = 2.19 kcal mol−1), it is worse than the computationally more elaborate COSMO-RS (MAECOSMO-RS = 0.52 kcal mol−1).33
But logarithmic octanol/water partitioning coefficients are computed with a MAE ≈ 0.65 log units,32 which is comparable to the performance of COSMO-RS with MAE ≈ 0.57.34 This indicates that xtb gives a consistent description of differential solvent effects,32 which we regarded as crucial for hybrid physics/ML descriptors in organic solvent recommendation. Concerning the second question, it has to be noted that the solubility of molecules in their solid state depends not only on the energy gained by solvation, but also upon the energy required for breaking up the crystal lattice in the solid.30 Therefore, including these solid state effects into the solubility modelling would likely lead to significant improvements. However, the modelling of the solid–solid interactions proves to be exceptionally difficult, as it depends on the minute details of the geometrical arrangement of the molecules in the crystal lattice.30 This can also be seen by the large errors in melting point prediction RMSE(ΔTmp) ≈ 40 − 50 °C, even though the experimental error is probably less than 5 °C.35,36 Although prohibitively computationally expensive at the moment, it seems quite likely that ab initio crystal structure prediction (CSP) could be involved to model the solid state crystal lattice energy contribution to solubility phenomena in the long term.37 As solubility differences between polymorphs (2-fold) are considerably lower than errors in solubility prediction (5- to 10-fold), it might not even be required to identify the correct polymorphic form.38 However, for the time remaining for such CSP tools to become feasible for routine use, the exact structure of the crystal phase remains elusive.
This raises the question whether it is possible to pose the solubility problem in a manner that eliminates the need for a description of the solid state, motivating the main idea behind the present work: modelling organic solubility/solvent selection as a ranking problem.
Temperature variations are ignored, and hence contribute to the overall error of the model. But, with a temperature span of 20–30 °C, with most data points recorded at 25 °C, we believe the system to be well-calibrated to typical lab room temperature conditions.
As only provides qualitative labels for solubility, the following ordinal encoding was applied: partially soluble: (1), thermally soluble: (2), kinetically soluble: (3), readily soluble: (4). Entries where either solubility information, solvent SMILES or solute SMILES are missing were removed. Data from Bayer's internal solvent screening lab is also qualitative, and was converted into an ordinal numbering encoding in the same manner as the source .
Next, only those data points were kept that contain measurements in at least 3 different solvents. The distribution of solvents after this preprocessing is shown in Table 1 of the ESI.† Only the most frequent 60 solvents were selected to ensure a sensible coverage within the training data and removal of unusual solvents.
However, similar results are obtained (see ESI,† evaluation on full dataset) on the full dataset. Models provided in the open source package are trained on the full dataset ( + ).
cddd: continuous and data-driven molecular descriptors47 were selected because of the generality of this molecular representation, as well as the fact that the logD helper task applied during training of cddd is chemically strongly related to the challenge of relative solubility.
cosmors: the conductor-like screening model for real solvents44 as implemented in open-COSMO-RS45 was selected because it is a de facto standard in organic solubility prediction.
ecfp_bit/count: extended-connectivity fingerprints in both bit-based and count-based flavors were considered as another commonly used molecular representation baseline.
rand: a uniform random real input, thus representing a random coin toss. Has no information about the problem and should thus exhibit no predictive performance.
prior: a random enumeration of the solvent classes, corresponding to the prior baseline (order all solvents according to their overall solubility on all datapoints, e.g. DMSO > hexane).
vermeire: the model described recently by Vermeire et al.28 is used as a baseline for current state-of-the-art performance in organic solubility prediction. Refers to a distinct model.
xtb: at last, we compare an ET trained on our proposed hybrid physical/ML features, as described in the section “XTB hybrid features”.
For each of these features, three different types of solubility relation were considered (where applicable): absolute refers to an absolute solubility model, while relative_concat and relative_diff describe relative solubility models employing concatenation or subtraction as the reduction operation (refer to Fig. 4).
ρ (rand): random splits that take the solute SMILES column into account, such that every solute compound is assigned into a single CV-fold.
ρ (Butina): Taylor–Butina clustering is based on exclusion spheres at a given Tanimoto similarity level.48,49 The way the clusters are built allows all of the molecules belonging to each cluster to have a Tanimoto value above or equal to the similarity cutoff used. During each iteration, molecules are visited and labeled as either cluster centroid or as a cluster member. As such, it is an unsupervised non-hierarchical clustering algorithm that guarantees that every cluster contains molecules which are within a distance cutoff of the centroid molecule, and was chosen here as a method to generate more challenging splits that also take the chemical similarity into account. Butina clustering was performed using the RDKit with distance_threshold = 0.4.
ρ (solvent): random splits on the solvent SMILES column have been performed in such a way that each solvent is randomly assigned to one CV-split, beginning with the most frequent solvents first to balance the size of the CV-folds. This provides a very challenging generalization task, as predictions have to be made on solvents that have not been encountered during training.
ρ (ood, nova): an out-of-distribution evaluation was performed on a qualitatively-labelled dataset available in the public domain by Pillong et al.39
ρ (ood, bayer): an out-of-distribution evaluation was performed on a qualitatively-labelled Bayer-internal dataset.
All reported results (except vermeire) refer to extremely randomized trees (ET) regressor models, evaluated within an outer five-fold cross-validation. Error bars correspond to the variation across CV folds.
A simple way to establish such a ranking that adheres to the outlined requirements is to compute the mean of the signed edge weights over each node. For each solvent node the weight of each incoming edge is added while all outgoing edge weights are subtracted, and finally divided by the overall number of edges:
This simple yet effective algorithm was compared against the ranked pairs53 and the PageRank54 algorithms as two alternatives for establishing a linear ranking.
Both alternatives showed degraded predictive performance (see Fig. S6 and ESI†) over the simple procedure outlined above. Hence, we focused on the simple mean method outlined here.
We acknowledge that deeper investigation into the details of these algorithms (e.g. optimizing the damping factor) might improve the performance so that the superiority of the mean algorithm can not be finally concluded.
Analysis of the logS values show that the range in water is −11.7 to 1.01 with a mean of −3.05, while the logS-range of organic (non-water) solvents reaches from −7.30 to 1.11, with a mean of −0.826. Molecular weight (MW) of the open notebook dataset was found to be largely normally distributed centered on MW ≈ 200.
This low molecular weight average already indicates that does not reflect the typical chemical space of active pharmaceutical ingredients.
In order to obtain a more realistic comparison for drug-like chemical space, it was therefore decided to include two additional datasets: a qualitatively-labelled organic solubility dataset,39 referred to as for the remainder of this text, and a Bayer-internal organic solubility dataset, referred to as hereafter. Table 1 summarizes the three different data sources used in this work. A more detailed listing of the solvents can be found in the ESI.† There are no overlaps between the different datasets except for one solute with the SMILES that is contained in both and . Known drug-like chemical space encompasses compounds with MW ≤ 800 g mol−1.55 In accordance with this, both pharmaceutical datasets exhibit significantly larger average molecular weights, as shown in Fig. 3A. Furthermore, a 2D uniform manifold approximation and projection (UMAP) analysis56 (see Fig. 3B) on the count-ECFP representation shows that the dataset covers only a small part of the drug-like chemical space of the dataset.
Which raised the question: Is it possible to train models on the large, but low MW, training corpus that generalize towards broader (high MW) chemical space as exemplified by the drug-like datasets and ?
One challenge that we encountered is that only a limited set of solvents is supported, and that the method cannot be trivially expanded towards a broader solvent selection.
To work around this limitation, we decided (see Fig. 4A) to compute 11 physical properties as exposed by alpb-xtb for all 23 solvents supported. Most important properties include energy of solvation Gsolv, partial contributions Ghb, Gsasa thereof, HOMO–LUMO gap ΔHOMO/LUMO, and dipole moment μsolu,solv (see ESI† for a listing of all features). Thus, a 11 × 23 = 253-dimensional vector is obtained for the solute and solvent part, respectively. As an added advantage, this formulation gives us a simple approximative way of representing solvent mixtures: as a linear combination of the vector representation of each mixture component with their corresponding fraction in the mixture as multiplicative coefficient.
Both solvent and solute components are concatenated to yield a feature vector totalling 506 dimensions. For the absolute solubility recommender (see Fig. 4B), the feature vector is used as-is. For the differential solubility, an edge (solvent a, solvent b) is represented by either subtracting or concatenating the solvent parts of the feature vector, thus arriving at a differential feature representation for the solvent pair (see Fig. 4C).
The same construction was also applied to all other features (e.g. ecfp, cddd), to obtain relative_diff and relative_concat flavors of these features in the same manner.
One possible explanation could be that random feature splits characterizing the ET lead to smoother regression outcomes, as compared to bagging approaches.57 Based on these results, we chose extremely randomized trees (ET) as the regression model for all other experiments. Next, ETs with different feature types and solubility relations were evaluated within different data splits (see Methods section for additional details). The results of the performance evaluation are summarized in Table 2. Throughout the table, the best performing models have been highlighted in bold typeface, under consideration of the reported errors.
Feature | Relation | ρ (rand) | ρ (Butina) | ρ (solvent) | ρ (ood, nova) | ρ (ood, bayer) |
---|---|---|---|---|---|---|
a Refers to a distinct model. | ||||||
rand | absolute | −0.015 ± 0.1 | 0.021 ± 0.1 | −0.043 ± 0.1 | −0.018 ± 0.1 | −0.087 ± 0.2 |
prior | absolute | 0.397 ± 0.03 | 0.386 ± 0.05 | 0.03 ± 0.09 | 0.293 ± 0.02 | 0.582 ± 0.01 |
ecfp_bit | absolute | 0.403 ± 0.08 | 0.282 ± 0.1 | 0.091 ± 0.2 | 0.263 ± 0.05 | 0.312 ± 0.06 |
relative_concat | 0.509 ± 0.08 | 0.484 ± 0.1 | 0.575 ± 0.1 | 0.662 ± 0.07 | 0.361 ± 0.09 | |
relative_diff | 0.605 ± 0.05 | 0.481 ± 0.1 | 0.464 ± 0.3 | 0.715 ± 0.02 | 0.578 ± 0.04 | |
ecfp_count | absolute | 0.55 ± 0.08 | 0.463 ± 0.06 | 0.273 ± 0.1 | 0.296 ± 0.03 | 0.581 ± 0.03 |
relative_concat | 0.576 ± 0.05 | 0.53 ± 0.06 | 0.482 ± 0.4 | 0.74 ± 0.01 | 0.589 ± 0.03 | |
relative_diff | 0.546 ± 0.05 | 0.509 ± 0.02 | 0.469 ± 0.3 | 0.733 ± 0.01 | 0.618 ± 0.03 | |
cosmors44,45 | absolute | 0.554 ± 0.04 | 0.484 ± 0.03 | 0.474 ± 0.09 | 0.327 ± 0.02 | — |
relative_concat | 0.602 ± 0.04 | 0.567 ± 0.06 | 0.542 ± 0.2 | 0.673 ± 0.04 | — | |
relative_diff | 0.614 ± 0.03 | 0.569 ± 0.04 | 0.562 ± 0.1 | 0.641 ± 0.01 | — | |
vermeire28a | absolute | 0.645 | — | — | 0.554 | 0.612 |
cddd 47 | absolute | 0.605 ± 0.04 | 0.569 ± 0.03 | 0.279 ± 0.2 | 0.345 ± 0.04 | 0.608 ± 0.01 |
relative_concat | 0.657 ± 0.02 | 0.608 ± 0.06 | 0.594 ± 0.2 | 0.744± 0.01 | 0.638 ± 0.01 | |
relative_diff | 0.657 ± 0.03 | 0.602 ± 0.06 | 0.569 ± 0.3 | 0.733± 0.01 | 0.628 ± 0.01 | |
xtb | absolute | 0.591 ± 0.05 | 0.554 ± 0.07 | 0.5 ± 0.06 | 0.322 ± 0.02 | 0.602 ± 0.01 |
relative_concat | 0.658 ± 0.03 | 0.604 ± 0.08 | 0.572 ± 0.3 | 0.736± 0.01 | 0.644 ± 0.01 | |
relative_diff | 0.638 ± 0.02 | 0.612 ± 0.06 | 0.578 ± 0.2 | 0.703 ± 0.02 | 0.647 ± 0.01 |
The most significant observation is that relative models outperform the absolute solubility models throughout all features and evaluation scenarios considered.
We will now discuss the results of the different evaluation scenarios in more detail.
First, we consider the evaluation within random solute splits. In Fig. 5B, the ranking performance on random splits is shown for different features, which corresponds to column ρ (rand) of Table 2.
A random null model trained on a uniform random variable () has no information about the outcome. Consequently, it exhibited the lowest performance, as it cannot possibly arrive at a sensible ranking decision. With the solvent prior order, a sharp jump of 40 percent points was seen, as compared to a random coin toss. This aligns with chemical intuition, as some solvents are intrinsically more suited than others (e.g. DMSO is in general a better organic solvent than hexane). The highest performing features are cddd, xtb and the vermeire model. However, for vermeire, leakage of the test data has to be considered, as 54.4% of solute compounds considered here were part of the SolProp dataset58 used in training this model.28 Therefore, the reported ρvermeire,rand = 0.645 provides an optimistic upper bound of the true performance. At first sight, it might seem surprising that the fast approximative xtb features lead to better results than the more sophisticated/computationally expensive COSMO-RS. However, we would like to point out, as noted before, that the alpb/xtb combination already reaches competitive accuracy with regards to differential solubility.32 Furthermore, the observation that pure machine learning models outperform COSMO-RS on a similar task31 are in agreement with this finding.
Overall, the most significant trend is that the relative solubility framing consistently improves performance across all features.
Next, Butina clustering was employed to obtain more challenging CV splits (see Fig. 6A). Similar trends were observed, with the differential frame again outperforming the absolute solubility frame. Comparing these results with the random splits, a more significant drop in performance is registered for the ecfp features, while cddd, cosmors, and xtb seem to generalize better across chemical space.
To investigate the generalization capability towards novel solvents, a split was performed by randomly assigning solvents towards different CV folds (Fig. 6B). This solvent split provides a very challenging generalization task, as models are required to train on one set of solvents and then predict towards a new set of solvents. Again, the differential solubility models trained on xtb and cddd features performed best, and largely retained the ranking performance on novel solvents as compared to random splits. The higher variance observed for the solvent split could be attributed to a mixture of two effects: firstly, the fact that some solvents exhibit more unique solubility characteristics, thereby introducing more variation. Secondly, the varying amount of training data available for each solvent (see Table S1 and ESI†). The first explanation seems more likely, because data was stratified in such a way that the top solvents are equally distributed among the splits. This is further corroborated by the observation that solvents with unique solubility characteristics contribute positively to the prediction accuracy in the more detailed per-solvent error analysis (vide infra).
The final two plots compare the ranking performance when training on the low-MW dataset and evaluating on the higher-MW (Fig. 6C) and (Fig. 6D) datasets.
Referring back to Fig. 3A, it is apparent that dataset is chemically more similar to while the bayer data is characterized by a even larger distance in the mean MW.
It is therefore reasonable that most models outperformed the prior on the nova ood evaluation (Fig. 6C), while only minor improvements over the prior were observed for the bayer data (Fig. 6D). While the large difference of over 300 g mol−1 between and provides a reasonable explanation for the low generalization performance, it is nonetheless important to acknowledge that discrepancies in data quality between the two out-of-distribution datasets cannot be discounted. Due to the high runtime requirements combined with the fact that the COSMO/COSMO-RS calculations failed for many inputs, the evaluation discussed here is restricted on the subset of datapoints where COSMO-RS calculations were successful, to ensure a fair comparison. However, the evaluation on the full dataset (see ESI, p. 3†) shows the same overall trends. In the end, the xtb + relative_concat was chosen, as it is the best performing feature combination in general, and furthermore also performed best in the ood evaluations, which are arguably the practically most relevant evaluation scenarios.
Overall, when summarizing the results across all evaluation scenarios, the most significant enhancement arises from the relative solubility framework. This framework consistently boosts ranking performance across a diverse array of feature selections. Notably, it is remarkable to observe that a variety of features, including sparse fingerprints (ecfp_bit, ecfp_count), dense pretrained embeddings (cddd), and physical descriptors (cosmors, xtb), consistently exhibit improved ranking performance within the relative solubility frame. This observation reinforces the initial concept (see Fig. 1) that excluding the solute/solid state from the solubility equation offers a practical and widely applicable approximation.
From another perspective, the question can be formulated: What solubility differences ΔlogS can we sensibly require to be ranked? To answer this question, a minimum solubility threshold, ΔlogSmin, was defined. Then, all measurements for which there exists another measurement for the same solute, such that ΔlogS < ΔlogSmin, have been removed from the test set at random. Thus, for each solute that is to be ranked, similar solvents below the threshold do not need to be ranked with regards to each other. Thereby, the ranking task is simplified as it only focuses on solvents with significant solubility differences.
As shown in Fig. 7, most of the improvement in the ranking performance is seen from ΔlogSmin = 0.0 → 1.0 which fits to the experimental solubility errors of 0.7–1.0 log units.30,59
It can be concluded that the ranking performance without any ΔlogS filtering are pessimistic, lower-bound, estimates. Furthermore, thresholds in the range ΔlogSmin = 0.5 − 1.0 should be applied, below which the models ranking performance degrades significantly.
Fig. 8 Schematic overview of the Solvmate solvent recommendation web app taking the list of solvents as IUPAC names and the solute structure as input, and producing a solvent ranking by solubility together with the most similar database matches (in the training dataset) as output. The OPSIN software40 is used for parsing IUPAC names into chemical structures. Solutes are either entered as SMILES or via a molecular structure editor. |
On the output side, a ranking of the solvents is shown as box plot, ordered from best solvent on the top to worst solvent on the bottom. The estimated uncertainty is obtained from the variance of an ensemble of five model instances trained on bootstrapped subsamples of the training data.
Furthermore, the recommender also displays the matching solubility data points in the training corpus that are closest in the feature space.
Fig. 9 Two examples showcasing problems encountered when producing solvent rankings. Parity plots for the solvent ranking of diclofenac (A), as reported in ref. 60 and dioxopromethazine HCl (B), as reported in ref. 61. Correct ranking positions are shown in gray, errors by one position in orange, and errors by more than one position in red. An ideal ranking would yield the dotted diagonal line. |
Footnotes |
† Electronic supplementary information (ESI) available: Additional statistics about the different datasets, example ranking outputs, xtb configuration, and the feature importance analysis. Furthermore, datasets used in this study are uploaded as .csv-files to facilitate reuse. See DOI: https://doi.org/10.1039/d4dd00138a |
‡ Current affiliation: OWKIN, Paris, France. |
§ Dataset available under the url: https://figshare.com/articles/dataset/Open_Notebook_Science_Challenge_Solubility_Dataset/1514952. |
¶ Found here: https://github.com/TUHH-TVT/openCOSMO-RS_conformer_pipeline. |
|| Possible users of the system include chemists, pharmacists, chemical engineers, material scientists, lab technicians, …. |
This journal is © The Royal Society of Chemistry 2024 |