Kevin P. Greenman,a William H. Greena and Rafael Gómez-Bombarelli*b
aDepartment of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA
bDepartment of Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, MA 02139, USA. E-mail: rafagb@mit.edu
First published on 4th January 2022
Optical properties are central to molecular design for many applications, including solar cells and biomedical imaging. A variety of ab initio and statistical methods have been developed for their prediction, each with a trade-off between accuracy, generality, and cost. Existing theoretical methods such as time-dependent density functional theory (TD-DFT) are generalizable across chemical space because of their robust physics-based foundations but still exhibit random and systematic errors with respect to experiment despite their high computational cost. Statistical methods can achieve high accuracy at a lower cost, but data sparsity and unoptimized molecule and solvent representations often limit their ability to generalize. Here, we utilize directed message passing neural networks (D-MPNNs) to represent both dye molecules and solvents for predictions of molecular absorption peaks in solution. Additionally, we demonstrate a multi-fidelity approach based on an auxiliary model trained on over 28000 TD-DFT calculations that further improves accuracy and generalizability, as shown through rigorous splitting strategies. Combining several openly-available experimental datasets, we benchmark these methods against a state-of-the-art regression tree algorithm and compare the D-MPNN solvent representation to several alternatives. Finally, we explore the interpretability of the learned representations using dimensionality reduction and evaluate the use of ensemble variance as an estimator of the epistemic uncertainty in our predictions of molecular peak absorption in solution. The prediction methods proposed herein can be integrated with active learning, generative modeling, and experimental workflows to enable the more rapid design of molecules with targeted optical properties.
Many theoretical methods have been developed for predicting molecular optical properties, including empirical tables, semi-empirical methods, time-dependent density functional theory (TD-DFT), and wavefunction-based methods.3,4 TD-DFT has been the most widely used method for at least the past decade because of its favorable accuracy/cost trade-off and its capacity to be coupled with continuum solvent approximations,5 and it has been benchmarked and reviewed extensively.6,7 In parallel to theoretical methods, researchers have also developed surrogate statistical models that predict UV/Vis spectra from molecular structure at a lower computational cost than TD-DFT. ML studies for predicting properties related to the electronically excited states of molecules have been reviewed recently.8,9
Limitations in previous statistical modeling efforts can be classified into three categories: data sparsity, molecular representations, and solvent representations.10 Many studies have focused on a narrow part of chemical space (e.g. a single dye family) because of the limited availability of large UV/Vis datasets. This data sparsity has been addressed recently with the publication of several experimental datasets,1,11–18 described in Table 1. There are also several large computed datasets of excitation energies available (Table 2). However, studies are still lacking on how the chemical diversity of the training data impacts model performance on new, unrelated chemical space. Many prediction methods have created molecular representations based on generic structure-based fingerprints or human-selected descriptor features. Most previous studies did not consider solvent effects, but many leveraged descriptors derived from quantum chemical calculations. The issue of data sparsity is related to the shortcomings of solvent representations in previous models; with relatively few examples of dyes measured in more than one solvent, it was sometimes easier to train a model only on data in the most commonly reported solvent to remove this complexity from the model.
Dataset | Entries | Dye | Solvent | Absorption | Emission | Other |
---|---|---|---|---|---|---|
ChemDataExtractor14 | 8467 | SMILES | Name | λmax, εmax | — | — |
ChemFluor16 | 4386 | SMILES | Name | λmax | λmax | Φ |
Deep4Chem17 | 20236 | SMILES | SMILES | λmax, σabs, εmax | λmax, σemi | Φ, τ |
DSSCDB1 | 5178 | SMILES | Name | λmax | λmax | — |
Dye aggregation15 | 4043 | SMILES | Name | λmax | — | — |
Fluorophores.org12 | 955 | Name | Name | Full, λmax | Full, λmax | Φ, τ |
NIST11 | 2306 | MOL file | — | Full | — | — |
PhotochemCAD13 | 552 | Name | Name | λmax, εmax | — | Φ |
UV/Vis+18 | 112 | Name | Name | Full | Full | — |
Among the previous studies on predicting absorption peak wavelengths or excitation energies, the work of Ju et al.,16 Kang et al.,25 and Joung et al.26 is particularly noteworthy because of the size of their training datasets and the accuracies this enabled them to achieve. Ju et al. trained a gradient boosted regression tree (GBRT) algorithm on composite fingerprints to predict the maximum absorption and emission wavelengths and photoluminescence quantum yield (PLQY) using a large set of experimental data they compiled from the literature. Kang et al. trained a random forest algorithm on a subset of the PubChemQC database24 to predict B3LYP/6-31G* excitation energies and oscillator strengths in a vacuum from molecular fingerprints. Joung et al. used their previously compiled experimental dataset17 to train a model that uses graph-convolutional networks (GCN) to predict multiple molecular optical properties, including the absorption and emission peak wavelengths.
Although these recent works achieved impressive accuracies, their reported performance may be more representative of how they would perform in substituent-selection applications as opposed to de novo design tasks with unseen chemistries. Recent reviews of ML best practices in chemistry and materials science have warned against data leakage from the same compound or composition being present under multiple measurement conditions.27,28 Random splitting into training/test or training/validation/test sets based on dye–solvent pair may not be sufficient for assessing model generalizability on this task because test error may be spuriously low if a dye appears in both training and test sets in different solvents or even if the molecules in the test set are too chemically similar to the training data. We set out to explore how these decisions impact model performance.
Additionally, previous work has not attempted to leverage a combination of computed and experimental data in predictions of optical spectra. This multi-fidelity approach is desirable because of the lower cost and greater availability of calculations with respect to experiments. Multi-fidelity methods have been demonstrated in several other applications for integrating data from multiple levels of theory or from theory and experiments.29–31 Furthermore, theoretical methods do not have a domain of applicability constrained by a training set, so they are more reliable for making predictions on chemistries with low similarities to existing data. These factors suggest that a multi-fidelity approach may improve model accuracy and generalizability on experimental predictions and may be more useful in active learning.
In this work, we leverage recently-compiled experimental datasets and directed message passing neural networks (D-MPNN)32 to address previous limitations with molecular and solvent representations. The D-MPNN approach learns representations for the dye and solvent that are optimized for predicting absorption properties. We compare our optimized representations to a state-of-the-art fingerprint-based method and to alternative solvent descriptors. Our method produces interpretable representations and estimates the uncertainty in its predictions. We emphasize the importance of using rigorous splitting techniques for assessing the ability of a model to generalize to unseen chemistries, and we show that incorporating results from physics-based calculations into model training improves performance across several large datasets. The predictive capability of our techniques will enable more rapid design of molecules with target optical properties.
For a subset of the complete dataset (only ChemFluor, DyeAgg, CDEx, and DSSCDB), we began with the optimized geometry calculated with BP86-D3/def2-SVP in ORCA and performed an additional TD-DFT calculation at the ωB97XD/def2-SVP level with solvent corrections in Gaussian.43 The solvent calculations were done using the integral equation formalism polarizable continuum model (IEFPCM) and Gaussian defaults for excited state solvation. This pipeline was completed for 6707 dye–solvent pairs.
We extracted the peak vertical excitation energy from each of these calculations according to the following procedure: (1) if none of the energies were in the range of 1–4 eV, choose the lowest energy; (2) if only one energy is in the visible range, choose that one; (3) if multiple peaks are in the visible range, choose the one with the highest oscillator strength. While the vertical excitation energy is not exactly analogous to λmax,abs because it does not account for nuclear vibronic effects, it is a relatively cheap computational surrogate that should improve the capability of a model to predict λmax,abs.
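The three-rule selection procedure above can be sketched as a short function. This is an assumed implementation for illustration only; the extraction script used in this work may differ in details such as the exact window bounds used for the "visible range".

```python
def select_peak(energies, osc_strengths, low=1.0, high=4.0):
    """Pick the vertical excitation energy (eV) used as the lambda_max surrogate.

    The 1-4 eV window is taken as the (approximate) visible range, per the
    procedure described in the text.
    """
    states = list(zip(energies, osc_strengths))
    in_window = [(e, f) for e, f in states if low <= e <= high]
    if not in_window:
        # Rule 1: no state in the window -> take the lowest-energy state
        return min(energies)
    if len(in_window) == 1:
        # Rule 2: exactly one state in the window -> take it
        return in_window[0][0]
    # Rule 3: several states in the window -> take the highest oscillator strength
    return max(in_window, key=lambda ef: ef[1])[0]
```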
The four solvent representations compared herein are Morgan fingerprints, ChemFluor Comprehensive General Solvent Descriptors (CGSD), Minnesota solvent descriptors,46 and Chemprop D-MPNN embeddings (SolventMPNN). We calculated the Morgan fingerprints with a radius of 4 and 256 bits. The five CGSD descriptors (developed by Ju et al.16 in conjunction with the FSD dye representation) were extracted from the work of Reichardt47 and Catalán48 and represent the polarity (ET(30)), acidity (SA), basicity (SB), dipolarity (SdP), and polarizability (SP) of a solvent. We also matched solvents with their seven corresponding descriptors from the Minnesota Solvent Descriptor Database:46 index of refraction (n), Abraham's H-bond acidity (α), Abraham's H-bond basicity (β), surface tension (γ), dielectric constant (ε), aromaticity (ϕ), and electronegative halogenicity (ψ). The solvent D-MPNN embeddings were optimized using a separate D-MPNN alongside that of the dye; this approach was previously shown to be successful in predicting solvation free energies.49 This is also similar to the “direct-solvent” approach of Chen et al.,50 except that our dye and solvent D-MPNNs are not connected to one another until their embeddings are concatenated before the fully-connected feed-forward neural network (FFNN).
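As an illustration, the solvent Morgan fingerprints described above (radius 4, 256 bits) can be computed with RDKit roughly as follows; this is a minimal sketch, and the exact RDKit options used in this work may differ.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def solvent_morgan_fp(smiles, radius=4, n_bits=256):
    """Return a 256-element 0/1 Morgan fingerprint vector, or None if the
    SMILES cannot be parsed (such dye-solvent pairs are dropped)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return list(fp)

vec = solvent_morgan_fp("ClCCl")  # dichloromethane
```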
All dye–solvent pairs for which any of the above features could not be calculated were dropped from the dataset.
D-MPNN hyperparameters (including hidden sizes, numbers of layers, dropout, batch sizes, learning rates, and warm-up epochs) were tuned using SigOpt.53 The GBRT used the hyperparameters reported by Ju et al., while all D-MPNN models used hyperparameters that were tuned on the same ChemFluor dataset used by Ju et al. The details of the model architectures, training, and predictions are given in the ESI.†
Dataset | Measurements |
---|---|
Deep4Chem | 16585 |
ChemFluor | 4170 |
DyeAgg | 3626 |
CDEx | 1915 |
DSSCDB | 2438 |
Total | 28734 |
The maximum absorption wavelengths of the dyes represented in our dataset cover the entire visible spectrum and extend into the ultraviolet and near-infrared regions. The molecules in the dataset include many of the common dye substructures and families. Fig. 2 illustrates the distribution of wavelengths and the prevalence of each substructure. The peak wavelength distributions of the individual datasets are different from that of the combined dataset and from each other, as shown in Fig. S1.†
The five datasets differ in their coverage of chemical dye space, as shown in Fig. 3. The largest single data source (Deep4Chem) also has the densest coverage of the chemical space. The smaller data sources, while covering a relatively large area of space, display more outliers that have few or zero close neighbors. Quantitatively, the mean Tanimoto similarities to nearest neighbor (based on RDKit fingerprints) are 0.760, 0.917, 0.937, 0.891, and 0.884 for the CDEx, ChemFluor, Deep4Chem, DSSCDB, and DyeAgg sets, respectively. Histograms of the pairwise Tanimoto similarities for each dataset are shown in Fig. S3.†
Fig. 3 UMAP of Morgan fingerprints by data source. A UMAP dimensionality reduction on the Morgan fingerprints shows a difference in the coverage and density of each dataset in chemical space. |
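The nearest-neighbor Tanimoto statistic quoted above reduces to a few lines of code. The sketch below represents each fingerprint as a set of on-bit indices (in this work the fingerprints come from RDKit; any bit-set representation works), and the example fingerprints are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_nn_similarity(fps):
    """Mean Tanimoto similarity of each fingerprint to its nearest neighbor."""
    sims = []
    for i, fp in enumerate(fps):
        others = [tanimoto(fp, g) for j, g in enumerate(fps) if j != i]
        sims.append(max(others))
    return sum(sims) / len(sims)

# Hypothetical fingerprints: the first two are near-duplicates, the third an outlier.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
score = mean_nn_similarity(fps)
```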
We used TD-DFT to calculate the vertical excitation energies for 10947 molecules in vacuum and 6707 molecules with solvent corrections corresponding to the solvent measurements available in the experimental dataset. Since some molecules were measured in multiple solvents experimentally, the total number of experimental measurements with a corresponding vacuum TD-DFT calculation was 19409. The results of these calculations are compared to experiments in Fig. 4. Solvent-corrected TD-DFT results have a smaller raw error than vacuum calculations with respect to the experimental ground truth. However, after fitting a linear regression to both sets of calculations, the residual error of the vacuum calculations is lower. Because the error of the vacuum calculations is therefore largely systematic and correctable, their results are suitable to use as features for building our models.
Fig. 4 TD-DFT calculations in vacuum and solvent vs. experiments. (Left) Vertical excitation energy with maximum oscillator strength from solvent-corrected TD-DFT versus peak wavelength of maximum absorption from experiment, (Center) vertical excitation energy with maximum oscillator strength from vacuum TD-DFT versus peak wavelength of maximum absorption from experiment, (Right) vertical excitation energy with maximum oscillator strength from solvent-corrected TD-DFT versus vertical excitation energy with maximum oscillator strength from vacuum TD-DFT. In each plot, MAEreg refers to the adjusted MAE value after performing a simple linear regression. Similar plots are shown in Fig. S4† with data points across all three plots corresponding to the same measurements. The linear regression equations for each plot above are as follows: (Left) λexpt,solv = 1.69λtddft,solv − 238.24, (Center) λexpt,solv = 1.82λtddft,vac − 226.82, and (Right) λtddft,solv = 1.18λtddft,vac − 32.86. |
Although the combination of vacuum TD-DFT and a simple linear model performs well with respect to the experimental ground truth (MAE = 24.9 nm), the computational cost of TD-DFT may limit its applicability on large datasets. Nevertheless, this sets a baseline for the accuracy of computational methods in general when predicting this property.
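The vacuum-TD-DFT-plus-linear-regression baseline amounts to a one-line correction using the center-panel coefficients reported in the Fig. 4 caption. The coefficients below are copied from that caption and apply only to this level of theory and dataset; other methods would require refitting.

```python
def correct_vacuum_tddft(lambda_tddft_vac_nm):
    """Map a vacuum TD-DFT peak wavelength (nm) to a predicted experimental
    peak wavelength in solution (nm), using the fitted regression
    lambda_expt,solv = 1.82 * lambda_tddft,vac - 226.82."""
    return 1.82 * lambda_tddft_vac_nm - 226.82

pred = correct_vacuum_tddft(400.0)
```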
We trained a D-MPNN and FFNN on 80-10-10 random splits of 28772 vertical excitation energies from our full set of vacuum TD-DFT calculations. This model achieved a test MAE of 0.12 eV (14.99 nm), and the predictions are shown in Fig. S12.† This model became the auxiliary model used in the ChempropMultiFidelity approach for the remainder of this work.
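For reference, the relationship between the two error units quoted above follows from E = hc/λ with hc ≈ 1239.84 eV nm; a first-order propagation shows that a 0.12 eV error corresponds to roughly 15 nm near a representative excitation energy of about 3.15 eV (the 3.15 eV anchor is our illustrative assumption, not a value from the paper).

```python
HC_EV_NM = 1239.84  # hc in eV*nm

def ev_to_nm(energy_ev):
    return HC_EV_NM / energy_ev

def nm_to_ev(wavelength_nm):
    return HC_EV_NM / wavelength_nm

def energy_error_to_wavelength_error(de_ev, e_ev):
    """First-order propagation: |d(lambda)| ~= hc * |dE| / E**2."""
    return HC_EV_NM * de_ev / e_ev ** 2
```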
The three ML methods we benchmarked are all able to match or exceed the accuracy of the linear regression on the vacuum TD-DFT result, but their ability to do so is heavily influenced by the strategy used to split the data into training, validation, and test sets (Fig. 5). All three methods achieve a MAE under 10 nm when splitting by dye–solvent pair. For GBRT, this is similar to what Ju et al. observed in their predictions on the ChemFluor dataset (test set MAE of 10.46 nm). Performance worsens to MAEs of 13–21 nm when splitting by dye molecule and to 18–27 nm when splitting by dye scaffold. ChempropMultiFidelity achieved a test RMSE of 27.47 nm using the SolventMPNN representation and scaffold splits. This outperforms the graph-convolutional networks (GCN) approach of Joung et al., who reported a test RMSE of 31.6 nm on random splits. Across all three splitting strategies, ChempropMultiFidelity and Chemprop perform better than GBRT. ChempropMultiFidelity performs better than Chemprop on the two more rigorous splitting strategies. This indicates that the Chemprop and ChempropMultiFidelity methods have generalizability superior to that of GBRT.
Ju et al. report that when they partitioned their data into training and test sets based on dye molecules rather than dye–solvent pairs, their test MAE in emission peak wavelength increased only slightly from 14.09 nm to 15.25 nm, but they did not report similar numbers for absorption predictions. Joung et al. do not report results for splitting by dye molecules, and neither reports results when splitting by dye scaffold. Splitting more rigorously results in wider error and ensemble variance distributions (Fig. S9†), but this is a better assessment of the ability of the model to generalize and is thus more reflective of performance on de novo design tasks. When using the random splitting strategy, dye molecules that appear in both the training and test sets in different solvents have narrower error and ensemble variance distributions than those that only appear in the test set (Fig. S10†). As shown in Fig. S11,† we can also compare the splitting strategies by calculating the similarity (based on Morgan fingerprints or latent space coordinates) of each molecule in the test set to its nearest neighbor in the training set and plotting the test set error as a function of this similarity. When the similarity scores (which range from 0 to 1) are grouped into bins of size 0.1, this illustrates the error distributions as a function of similarity. The maximum prediction error should be lowest for the bin of test molecules that are most similar to the training set, but this was not true for the random splitting strategy, which indicates that the model may be relying too much on the training data and failing to learn the solvent effect. In contrast, the more rigorous splitting strategies exhibited the expected behavior.
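A scaffold split in the spirit described above can be sketched with RDKit's Murcko-scaffold utility, grouping all dyes that share a scaffold so that each group lands entirely in one partition. This is a minimal sketch; the exact grouping and train/validation/test assignment used in this work may differ.

```python
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_groups(smiles_list):
    """Group SMILES strings by Murcko scaffold; each group must be assigned
    to a single split (train, validation, or test) as a unit."""
    groups = defaultdict(list)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol) if mol else ""
        groups[scaffold].append(smi)
    return groups

# Benzene and toluene share the scaffold c1ccccc1; ethanol has no ring scaffold.
groups = scaffold_groups(["c1ccccc1", "Cc1ccccc1", "CCO"])
```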
The improved generalizability of the two D-MPNN approaches over the GBRT method may be the result of the automatically-learned dye representations. ChempropMultiFidelity outperforms Chemprop because in addition to this automatically-learned representation, it also incorporates additional physical knowledge through the inclusion of a predicted TD-DFT value in the learned embedding. It should be emphasized that this predicted TD-DFT value comes from an additional Chemprop model rather than an actual TD-DFT calculation, so there is no need to perform an additional calculation to predict on an unseen dye molecule. The ChempropMultiFidelity method achieves a MAE of 18.3 nm on scaffold splits, an improvement over the TD-DFT plus linear regression approach at a fraction of the cost.
We compared our ChempropMultiFidelity approach of incorporating a TD-DFT feature predicted by an auxiliary model to using true TD-DFT values. As shown in Fig. S13,† the detrimental effect of the noise introduced using the D-MPNN TD-DFT feature may be outweighed by the dramatic savings in computational cost and time from no longer needing to perform a new TD-DFT calculation for an unseen molecule.
In our comparison of four different solvent representations, we found that none substantially and consistently outperform the others. It is necessary to represent the solvent in some way to achieve good predictions, but the CGSD, Minnesota descriptors, Morgan fingerprints, and SolventMPNN approaches all achieve similar results. The Morgan fingerprint and SolventMPNN approaches do, however, have the advantage that they are computable for any solvent since they are not restricted to look-up tables as are the CGSD and Minnesota methods.
We used several additional datasets to further evaluate the performance of the three dye representations. ChempropMultiFidelity and Chemprop outperform GBRT on the ChemFluor, DSSCDB, and DyeAgg datasets and achieve MAEs of 17–23 nm, as shown in Fig. 6. All models perform substantially worse (55–62 nm) on the CDEx dataset, and the inclusion of the TD-DFT feature degrades performance. This may be a result of errors introduced by the automatic extraction method used to construct this dataset. The results of the aforementioned and additional experiments across different combinations of molecular and solvent representations and datasets are reported in terms of MAE, RMSE, and R2 in the ESI Fig. S5–S8 and Tables S4–S12.†
We explored the effect of combining datasets together and performed 5-fold cross-validation to draw more rigorous conclusions. After observing the exceptionally poor performance of all models on the CDEx dataset, we excluded it from the combined dataset. The GBRT method was excluded from this analysis to compare the Chemprop and ChempropMultiFidelity models using the D-MPNN solvent representation (which cannot be integrated into the GBRT method). ChempropMultiFidelity achieved a mean MAE, RMSE, and R2 of 27.78 nm, 47.13 nm, and 0.80, respectively. These scores outperformed Chemprop in all metrics, but there was overlap in the standard errors of all three scores. The complete results are shown in Table 4. The prediction errors on this larger, combined dataset are larger than those on the smaller datasets because of the different coverage of chemical space represented within each dataset. This is illustrated in Fig. S14–S18,† which show the results of training and predicting on different datasets. These results indicate that the ChemFluor and Deep4Chem datasets are relatively similar to one another, as are the DSSCDB and DyeAgg. Combining all four datasets together results in an inhomogeneous chemical space and thus lowers performance.
Model | MAE (nm) | RMSE (nm) | R 2 |
---|---|---|---|
Chemprop | 30.23 ± 5.62 | 52.08 ± 12.19 | 0.75 ± 0.08 |
ChempropMultiFidelity | 27.78 ± 5.07 | 47.13 ± 11.10 | 0.80 ± 0.07 |
Having demonstrated the effectiveness of the D-MPNN models for modeling the peak wavelength of maximum absorption, we used dimensionality reduction to examine the interpretability of these models. We extracted the fingerprint embeddings from one of the Chemprop models and applied the t-SNE algorithm to reduce them to two dimensions. When these plots are colored by the experimental peak absorption wavelength and by the dye scaffolds present in the dataset, some patterns and clustering emerge, as shown in Fig. 7. The number of dye scaffolds that are known to absorb light at lower visible (green-violet) and ultraviolet wavelengths is much greater than those known to absorb in the red-yellow range of the spectrum. This is apparent in the t-SNE plots, as the red points clustered on the right of the scaffold plot (corresponding to the BODIPY dye family) comprise nearly all of the red-green points on the wavelength plot.
Fig. 7 MPNN Embedding Interpretability through t-SNE. (Left) t-SNE plot (perplexity = 100) of molecule D-MPNN embeddings from the scaffold-split test set of the Deep4Chem dataset, using the Chemprop molecule representation and SolventMPNN solvent representation, colored by the experimental peak wavelength of maximum absorption. Colors outside the visible spectrum are shown as black. (Right) Same as (Left), but colored by dye family scaffold. The same plots are shown in Fig. S20† for a UMAP dimensionality reduction. |
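The dimensionality reduction in Fig. 7 can be reproduced in outline with scikit-learn. The sketch below uses random stand-in embeddings; the real inputs would be the fingerprint embeddings extracted from the trained Chemprop model.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for D-MPNN fingerprint embeddings (n_molecules x embedding_dim).
embeddings = rng.normal(size=(500, 300))

# Reduce to 2-D at perplexity 100, as in Fig. 7.
coords = TSNE(n_components=2, perplexity=100, init="pca",
              random_state=0).fit_transform(embeddings)
# coords can then be scatter-plotted and colored by wavelength or dye scaffold.
```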
While the accuracy of a model is very important, the ability to quantify uncertainty in predictions can greatly increase model utility. The level of confidence in a prediction or set of predictions can be used to motivate the selection of candidates that are most likely to succeed in experimental validation or to inform the choice of new measurements that will improve the model through active learning techniques. There are a plethora of methods available for estimating the uncertainty in NN models, and reviews of these methods have not found one method that consistently outperforms the others across datasets and evaluation metrics.55–57 However, one approach that is often used because of its ease of implementation is to treat the ensemble variance as a measure of epistemic uncertainty. We evaluated the effectiveness of this method for quantifying the uncertainty in our Chemprop model. The parity plot in Fig. 8 shows that the variance of an ensemble of five models is indeed high for many of the predictions that fall far away from the parity line. However, closer examination by plotting the square root of the ensemble variance versus the absolute prediction error gives a much more sobering view of the effectiveness of this uncertainty metric. In fact, the Spearman rank correlation for this set of predictions and uncertainties is only 0.52, suggesting that one should not necessarily consider the rank order of the prediction uncertainties to be a good approximation of the rank ordering of the prediction errors.
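The ensemble-variance diagnostic described above reduces to a few lines: compute the per-molecule spread of an ensemble's predictions, then rank-correlate that spread with the absolute error. The numbers below are illustrative placeholders, not values from this work.

```python
import numpy as np
from scipy.stats import spearmanr

# Rows: ensemble members; columns: molecules (predicted peak wavelengths, nm).
preds = np.array([[510., 432., 388.],
                  [498., 441., 395.],
                  [505., 436., 380.]])
y_true = np.array([500., 430., 410.])  # experimental ground truth (nm)

ens_mean = preds.mean(axis=0)
ens_std = preds.std(axis=0)            # square root of ensemble variance
abs_err = np.abs(ens_mean - y_true)

# Spearman rank correlation between uncertainty and actual error.
rho, _ = spearmanr(ens_std, abs_err)
```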
Our best method (ChempropMultiFidelity) included a model trained on the results of previous TD-DFT calculations, and we used the predictions of this model as inputs to a second model that accounted for the solvent to predict the experimental peak wavelength. This multi-fidelity approach improved the model generalizability and improved performance for more rigorous splitting strategies. Our best model achieved a MAE of less than 7 nm on a held-out test set from a random split of dye–solvent pairs in the Deep4Chem dataset, and near 14 nm and 19 nm when splitting by dye molecule and dye scaffold, respectively. This is substantially better than the predictions of TD-DFT calculations alone, and at much lower cost. Our multi-fidelity approach has the advantage that the lower-fidelity data can cover a larger area of chemical space than the higher-fidelity data. Future work should compare this approach to additional methods for training ML models on multi-fidelity data, such as transfer learning (e.g. Fig. S19†), imputation, Δ-ML, and multi-target weighted-loss-function approaches.
D-MPNN approaches perform well across many of the largest publicly-available datasets of the peak absorption wavelength. Additionally, the ChempropMultiFidelity model outperformed Chemprop on a union of the four largest datasets. It achieved a MAE of 27.78 ± 5.07 nm with 5-fold cross validation on scaffold splits of this combined dataset.
We also compared several solvent representations and found that CGSD, Morgan fingerprints, Minnesota descriptors, and D-MPNN fingerprint representations all performed similarly. The Morgan fingerprint and D-MPNN approaches may be advantageous, however, because they are not restricted to look-up tables and can therefore be applied to any solvent. Future work could also use nearest-neighbor imputation techniques to estimate CGSD or Minnesota descriptors that are not present in the look-up tables.
We demonstrated the qualitative interpretability of our D-MPNN models using dimensionality reduction on their latent space fingerprints, which showed some clustering based on dye scaffold and observed peak wavelength. We also showed that although ensemble variance can serve as a measure of the epistemic uncertainty in our D-MPNN model predictions, and the ensemble errors in this case are comparable in magnitude to the prediction errors, these uncertainties are not necessarily well correlated with the true prediction error on these datasets.
This work is a step toward methods to predict full absorption and emission spectra, and it can enable more rapid design of dye molecules with targeted optical properties.
Footnote
† Electronic supplementary information (ESI) available: Code, data, supporting figures, and supporting tables. See DOI: 10.1039/d1sc05677h |
This journal is © The Royal Society of Chemistry 2022 |