Zhi-Wen Zhao‡ab, Marcos del Cueto‡*a and Alessandro Troisia
aDepartment of Chemistry, University of Liverpool, Liverpool, L69 3BX, UK. E-mail: m.del-cueto@liverpool.ac.uk
bInstitute of Functional Material Chemistry, Faculty of Chemistry, Northeast Normal University, Changchun, 130024, Jilin, P. R. China
First published on 25th March 2022
We try to determine if machine learning (ML) methods, applied to the discovery of new materials on the basis of existing data sets, have the power to predict completely new classes of compounds (extrapolating) or perform well only when interpolating between known materials. We introduce leave-one-group-out cross-validation, in which the ML model is trained to explicitly perform extrapolations to unseen chemical families. This approach can be used across materials science and chemistry problems to improve the added value of ML predictions, instead of attempting extrapolations with ML models that were trained with a regular cross-validation. We consider as a case study the problem of the discovery of non-fullerene acceptors because novel classes of acceptors are naturally classified into distinct chemical families. We show that conventional ML methods are not useful in practice when attempting to predict the efficiency of a completely novel class of materials. The approach proposed in this work increases the accuracy of the predictions enough to enable at least the categorization of materials as performing above or below the median value.
The goal of this work is to assess the ability of ML to predict the efficiency of interesting energy materials from completely new families, and to offer a new method to do so. In this context, by completely new, we mean materials that belong to a chemical family that is not present when training the model, and that can be generated using chemical intuition, database searching or generative models.23 One can then use our methodology to screen these candidates and decide which ones are likely to perform best, reducing the number of candidates and accelerating the production of materials from new families. The methodology, in the most general terms, consists of constructing an ML model that is trained without any information on a new family of materials and assessing its quality in predicting the property of known elements of such a family. Note that here we refer to training as the process of finding the optimum hyperparameters through a specific validation method. A practical problem is that the definition of “new” is not mathematically precise, and the novelty of a material is related to (a combination of) electronic, geometric or synthetic features that cannot be captured by an algorithm, while they will appear as self-evident to any expert in the relevant scientific domain. The problem of predicting the target properties of data outside of the training domain is also often tackled with transfer learning, where a previously trained model is used as a starting point when predicting data in a new domain. We chose to study the ability of ML to explore new chemical space in the context of predicting novel non-fullerene acceptors for organic solar cells (OSCs).24–26 The topic is of significant contemporary interest, as the identification of non-fullerene acceptors is considered essential to develop a competitive OSC technology, and recent improvements have seen an almost three-fold increase in efficiency in the five years since the report of the non-fullerene electron acceptor ITIC.27,28 For this scientific problem, there are well-defined families of acceptors recognized by the community and used to categorize the recent advances in the field. We can, therefore, ask whether new families of non-fullerene acceptors could have been predicted without any information on any member of that family. In this work, we discuss how a conventional cross-validation results in an overoptimistic evaluation of models when they are eventually used to predict new classes of compounds. We aim to draw conclusions on the specific field of computer-aided discovery of OSC acceptors but also, more generally, on a practical approach to assess the usefulness of ML methods for more exploratory research. There have been other recent studies that evaluate ML models with out-of-sample tests in materials discovery.29,30 There has been similar work to predict out-of-sample reaction yields,31,32 and work in risk minimization applied to organic molecules to improve domain generalization.33 We introduce in this work a modification providing a simple framework to train models to perform extrapolations. This change is shown to significantly improve the accuracy of the model when predicting out-of-sample materials, with respect to models trained with a usual cross-validation.
The growing interest in non-fullerene acceptor devices34 has produced a large amount of valuable data. While there is no general standardization of the processing conditions used to measure the power conversion efficiency (PCE), and some of the experimental details are not always available, the experimental datasets appear to be sufficiently accurate to enable data science analysis, with a range of works reporting good predictive abilities. For instance, Haibo Ma's team collected 300 experimental data points for small-molecule OSCs and trained an ML model using 10-fold cross-validation, achieving a PCE prediction RMSE of approximately 1.2%.35 They also trained another model with a different database by leave-one-out (LOO) cross-validation, achieving a good prediction accuracy.22 Arindam Paul et al. used extremely randomized tree models to predict the HOMO energies of the HOPV dataset,36 to accelerate the screening process of OSCs.37 Similarly, Salvy P. Russo's group focused on the screening uses of ML by training ML models with DFT data to approximate properties of organic photovoltaic (OPV) materials.38 Jie Min et al. adopted five common algorithms for polymer/non-fullerene OSC devices using 10-fold cross-validation, with the best results achieved with the Random Forest method.39 The approach has been recently extended to the study of perovskite solar cells.40 Jeff Kettle's group used ML to analyze a dataset with 1850 OSC devices, and were able to identify which material properties play a major role in device stability and degradation.41
A word of caution when assessing the accuracy of ML models is that experimental datasets will likely have distribution biases that will affect the reliability of predictive ML models trained with those datasets.42 There have been recent approaches to correct these biases by, for example, re-introducing data of failed experiments,43 adopting unbiased design of experiments44 and using new frameworks and metrics.19,45 Even though advances in this area will undoubtedly be beneficial, here we are interested in the accuracy of ML models when predicting new classes of materials, even in the best-case scenario when the dataset is balanced and representative of the field.
The data set used in this work contains experimentally investigated small-molecule organic photovoltaics whose chemical structures and PCE values are collated from the literature, with detailed information in our previous report46 and a public repository.47 In this work, we have used two distinct types of features that have previously proven useful in predicting different properties of donor–acceptor pairs:46,48 (i) fingerprints (also referred to as chemical descriptors in other works) consisting of the Morgan fingerprints49 of donors and acceptors, and (ii) physical descriptors consisting of:
HOMO and LUMO energies of the donor, and the LUMO energy of the acceptor. The energies of the frontier orbitals are expected to affect the PCE of the solar cell, and they have been shown to improve the PCE prediction of ML models in organic solar cells.22
Reorganization energy of hole (for donor) and electron (for acceptor) transport. The reorganization energy is expected to correlate with the charge transport properties of the system.50
Sum of the oscillator strengths of the states below the UV range (3.54 eV) for donor and acceptor. This parameter measures the optical absorption of the molecule, and a high value is beneficial for photovoltaic activity.51
Measure of miscibility evaluated for both the donor and acceptor. We have approximated the miscibility of each molecule as the logarithm of the octanol–water partition coefficient, as a mixture of compounds with different hydrophobicity is more likely to segregate. The logarithm of the octanol–water partition coefficient was calculated with the XLOGP3 (ref. 52) method, which is commonly used for organic molecules,53 using the SwissADME54 web tool.
More details on these descriptors, why they were selected for this dataset, and how they compare against other descriptors can be found in ref. 46. We detail the level of theory used to calculate each descriptor in the ESI.† Our final database consists of 566 donor/acceptor pairs, 49 of which contain non-fullerene acceptors, and it is available as a stand-alone dataset.47 These 566 pairs contain a total of 33 distinct acceptor molecules, shown in the ESI,† which also includes the computational details used to obtain all features.
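As an illustration of how the fingerprint features could be generated, the following is a minimal sketch (not the authors' exact pipeline) that computes Morgan fingerprints with RDKit from SMILES strings; the fingerprint radius and bit length are illustrative choices, the example SMILES are placeholders, and the physical descriptors are assumed to be precomputed as described above. The Tanimoto similarity shown here is used later in the text to define distances between molecules.

```python
# Minimal sketch: Morgan fingerprints and Tanimoto similarity via RDKit.
# The radius/nBits settings and example SMILES are illustrative assumptions.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles, radius=2, n_bits=2048):
    """Return the Morgan fingerprint bit vector of a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)

# Placeholder molecules (any donor/acceptor SMILES would be used in practice)
fp_a = morgan_fp("c1ccccc1")        # benzene
fp_b = morgan_fp("c1ccc2ccccc2c1")  # naphthalene

# Tanimoto similarity between the two fingerprints (1 = identical bit vectors)
similarity = DataStructs.TanimotoSimilarity(fp_a, fp_b)

# Physical descriptors (frontier orbital energies, reorganization energies,
# oscillator strength sums, log P) are assumed precomputed, e.g. as a vector:
phys_descriptors = np.array([-5.1, -3.0, -3.9, 0.25, 0.20, 1.8, 2.1, 5.3, 6.0])
```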
In this work, we have used the kernel ridge regression (KRR)55 algorithm, which is commonly applied to organic molecule datasets by several authors6,56 and was used for the same dataset in ref. 46. We have also used k-nearest neighbors regression (k-NN)57 and support vector regression (SVR).58 The results in the manuscript were obtained with KRR, and we show in the ESI† how k-NN and SVR produce relatively similar results. These algorithms use the “distance” to the training data to predict the PCE of unknown data, and we give a more in-depth explanation in the ESI.† We chose these algorithms as they are easy to implement and are commonly used for this type of application in materials science.59 All these algorithms struggle when predicting the PCE of new families of compounds, which is why we propose a new training framework that can be applied to any algorithm and improves its extrapolation capabilities. Novel molecules are expected to be more distant in the parameter space, so we want to explore the ability of the algorithm and features to predict properties without nearby known structures. As described in ref. 48, the distance between two materials in physical descriptor space is calculated as the Euclidean distance between their descriptor values. The chemical similarity of materials in the database is evaluated via the Tanimoto index,60 which is obtained from the Morgan fingerprints to characterize how similar each molecule is to the others. The hyperparameters were optimized using a differential evolution algorithm,61 as implemented in SciPy.62 When training the model, the hyperparameters (including feature weights) are optimized, and their values for each case are shown in Tables S3–S5 in the ESI.† We show in Fig. 1 the workflow of this work: one first preprocesses the data to generate a suitable database (described in more detail in ref. 46), then trains the model by selecting a specific validation method to optimize the hyperparameters of the model (discussed in Sections II–IV), and finally deploys the model to screen candidate materials by their predicted PCE values (example shown in Section V).
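To make this setup concrete, below is a minimal sketch, under assumed names and bounds, of how a KRR model with a kernel built from a weighted combination of the two distance types could be trained with SciPy's differential evolution; the specific kernel form, the hyperparameter bounds and the 10-fold splitter are illustrative assumptions, not the exact implementation used in this work.

```python
# Minimal sketch: KRR on a precomputed kernel combining fingerprint (Tanimoto)
# and physical-descriptor (Euclidean) distances, with hyperparameters tuned by
# differential evolution. D_tani and D_eucl are assumed precomputed (n, n)
# distance matrices and y the PCE values; kernel form and bounds are assumptions.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold

def combined_kernel(D_tani, D_eucl, w, gamma):
    """Exponential kernel on a weighted sum of the two distance matrices."""
    return np.exp(-gamma * (w * D_tani + (1.0 - w) * D_eucl))

def cv_rmse(params, D_tani, D_eucl, y, splitter):
    """Cross-validated RMSE for a given set of hyperparameters."""
    gamma, alpha, w = params
    K = combined_kernel(D_tani, D_eucl, w, gamma)
    sq_errors = []
    for train, test in splitter.split(y):
        model = KernelRidge(alpha=alpha, kernel="precomputed")
        model.fit(K[np.ix_(train, train)], y[train])
        sq_errors.append((y[test] - model.predict(K[np.ix_(test, train)])) ** 2)
    return np.sqrt(np.mean(np.concatenate(sq_errors)))

# Example usage (hypothetical bounds for gamma, alpha and the distance weight w):
# result = differential_evolution(
#     cv_rmse, bounds=[(1e-3, 10.0), (1e-6, 1.0), (0.0, 1.0)],
#     args=(D_tani, D_eucl, y, KFold(n_splits=10, shuffle=True, random_state=0)))
# gamma_opt, alpha_opt, w_opt = result.x
```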
$$y = f(x; h, D) \qquad (1)$$
The hyperparameters, h, are normally found by cross-validation. Subsets A1, A2,… of D are selected. Indicating with D − Ak the set obtained by removing the subset Ak from D (with − indicating the exclusion operator), the hyperparameters are chosen to minimize the total square error:
$$E(h) = \sum_{k} \sum_{x_i \in A_k} \left[ y_i - f(x_i; h, D - A_k) \right]^2 \qquad (2)$$
It is common to construct the sets {Ak} as a random partition of the data in n subsets of equal size and the resulting method is known as n-fold cross-validation. Another common approach is to partition D in as many subsets {Ak} as the elements of D, with each subset containing just one element. This approach is known as leave-one-out (LOO) and corresponds to optimizing the ability of the function to predict a particular data point without any information on it.
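For illustration, here is a minimal generic sketch (with placeholder data) of how these two standard schemes are typically set up with scikit-learn's splitters; any regressor could stand in for the function f.

```python
# Minimal sketch of n-fold and leave-one-out splitting; X and y are placeholders.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))  # placeholder feature matrix
y = rng.normal(size=20)       # placeholder PCE values

# n-fold CV: {A_k} is a random partition of D into n subsets of equal size
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    pass  # fit f(x; h, D - A_k) on train_idx, accumulate the error on test_idx

# LOO: one subset A_k per element of D
for train_idx, test_idx in LeaveOneOut().split(X):
    pass  # each test set holds a single point, predicted without information on it
```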
If the data set is made of different families of related materials, the cross-validation methods above would favour the process of interpolation between data points. The subset D − Ak will always contain many elements of the same family and the optimization of the error in eqn (2) does not really reflect the ability of the function f to predict properties of a completely new set of materials. To emphasize this aspect, we refer to the results using this type of cross-validation as LOO-interpolation.
A naïve approach to deal with this issue would be to exclude all elements of a particular family An from the cross-validation, perform any form of cross-validation with the remaining data D − An, and evaluate the predictive ability of the resulting method on the elements of An. This approach provides a measure of how well a standard ML approach predicts the properties of new families of compounds if no element of that family was used in its training. For example, one could perform a LOO cross-validation excluding in turn families of molecules, and we refer to this elementary approach as LOO-extrapolation (a minimal sketch is given below). We show an example dataset with four distinct families in Fig. 2A, and we show in Fig. 2B how the data would be split with LOO-extrapolation to predict the values of one of these families. There have been similar approaches suggested recently, like the leave-one-cluster-out cross-validation,29 where the dataset is split into clusters and one tries to predict the values for each cluster that has been left out of the training set. Another approach is the k-fold-m-step forward cross-validation,30 which ranks data by their target property value and evaluates how well the model can predict target values outside of the training domain, i.e. the model is trained to perform extrapolations to candidates in a different range of the target property, not necessarily to candidates from a different domain in the feature space. The weakness of the naïve approach is that a function tuned for prediction within the set D − An will exploit the existence of elements of the same family within D − An, and it may not be the best function for predicting properties when no information on similar materials is available. The best chance to build a model able to perform accurate predictions on new families of materials is to train the model to do so.
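Before turning to that training scheme, here is a minimal sketch of the LOO-extrapolation baseline just described, under assumed hyperparameter names and bounds, and with a generic RBF-kernel KRR standing in for the combined-distance model used in this work.

```python
# Minimal sketch of LOO-extrapolation: hyperparameters tuned by ordinary LOO
# CV on D - A_n, then evaluated on the held-out family A_n. The RBF kernel and
# the bounds are illustrative assumptions.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import LeaveOneOut

def loo_error(h, X, y):
    """Total squared LOO error (eqn (2)) for hyperparameters h = (gamma, alpha)."""
    gamma, alpha = h
    sq = 0.0
    for tr, te in LeaveOneOut().split(X):
        model = KernelRidge(alpha=alpha, kernel="rbf", gamma=gamma)
        model.fit(X[tr], y[tr])
        sq += (y[te][0] - model.predict(X[te])[0]) ** 2
    return sq

def loo_extrapolation(X, y, groups, bounds=((1e-3, 10.0), (1e-6, 1.0))):
    """Predict each family A_n with a model tuned by LOO CV on D - A_n only."""
    preds = np.empty_like(y, dtype=float)
    for n in np.unique(groups):
        test, train = groups == n, groups != n
        res = differential_evolution(loo_error, bounds, args=(X[train], y[train]))
        gamma, alpha = res.x
        model = KernelRidge(alpha=alpha, kernel="rbf", gamma=gamma)
        model.fit(X[train], y[train])
        preds[test] = model.predict(X[test])
    return preds
```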
We partition D into subgroups containing chemically distinct families {An}. For each subgroup Am, we find the hyperparameters h that minimize the following error:
$$E_m(h) = \sum_{n \neq m} \sum_{x_i \in A_n} \left[ y_i - f(x_i; h, D - A_n - A_m) \right]^2 \qquad (3)$$
In essence, leaving out group Am, we consider in turn all the other groups An and compute the error in predicting elements in An without using the data in An (nor in Am). The scheme is illustrated in Fig. 2C. Minimizing eqn (3) with respect to h is equivalent to finding the function that is best at performing extrapolations. This approach simulates a situation where no element of the family Am has been discovered, and the remaining elements can be divided into distinct families. We call this method leave-one-group-out (LOGO) cross-validation. The procedure can be repeated for every Am, each resulting in an optimal set of hyperparameters $h_{m,D}^{\min,\mathrm{LOGO}}$. The RMSE error can then be defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{|D|} \sum_{m} \sum_{x_i \in A_m} \left[ y_i - f(x_i; h_{m,D}^{\min,\mathrm{LOGO}}, D - A_m) \right]^2} \qquad (4)$$
We refer to this use of a LOGO cross-validation to train the ML model to perform extrapolations as LOGO-extrapolation (see Fig. 2C). Note that this training can be applied to any ML algorithm, and it is ultimately based on a cross-validation approach that mimics the discovery of novel material classes to overcome the inherent ‘leakage’ of information that is present in other cross-validation methods. We are effectively optimizing the model's hyperparameters so that they are good at generalizing predictions to novel groups. Similar goals of optimizing the training process for a specific task can be found in the meta-learning of neural networks.63,64
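The following is a minimal sketch of the LOGO-extrapolation loop of eqns (3) and (4), again with an assumed RBF-kernel KRR standing in for the actual combined-distance model and with illustrative hyperparameter bounds; `groups` labels each donor/acceptor pair with its chemical family.

```python
# Minimal sketch of LOGO-extrapolation (eqns (3) and (4)): for each held-out
# family A_m, hyperparameters are tuned to best predict every *other* family
# A_n from data excluding both A_n and A_m, and only then is A_m predicted.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.kernel_ridge import KernelRidge

def fit_predict(h, X_tr, y_tr, X_te):
    gamma, alpha = h
    model = KernelRidge(alpha=alpha, kernel="rbf", gamma=gamma)
    model.fit(X_tr, y_tr)
    return model.predict(X_te)

def logo_error(h, X, y, groups, m):
    """Eqn (3): error over each group A_n (n != m), trained on D - A_n - A_m."""
    sq = 0.0
    for n in np.unique(groups):
        if n == m:
            continue
        test = groups == n
        train = (groups != n) & (groups != m)
        sq += np.sum((y[test] - fit_predict(h, X[train], y[train], X[test])) ** 2)
    return sq

def logo_extrapolation_rmse(X, y, groups, bounds=((1e-3, 10.0), (1e-6, 1.0))):
    """Eqn (4): RMSE over all families, each predicted with its own h_m."""
    sq, n_total = 0.0, 0
    for m in np.unique(groups):
        res = differential_evolution(logo_error, bounds, args=(X, y, groups, m))
        test, train = groups == m, groups != m
        sq += np.sum((y[test] - fit_predict(res.x, X[train], y[train], X[test])) ** 2)
        n_total += int(np.sum(test))
    return np.sqrt(sq / n_total)
```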
| Validation | Features | RMSE (%) | r |
|---|---|---|---|
| LOO-interpolation | Fingerprints | 1.75 | 0.69 |
| LOO-interpolation | Physical descriptors | 2.01 | 0.56 |
| LOO-extrapolation | Fingerprints | 3.52 | 0.08 |
| LOO-extrapolation | Physical descriptors | 4.11 | 0.17 |
| LOGO-extrapolation | Fingerprints | 3.77 | 0.07 |
| LOGO-extrapolation | Physical descriptors | 2.84 | 0.31 |
A preliminary assessment of the data is offered by comparing the RMSE, as shown in Table 1. In this table, we can see how the correlation coefficient is very low and the RMSE is large when using the fingerprints with either LOO-extrapolation or LOGO-extrapolation, indicating that extrapolating based on this information alone is more challenging. We can also see how LOGO-extrapolation presents a clear improvement over LOO-extrapolation. The best performance obtained with LOO-extrapolation results in an RMSE of 3.52%, and LOGO-extrapolation improves the RMSE to 2.84% (a relative improvement of 19%). Similarly, the best correlation obtained with LOGO-extrapolation (r = 0.31) is also significantly larger than the correlation obtained with LOO-extrapolation (0.08–0.17). To show that this improvement stems from using a grouping with chemically distinct families, we present in Section S6.2 of the ESI† a comparison of LOO-extrapolation and LOGO-extrapolation when another grouping is used. The best RMSE achieved with LOGO is still far from the one obtained with a regular LOO cross-validation (interpolating known data), and it results in a coefficient of determination of R2 = 0.04, which makes it unsuitable for quantitative predictions. However, researchers are not necessarily interested in a model that can make accurate quantitative predictions, and a model that can perform a binary classification separating candidate materials into well-performing and poorly performing can be equally helpful.
We show in Fig. 4 the PCE distribution of the complete dataset and the distribution for the 45 points corresponding to the non-fullerene materials classified into one of our five chemical groups, which we will try to predict. In Fig. 5, we show our PCE prediction of these 45 points when using LOO-extrapolation and LOGO-extrapolation, as well as the experimental PCE. One can clearly see how the LOGO-extrapolation distribution is much closer than the LOO-extrapolation one to the experimental PCE distribution (closer median and lower/upper quartile values), although both models struggle to predict high PCE values.
We have already mentioned the overall smaller RMSE and larger correlation (see Table 1) when adopting LOGO-extrapolation. However, we can go one step further and quantify how advantageous LOGO-extrapolation is with respect to LOO-extrapolation when trying to identify materials over a certain threshold. We have used three different thresholds: the lower quartile (Q1PCE), the median (μPCE) and the upper quartile (Q3PCE), chosen as statistically meaningful values to judge the ability of the model to make qualitative classifications (the threshold values are shown in Fig. 5). These thresholds allow us to judge how the model performs for classifications of increasing difficulty:
(i) Identify materials with PCE > Q1PCE. This simple threshold allows us to evaluate how well the model would do in identifying the worst-performing materials.
(ii) Identify materials with PCE > μPCE. This threshold splits the data evenly, and we can classify candidate materials as well-performing or poorly performing, reducing the number of possible candidates by half.
(iii) Identify materials with PCE > Q3PCE. This threshold is more challenging, and allows us to quantify how many materials in the top 25% of our dataset are correctly predicted within that range.
Each predicted PCE can be classified as true positive (TP), false positive (FP), true negative (TN) or false negative (FN), as shown in Table 2.
| | Predicted PCE > threshold | Predicted PCE < threshold |
|---|---|---|
| Actual PCE > threshold | True positive (TP) | False negative (FN) |
| Actual PCE < threshold | False positive (FP) | True negative (TN) |
We can directly measure the accuracy of our model by calculating the probability of making a correct prediction:
$$A = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} \qquad (5)$$
We also calculate the precision, P, as the fraction of predicted well-performing materials that are actually well-performing,
$$P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \qquad (6)$$
and the recall, R, as the fraction of actual well-performing materials that are correctly identified as such:

$$R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \qquad (7)$$
These metrics are common for this type of binary classification,79,80 and they are often combined into a single metric, the F1-score (their harmonic mean), which we use as another indicator of the classification accuracy:
$$F_1 = \frac{2PR}{P + R} \qquad (8)$$
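As a worked illustration, the following sketch computes the confusion-matrix counts of Table 2 and the metrics of eqns (5)–(8) for a given threshold; the function name and the usage example are hypothetical.

```python
# Minimal sketch: threshold-based classification metrics (eqns (5)-(8)).
import numpy as np

def classification_metrics(y_true, y_pred, threshold):
    """Confusion-matrix counts and accuracy/precision/recall/F1 at a threshold."""
    tp = int(np.sum((y_pred > threshold) & (y_true > threshold)))
    fp = int(np.sum((y_pred > threshold) & (y_true <= threshold)))
    tn = int(np.sum((y_pred <= threshold) & (y_true <= threshold)))
    fn = int(np.sum((y_pred <= threshold) & (y_true > threshold)))
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # eqn (5)
    precision = tp / (tp + fp) if (tp + fp) else 0.0      # eqn (6)
    recall = tp / (tp + fn) if (tp + fn) else 0.0         # eqn (7)
    f1 = (2 * precision * recall / (precision + recall)   # eqn (8)
          if (precision + recall) else 0.0)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn, "accuracy": accuracy,
            "precision": precision, "recall": recall, "F1": f1}

# e.g. metrics at the median threshold:
# classification_metrics(pce_exp, pce_pred, np.median(pce_exp))
```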
We show the results for the two thresholds PCE > Q1PCE and PCE > μPCE in Table 3. For the lowest threshold of PCE > Q1PCE, it is worth noting how LOO-extrapolation results in a mediocre accuracy and F1-score (A = 0.33 and F1 = 0.29). The precision is relatively high (P = 0.67), but the recall is quite low (R = 0.18), which means that we are missing most of the actual well-performing materials. However, when we use a LOGO cross-validation, both the accuracy and F1-score are significantly larger (A = 0.80 and F1 = 0.87), and the number of false negatives is reduced further, resulting in a larger recall (R = 0.88). In other words, if no special care is taken, an ML algorithm trained with a regular cross-validation performs very poorly when trying to predict new classes of molecules. However, a suitable cross-validation, like the one proposed in LOGO-extrapolation, can significantly improve the predictive power of the ML model. When we use a higher threshold, PCE > μPCE, we observe a similar trend, although now both LOO-extrapolation and LOGO-extrapolation result in the same accuracy (A = 0.51) and a similar precision (P = 0.50 and P = 0.53, respectively). However, LOGO-extrapolation still results in a much larger recall and F1-score (R = 0.35 and F1 = 0.42) when compared to LOO-extrapolation (R = 0.09 and F1 = 0.15). When we use the most challenging threshold of PCE > Q3PCE, both LOO-extrapolation and LOGO-extrapolation struggle, and neither is able to correctly predict any of the points in that interval. Therefore, it seems that this approach is advantageous for identifying low-performing materials, but is unable to correctly identify the highest-performing ones. We show in Fig. S2 in the ESI† all real and predicted values when using LOO-extrapolation and LOGO-extrapolation.
| Metric | PCE > Q1PCE (LOO-extrap.) | PCE > Q1PCE (LOGO-extrap.) | PCE > μPCE (LOO-extrap.) | PCE > μPCE (LOGO-extrap.) |
|---|---|---|---|---|
| TP | 6 | 30 | 2 | 8 |
| FP | 3 | 5 | 2 | 7 |
| TN | 9 | 6 | 21 | 15 |
| FN | 27 | 4 | 20 | 15 |
| Accuracy | 0.33 | 0.80 | 0.51 | 0.51 |
| Precision | 0.67 | 0.86 | 0.50 | 0.53 |
| Recall | 0.18 | 0.88 | 0.09 | 0.35 |
| F1-score | 0.29 | 0.87 | 0.15 | 0.42 |
We excluded molecules that belonged to any of the known classes of non-fullerene acceptors in our dataset. Fig. 7 reports nine molecules, along with their predicted PCE when combined with the best-performing donor in our database (ZnP-TBO82). We can see how the PCE range of these molecules is similar to the range predicted for the other non-fullerene acceptor groups (see Fig. 5), and three of them have a predicted PCE above the median value for all other NFAs in the database, which suggests them as candidates of high interest. Additional criteria, such as cost and ease of synthesis from precursors, could also be considered, and domain knowledge would be particularly critical in the design phase.
Fig. 7 Nine molecules unknown to the model, and their suggested PCE using the LOGO-extrapolation framework, with KRR and physical descriptors. |
Footnotes
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d2dd00004k |
‡ These authors contributed equally to this work. |