Maitreyee Sharma Priyadarshini‡a, Oluwaseun Romiluyi‡a, Yiran Wanga, Kumar Miskinb, Connor Ganleya and Paulette Clancy*a
aDepartment of Chemical and Biomolecular Engineering, Johns Hopkins University, 3400 North Charles Street, Baltimore, 21218, Maryland, USA. E-mail: pclancy3@jhu.edu
bDepartment of Materials Science and Engineering, Johns Hopkins University, 3400 North Charles Street, Baltimore, 21218, Maryland, USA
First published on 13th November 2023
The lack of efficient discovery tools for advanced functional materials remains a major bottleneck to enabling advances in next-generation energy, health, and sustainability technologies. One main factor contributing to this inefficiency is the large combinatorial space of materials (with respect to material compositions and processing conditions) that is typical of such materials-centric applications. Searches of this large combinatorial space are often influenced by expert knowledge and clustered close to material configurations that are known to perform well, thus ignoring potentially high-performing candidates in unanticipated regions of the composition space or processing protocol. Moreover, experimental characterization or first-principles quantum mechanical calculations of all possible material candidates can be prohibitively expensive, making exhaustive approaches to determine the best candidates infeasible. As a result, there remains a need for computational algorithms that can efficiently search a large parameter space for a given material application. Here, we introduce PAL 2.0, a method that combines a physics-based surrogate model with Bayesian optimization. The key contribution of our proposed framework is the ability to create a physics-based hypothesis using XGBoost and neural networks. This hypothesis provides a physics-based “prior” (or initial beliefs) to a Gaussian process model, which is then used to search the material design space. In this paper, we demonstrate the usefulness of our approach on three material test cases: (1) discovery of metal halide perovskites with desired photovoltaic properties, (2) design of metal halide perovskite–solvent pairs that produce the best solution-processed films and (3) design of organic thermoelectric semiconductors. Our results indicate that PAL 2.0 outperforms other state-of-the-art methods in the efficiency with which it searches the material design space for the optimal candidate. We also demonstrate that the physics-based surrogate models constructed in PAL 2.0 have lower prediction errors for material compositions not seen by the model. To the best of our knowledge, there is no competing algorithm capable of this useful combination for materials discovery, especially for applications in which data are scarce.
New concepts
Materials discovery is currently enjoying a renaissance, thanks to the acceleration made possible by the application of machine learning tools. This paper presents a novel materials discovery algorithm based on a Bayesian optimization framework. The key novelty of our method, PAL 2.0, is the construction of a physics-based prior mean for the Gaussian process surrogate model. We achieve this in two steps: first, we use XGBoost to select the physical descriptors most correlated with the target property being optimized. Second, we use those selected physical descriptors as the input encoding vector to a neural network model that predicts the target property. This combination of XGBoost with neural networks provides a physics-based prior model of the material space to inform a Gaussian process model. The two most compelling contributions of PAL 2.0 are that we demonstrate superior optimization performance, finding the optimal target in the fewest iterations when compared to state-of-the-art models such as a representative off-the-shelf Bayesian optimization package, SMAC, as well as one-hot-encoded Gaussian process models for material discovery, and that we provide a predictive physics-based model of the material space capable of offering valuable chemical insights. Overall, PAL 2.0 offers great potential to advance the field of materials discovery, giving researchers and practitioners a powerful and easy-to-use tool to accelerate the development of materials for critical applications in energy, health, and sustainability.
Computational tools that can accelerate material discovery generally fall under one of three paradigms: (i) feature engineering, (ii) predictive models for molecular properties, and (iii) optimization algorithms (Fig. 1). Feature engineering refers to the extraction of correlations between variables using raw data. Such methods enable us to gain physical and chemical insights into material systems that can then inform material discovery tools. For example, in ref. 4, the authors used a modified version of the SISSO5 method to extract features that assist in the classification of solid-state materials into classes such as perovskites, spinels and rare-earth intermetallics. On the other hand, the availability of large data sets through sources like the Materials Genome Initiative6,7 and high-throughput quantum chemistry frameworks has led to the development of several neural network and machine learning models for molecular property prediction.8,9 Such models can be used to predict the properties of unknown materials and hence inform the material discovery process. However, while feature engineering and predictive modeling offer tools for material discovery, the challenge of navigating potential molecular combinations persists. This is where the third paradigm lies: methods that we call “optimization algorithms” or, more commonly, material discovery methods. These methods provide efficient search and optimization strategies to navigate the often overwhelming combinatorial space of material candidates.
A brute-force approach to finding the optimal elemental combination of a material for a given target (e.g., the best solar cell efficiency) involves randomly and exhaustively exploring the material space. This approach leverages neither information from previously explored candidates nor expert domain knowledge to improve its search strategy. Additionally, it is typically infeasible for all but small combinatorial spaces. Evolutionary methods, like Genetic Algorithms,12 have also been used for categorical domain optimization. However, such methods are locally exploitative and can therefore become trapped in locally (rather than globally) optimal regions.
In recent years, BayesOpt has become a widely adopted algorithm for the global optimization of black-box functions that are expensive to evaluate.13–16 It has been used to optimize a wide range of problems, including automatic algorithm configuration, automatic machine learning toolboxes, and optimization of combinatorial spaces for materials and drug discovery.10,17–24 A BayesOpt algorithm essentially requires two components: (i) a “surrogate model” for the objective function and (ii) an “acquisition function” that is updated, based on the surrogate model, to recommend the next candidate to explore. Typical examples of surrogate models include Gaussian processes,25 random forests26 and Bayesian neural networks.27 The most commonly used acquisition functions include probability of improvement,28 expected improvement,29 and upper confidence bound.30
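To make this loop concrete, the sketch below pairs a Gaussian process surrogate with an expected-improvement acquisition over a finite candidate pool. It is a minimal illustration of the generic BayesOpt recipe, not the PAL 2.0 implementation; the candidate pool, objective function and hyperparameters are all hypothetical.

```python
# Minimal BayesOpt loop: (i) GP surrogate, (ii) expected-improvement acquisition.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
X_pool = rng.uniform(0, 1, size=(200, 3))   # hypothetical descriptor vectors

def f(X):
    # Stand-in "expensive" objective (hypothetical); peak at x = (0.5, 0.5, 0.5)
    return -((X - 0.5) ** 2).sum(axis=1)

idx = list(rng.choice(len(X_pool), size=5, replace=False))  # initial observations
for _ in range(20):                                          # BayesOpt iterations
    X_obs, y_obs = X_pool[idx], f(X_pool[idx])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y_obs)                                     # (i) surrogate model
    mu, sd = gp.predict(X_pool, return_std=True)
    best = y_obs.max()
    z = (mu - best) / np.maximum(sd, 1e-9)
    ei = (mu - best) * norm.cdf(z) + sd * norm.pdf(z)        # (ii) acquisition: EI
    ei[idx] = -np.inf                                        # skip observed points
    idx.append(int(np.argmax(ei)))                           # next recommendation
print("best found:", f(X_pool[idx]).max())
```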
The application of BayesOpt in material discovery requires additional considerations. Unlike categorical optimization in machine learning and hyperparameter optimization, the notion of similarity and the representation of candidate choices become important in material science applications. Commonly used “one-hot-encoding” representations for chemical and material domains fail to capture the true physical and chemical similarity between candidate choices, as depicted schematically in Fig. 1 (top right). The drawback arises from using binary variables to depict a design choice: all design choices are equally similar to each other since any two distinct choices differ by the same Hamming distance. Another representation for materials involves providing compositional and structural information in terms of graphs.17 However, training models to accurately encode structural data requires large training data sets that are frequently unavailable. In reality, there is an obvious similarity between materials that have similar chemical and physical properties. Optimization strategies that can leverage physical and chemical information to determine similarity between candidate choices are expected to accelerate the material discovery process, as was demonstrated by Hase et al.10 through the innovative Gryffin method.
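The short example below illustrates this representation problem: under one-hot encoding every pair of distinct halides is equally distant, whereas a descriptor representation (the electronegativity and ionic-radius values below are illustrative) recovers the expected chemical similarity, e.g., that Br is closer to I than Cl is.

```python
# One-hot vs. physical-descriptor similarity for three halide choices.
import numpy as np

halides = ["Cl", "Br", "I"]
one_hot = np.eye(3)
# Hypothetical descriptor rows: [electronegativity, ionic radius / angstrom]
phys = np.array([[3.16, 1.81],
                 [2.96, 1.96],
                 [2.66, 2.20]])

def pairwise_dist(X):
    # Euclidean distance between every pair of rows
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

print(pairwise_dist(one_hot))  # all off-diagonal entries are identical
print(pairwise_dist(phys))     # Br-I distance is smaller than Cl-I distance
```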
In this paper, we have developed a new materials discovery algorithm, Physical Analytics pipeLine 2.0 (PAL 2.0), in which we leverage domain knowledge, appearing in the guise of chemical and physical properties, to develop surrogate models that are then used within a BayesOpt framework. The construction and workflow of PAL 2.0 are described in the following sections and in the ESI.†
The new approach, PAL 2.0, presented in this paper is a generalization of the original version of PAL.11 Both involve a Bayesian optimization framework that uses a physics-informed Gaussian process (GP) model. However, PAL 2.0 offers three key improvements and novel capabilities compared to its progenitor. The proposed framework involves: (i) descriptor selection for the search space based on decision trees, (ii) construction of a physics-based prior mean function using neural network (NN) models, and (iii) construction of a GP model using the NN prior mean function and subsequent use of this model in BayesOpt. Mathematical details of the model construction are provided in the Methods section (Section 4) and the overall workflow of the method can be seen in Fig. 2. Note that details of the nomenclature used in this work are given in the ESI† (Section S1).
As mentioned earlier, every material can be characterized in terms of its physical and chemical properties, but a priori knowledge of which properties matter most in optimizing the target variable is often lacking. By using XGBoost as part of the PAL 2.0 framework, we pick out the physical descriptors that are most representative of the material domain, making the search essentially unbiased by expert preconceptions, which, in many cases, are unavailable. The algorithm typically finds a small number of important properties that correlate with the chosen target (rather than just one) and, importantly, can autonomously determine their relative weighting in a manner that even an expert might be unable to do. In the PAL 2.0 workflow, the physical descriptors chosen by XGBoost become the input variable set for the GP surrogate model.
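A sketch of this selection step is shown below: gradient-boosted trees are fit to the available observations, and the descriptors with the highest importance scores are retained as GP inputs. The column names, toy data and importance cutoff are illustrative assumptions, not the exact PAL 2.0 settings.

```python
# Descriptor selection via XGBoost feature importances (illustrative sketch).
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
names = ["electronegativity", "ionic_radius", "ionization_energy",
         "electron_affinity", "dipole_moment", "molar_volume"]
X = rng.normal(size=(80, len(names)))                          # toy property table
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=80)  # toy target

model = xgb.XGBRegressor(n_estimators=300, max_depth=3, learning_rate=0.05)
model.fit(X, y)

# Rank descriptors by importance and keep the dominant ones as GP inputs.
ranked = sorted(zip(names, model.feature_importances_), key=lambda t: -t[1])
selected = [n for n, w in ranked if w > 0.10]                  # illustrative cutoff
print(ranked)
print("selected descriptors:", selected)
```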
When fitting GP models on scarce data, such as those encountered in materials discovery, the main challenge is to obtain suitable prior knowledge and encode it into the model, either through the kernel function or the mean function. In the machine learning literature, research has mainly focused on approximating the kernel function of the GP model using NNs.34 Training deep kernel functions, however, poses two main constraints: (i) they require large training datasets and (ii) they must be positive definite in order to define an inner product on the material search space. The mean function, in comparison, has no such constraints and can therefore be trained more easily to create a predictive GP prior. Furthermore, the assumption that all prior information can be encoded in the kernel function of the GP model when using a zero mean function (m(xD) = 0) does not always hold. For example, if the optimization landscape is such that the objective function is non-zero in some regions and zero in others, we can easily prescribe a prior mean function that encodes this information exactly, but we cannot ensure the same with a kernel function. Therefore, an informed prior mean function allows more flexibility and stronger guarantees in encoding prior knowledge. The contribution of our method is to create such a physics-based prior mean function using neural networks (NNs). Having a predictive and accurate description of the optimization landscape allows the acquisition function to find the optimal material in a few iterations. As a result, the NN ‘prior mean function’ ultimately boosts the performance of the Bayesian optimization step, as seen in the results.
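One minimal way to realize a GP with a non-zero prior mean, sketched below under that assumption, is to fit the GP to the residuals y − m(x) and add m(x) back at prediction time. The prior mean m here is a hypothetical stand-in callable; in PAL 2.0 it would be the trained NN described in the Methods section.

```python
# GP with a non-zero prior mean m(x): fit on residuals, add m back at predict time.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def gp_with_prior_mean(X_train, y_train, m):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5))
    gp.fit(X_train, y_train - m(X_train))       # GP models the residuals
    def predict(X):
        mu, sd = gp.predict(X, return_std=True)
        return m(X) + mu, sd                     # prior mean + GP correction
    return predict

# Usage with a hypothetical prior mean function:
m = lambda X: 0.5 * X.sum(axis=1)
X_train = np.random.rand(10, 3)
y_train = m(X_train) + 0.05 * np.random.randn(10)
predict = gp_with_prior_mean(X_train, y_train, m)
mu, sd = predict(np.random.rand(4, 3))
```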
We demonstrate the performance of the PAL 2.0 methodology on three material data sets relevant to energy applications.
1. Organic semiconductors:
(a) Target: electron affinity of the semiconducting polymer.
(b) Objective: maximize electron affinity.
(c) Possible combinations: 64.
2. Mixed B-site perovskites:8
(a) Target: band gap of perovskite crystal.
(b) Objective: minimize band gap.
(c) Possible combinations: 244.
3. Perovskite molecule–solvent binding energy:11
(a) Target: intermolecular binding energy of perovskite complex and solvent.
(b) Objective: maximize perovskite–solvent binding energy.
(c) Possible combinations: 240.
We also stress-test the method by assessing the effect of varying amounts of initial training data and by running PAL 2.0 on a very large dataset of approximately 70,000 covalent organic framework (COF) structures that find applications in methane storage.35 These stress tests are included in the ESI† (see Section S6). In the subsequent sections, we describe each data set, its source and the BayesOpt performance of the different methods. For each dataset, the prediction accuracy of the surrogate models (GP-0 and GP-NN) is compared using the mean squared error, which is computed as:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 \qquad (1)$$

where $N$ is the number of test candidates, $y_i$ is the true target value of the $i$-th candidate and $\hat{y}_i$ is the corresponding surrogate-model prediction.
We analyzed a small, DFT-generated subset of the organic semiconductor design space with four design choices for each species, leading to 64 unique combinations of the resulting p-doped semiconducting material. These data originate from a detailed study on three polymer segments, each differentiated by its Lewis basicity, backbone functionality, and solid-state microstructural attributes.38 Within this data set, PAL 2.0 was leveraged to identify which properties, if any, from a set of physical constants available from open-source databases39 and DFT calculations, are most important when optimizing a polymer–dopant–solvent system for the electron affinity (EA) of the doped complex. We selected the electron affinity as the target metric because p-doping is known to enable charge transfer if there is a sufficient offset between the polymer's HOMO/ionization potential (IP) and the dopant's LUMO/EA.40 Details on the DFT methods used to calculate the electron affinity for each combination are given in the ESI† (Section S3A).
PAL 2.0 achieved an optimal target value while only exploring, on average, roughly 30% of the design space, a modest improvement over existing optimization algorithms (see Fig. 3). The fraction of the space explored is determined as
$$f_{\mathrm{explored}} = \frac{N_{\mathrm{explored}}}{N_{\mathrm{total}}} \qquad (2)$$

where $N_{\mathrm{explored}}$ is the number of unique candidates evaluated before the optimal candidate is found and $N_{\mathrm{total}}$ is the total number of candidates in the design space.
Fig. 3 Performance of the GP-NN model on the material discovery of doped p-type organic semiconducting polymers. (A) Material design space. (B) Physical property representation of the material design features shown in (A); details of the physical properties are given in Table S1 (ESI†). (C) Predictive accuracy of the GP-NN model against state-of-the-art GP models commonly used in material discovery. (D) Superior Bayesian optimization performance of the GP-NN model (orange box) compared to state-of-the-art models (indicated by needing to explore less of the parameter space before successfully reaching its target). (E) Selection of important (well-correlated) physical properties chosen by the algorithm from the property space in (B).
The data set features 244 unique formulations, sourced from specific mixtures of three halide ions, five A-site cations, and six B-site cations. In terms of halides, the selection is made up of the routinely utilized chloride, bromide, and iodide ions. For the B-site, while the most commonly adopted species are lead (Pb) and tin (Sn), the data set also incorporates options to consider germanium (Ge), calcium (Ca), barium (Ba), and strontium (Sr). On the A-site front, the options consist of formamidinium (FA), methylammonium (MA), cesium (Cs), rubidium (Rb), and potassium (K). The design choices for each feature are depicted in Fig. 4. The “property basket” consists of the descriptor choices selected by Mannodi-Kanakkithodi et al., which were utilized in predicting the bandgap of the mixed perovskite species.8 Since the perovskites contain B-site alloys, properties for the B-site are given as a weighted average of the elemental physical properties, with the weights given by the elemental composition at the B-site (see the sketch below).
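The weighted-average construction is straightforward; a minimal sketch is shown below, assuming a two-element B-site alloy. The elemental property values are placeholders, not the descriptors used in the actual data set.

```python
# Composition-weighted B-site descriptors for a perovskite B-site alloy.
import numpy as np

# Hypothetical elemental property rows: [electronegativity, ionization energy / eV]
elemental = {
    "Pb": np.array([2.33, 7.42]),
    "Sn": np.array([1.96, 7.34]),
}

def b_site_descriptors(composition):
    """composition: dict mapping element -> mole fraction at the B site."""
    return sum(frac * elemental[el] for el, frac in composition.items())

# A 75/25 Pb/Sn alloy at the B site:
print(b_site_descriptors({"Pb": 0.75, "Sn": 0.25}))
```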
To pinpoint the combination with the lowest bandgap (MA, Pb, I), PAL 2.0 (GP-NN) proved highly efficient: on average, it explored just 11% of the available design space over 200 BayesOpt trials (see Fig. 4). In contrast, the random search explored 50% on average, and a GP model with a 0-prior mean choice (GP-0) searched 15% of the space. These trials used randomly initialized input data from 24 combinations, representing 10% of the entire space. Our feature engineering approach identified the electron affinity of the A-site cation as the most important descriptor in representing the design choices of this feature. For the B-site cation, the electron affinity and ionization energy of the ion were identified as important descriptors, but their importance paled in comparison to the electronegativity, which was identified as the most important differentiator between B-site cation design choices. Lastly, for the halide ions, the ionic radius and density were identified as the most relevant properties.
The examined data set consists of five key features: the solvent molecule, the choice of A-site cation (A), and the choices of the three halide ions (X, Y, Z). Together with a central lead ion, these form the Pb-A-XYZ perovskite structure. The A-site design choices include cesium (Cs), methylammonium (MA), and formamidinium (FA), while halide options consist of iodide (I), bromide (Br), and chloride (Cl). We provided eight solvent options based on a list of solvents commonly used for perovskite processing: tetrahydrothiophene 1-oxide (THTO), dimethyl sulfoxide (DMSO), dimethylformamide (DMF), N-methyl-2-pyrrolidone (NMP), gamma-butyrolactone (GBL), acetone, methacrolein (METHA), and nitromethane (NITRO).
We selected properties for each feature (X/Y/Z-halides, A-site cation and solvent) based on prior physical knowledge of their potential impact on the binding energy (our target variable). We explored four basic properties for the halide features: electronegativity, electron affinity, ionization energy, and ionic radius of the halide. For the A-site cations, we considered the ionic radius, binding enthalpy of DMF towards the A-site cation,33 the dipole moment and the number of potential hydrogen bonding atoms of the A-cation. Finally, for the solvent feature, we considered six properties: the Gutmann donor number (DN),1,44,45 Lewis acceptor number (AN),46 lithium cation affinity (LCA),47 dielectric constant,48 dipole moment and molar volume (MV) of the solvent molecule.
Using PAL 2.0's GP-NN, we identified the combination with the highest binding energy (Br, Cl, Cl, FA and THTO) by exploring 11% of the available design space. Comparatively, it took GP-0 with filtered property descriptors 13% of the design space to locate the optimal combination, and a one-hot-encoded (OHE) GP-0 model took 16% of the space to do so (with large variability in its convergence results); see Fig. 5. Other methods, like SMAC and HyperOpt, explored 20–40% of the design space before they were able to locate the best combination. Additionally, these benchmark methods showed large variability in their results over the BayesOpt trials.
For the property descriptors, there were no standout properties selected to represent the halide and cation features, each selected property having a similar level of importance. On the other hand, two standout properties (dielectric constant and donor number) were identified for the solvent feature to differentiate between the various solvent design choices. The choice of these properties is significant since the simulations that created this dataset were run with an implicit solvent, thereby making the dielectric constant an important differentiator for the solvent choices. Additionally, the ability of PAL 2.0 to discern significant features for the solvent through its XGBoost component exemplifies the system's capability to surpass expert selection. The initial version of PAL, as presented in Herbol et al.,11 utilized a physics-based prior reliant on expert-derived factors such as the dielectric constant and solvent density. In contrast, PAL 2.0 independently recognized and attributed significance to these same features, underscoring the advanced feature selection capabilities of XGBoost and its effectiveness in this context.49 Leveraging this capability, PAL 2.0 astutely pinpointed the Gutmann donor number (DN) as a critical property, a measure acknowledged for its efficacy in isolating potent solvents for the solution processing of metal halide perovskites.1,31,45,48,50–53
The novelty of our work lies, firstly, in the incorporation of important physical descriptors selected by the XGBoost algorithm to enhance the physical realism of our surrogate model and, secondly, in the construction of a physics-based prior mean using a neural network approach. The net result of these novel approaches is to leverage physical domain knowledge specific to the system of interest. Another advantage is that the descriptor selection performed by PAL 2.0 dispenses with the requirement that the user be an expert with an understanding of which descriptors/features are the most informative for the system. A semi-expert user is free to provide a list of descriptors that might be important, and the method will autonomously choose the most appropriate ones from that list. As a result, PAL 2.0 is able to find the optimal target objective faster (i.e., in fewer iterations) than many state-of-the-art optimization methods, including SMAC,54 Hyperopt,55 and Genetic Algorithms.12
The performance of PAL 2.0 is demonstrated on three material data sets, which include doped p-type organic semiconductors and perovskites. Both of these classes of materials show immense promise for the next generation of solar cells, but the algorithm itself is completely materials-agnostic and indeed can be used for applications well outside the realm of materials. Any application in which the parameter space is large enough to discourage a systematic search, or the data are sparse and/or expensive (i.e., tackling both ends of the data set size spectrum), and in which the features most closely correlated with the objective are largely unknown, is a suitable candidate for exploration using PAL 2.0. In each of the material cases we studied here, we have shown that PAL 2.0 outperforms all other methods we tested. Within the PAL 2.0 framework, the GP-NN model, which combines neural network assistance with BayesOpt, exhibits the best convergence and predictive capabilities.
Furthermore, the surrogate model constructed by PAL 2.0 provides valuable chemical insight into the material system, which can be transferred to learning domains outside of the training set. For example, for the doped p-type semiconducting polymers, of all the descriptors provided, the LUMO was selected by the method as the most important physical descriptor when optimizing for EA. This is consistent with previous findings which show that the dopant's EA is most correlated with its LUMO40 and that the EA of the dopant dictates the EA of the entire polymer–dopant complex.38 Additionally, earlier studies have underscored the significance of the Gutmann donor number (DN) and dielectric constant as pivotal descriptors for distinguishing solvents in the solution processing of metal halide perovskites.1,31,45,48,50–53 These findings show that the physics-based surrogate model, embedded with the necessary property descriptors, could be a great starting point for finding material candidates in similar domains with scarce data.
In summary, the PAL 2.0 approach exhibited the following advantages:
1. It outperforms, or is at least competitive with, the other BayesOpt approaches that we tested.
2. It can select physically relevant descriptors for the surrogate model and determine their relative weighting.
3. The test errors (MSE values) of the GP-NN surrogate model are lower than those of the other models, implying that GP-NN is more predictive.
4. Having a predictive model opens up the possibility of optimization over different ranges of target values for different applications where data are scarce.
5. It can initiate material discovery for a material system with as few as 25 observations from experiments or computation.
Each step of the algorithm is discussed in the following subsections.
$$\mu_i = m(x_D) + k(x_D, x_D^{(t)})\left(k(x_D^{(t)}, x_D^{(t)}) + \eta^2 I\right)^{-1}\left(y(x_D^{(t)}) - m(x_D^{(t)})\right)$$
$$v_i = k(x_D, x_D) - k(x_D, x_D^{(t)})\left(k(x_D^{(t)}, x_D^{(t)}) + \eta^2 I\right)^{-1} k(x_D, x_D^{(t)})^{T} \qquad (3)$$
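As a concrete illustration, the sketch below is a direct numpy transcription of the posterior equations above, where X holds the observed descriptor vectors $x_D^{(t)}$, y the observed targets, m the prior mean function and k the kernel; the noise level eta is an illustrative value.

```python
# GP posterior mean and variance with a non-zero prior mean, per eqn (3).
# A minimal sketch; a production implementation would use a Cholesky solve.
import numpy as np

def gp_posterior(X, y, X_test, m, k, eta=1e-2):
    K = k(X, X) + eta**2 * np.eye(len(X))      # k(x_D^(t), x_D^(t)) + eta^2 I
    K_star = k(X_test, X)                      # k(x_D, x_D^(t))
    alpha = np.linalg.solve(K, y - m(X))       # (...)^-1 (y - m) at observed points
    mu = m(X_test) + K_star @ alpha            # posterior mean
    v = k(X_test, X_test) - K_star @ np.linalg.solve(K, K_star.T)
    return mu, np.diag(v)                      # posterior variance per test point
```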
A popular prior mean function choice is the 0-mean function, m(xD) = 0. This prior mean function is useful when we have very few, or no, observations of our system: it helps prevent the ad hoc assumptions and biases that forcing a functional choice of prior mean might introduce into the model. However, in the case of material discovery, we often start with some observations of the space. In this work, we leverage these observations to pick out the most relevant physical descriptors, as discussed in Section 4.1.1, and to construct a prior mean function for the GP model. The novelty of our approach lies in the construction of a physics-based prior mean function (m(x)) for the GP model using a NN. The NN prior mean function takes the selected descriptors (D) as the input vector and predicts the optimization target as the output. We use a mean squared error loss function and the Adam optimizer61 to train the NN. Additionally, we use L1 and L2 regularization on the weights of the NN to prevent overfitting. Once the NN is trained, we obtain the prior mean function for the GP. The NN is trained using the initial set of observations and then kept fixed throughout the material discovery process. We refer to this physics-based GP model as the “GP-NN” model. An advantage of employing a NN prior mean function over a 0-prior mean function is the ability to harness available information through a model trained on the input data. Additionally, the NN prior creates a predictive model of the space with just a few observations and therefore improves the BayesOpt performance of the GP-NN surrogate model. However, a limitation of such a model is that it requires some initial data for training.
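A minimal PyTorch sketch of this training step is shown below; the layer widths, learning rate and penalty strengths are illustrative choices, not the exact settings used in PAL 2.0.

```python
# Training an NN prior mean: MSE loss, Adam optimizer, L2 via weight decay,
# and an explicit L1 penalty on the weights.
import torch
import torch.nn as nn

def train_prior_mean(X, y, epochs=500, l1=1e-4, l2=1e-4):
    net = nn.Sequential(nn.Linear(X.shape[1], 32), nn.ReLU(),
                        nn.Linear(32, 32), nn.ReLU(),
                        nn.Linear(32, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=l2)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        pred = net(X).squeeze(-1)
        loss = loss_fn(pred, y)
        loss = loss + l1 * sum(p.abs().sum() for p in net.parameters())
        loss.backward()
        opt.step()
    return net  # frozen afterwards and used as the prior mean m(x) in the GP

# X, y: torch.float32 tensors of selected descriptors and observed targets.
```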
Finally, we consider two materials to be similar if their physical descriptors (D) are similar. Here, we measure the similarity using a 5/2 Matérn kernel. Formally, a 5/2 Matérn kernel is used to estimate the covariance $k(x_D^{(1)}, x_D^{(2)})$ for the property descriptors representing the material design choices (see eqn (4)).
$$k(x_D^{(1)}, x_D^{(2)}) = \sigma^2\left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5r^2}{3\ell^2}\right)\exp\!\left(-\frac{\sqrt{5}\,r}{\ell}\right), \qquad r = \left\lVert x_D^{(1)} - x_D^{(2)} \right\rVert \qquad (4)$$

where $\ell$ is the lengthscale and $\sigma^2$ the output variance of the kernel.
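For completeness, a numpy transcription of the standard 5/2 Matérn form in eqn (4) is given below; it can serve directly as the k argument of the gp_posterior sketch above. The lengthscale and variance values are illustrative hyperparameters.

```python
# 5/2 Matern kernel between two sets of descriptor vectors, per eqn (4).
import numpy as np

def matern52(X1, X2, lengthscale=1.0, variance=1.0):
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)  # pairwise r
    s = np.sqrt(5.0) * d / lengthscale
    return variance * (1.0 + s + s**2 / 3.0) * np.exp(-s)
```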
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3mh01474f |
‡ These authors have contributed equally to this work. |