Seongmin Kim a, Juhwan Noh a, Geun Ho Gu b, Shuan Chen a and Yousung Jung *ac
a Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291, Daehak-ro, Yuseong-gu, Daejeon, 34141, South Korea
b School of Energy Technology, Korea Institute of Energy Technology, 200 Hyuksin-ro, Naju, 58330, South Korea
c School of Chemical and Biological Engineering, Institute of Chemical Processes, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, 08826, South Korea. E-mail: yousung.jung@snu.ac.kr
First published on 8th December 2023
While advances in computational techniques have accelerated virtual materials design, the actual synthesis of predicted candidate materials is still an expensive and slow process. Although a few initial studies have attempted to predict synthesis routes for inorganic crystals, the existing models do not prioritize their predictions and can produce thermodynamically unrealistic precursor chemicals. Here, we propose an element-wise graph neural network to predict inorganic synthesis recipes. The trained model outperforms the popularity-based statistical baseline model in the top-k exact match accuracy test, showing the validity of our approach for inorganic solid-state synthesis. We further validate our model with a publication-year-split test, in which the model trained on materials data up to the year 2016 successfully predicts synthetic precursors for materials synthesized after 2016. The high correlation between the probability score and prediction accuracy suggests that the probability score can be interpreted as a measure of confidence, which can be used to prioritize the predictions.
Thus, to reduce the time and cost associated with failed syntheses, efforts to understand the chemistry of materials synthesizability have been made. For example, the use of Goldschmidt's tolerance factor, a heuristic stability metric based on the ratio of ionic radii, was suggested to approximate the stability of double halide perovskites.11 For NASICON-structured materials, machine-learning derived stability rules based on the Na content, elemental radii, and electronegativities were also suggested.12 In addition to these heuristic rules for synthetic accessibility that are domain specific, several thermodynamic quantities obtained from electronic structure calculations have been widely used as a helpful estimate of synthesizability.13 For example, the energy above the convex hull (ΔEhull) was used as an important criterion to identify synthesizable photocatalysts14 and metastable inorganic materials.15 Decomposition enthalpies (ΔHd) were also used as another metric to evaluate solid stability.16 Moreover, to address the computational costs of these first-principles calculations, machine learning (ML) models have been proposed to estimate the materials thermodynamics.17–19 More recently, data-driven approaches were proposed to predict the synthesizability of unknown inorganic crystals based on their structural similarity20–22 or chemical composition23 to the already synthesized materials.
Beyond the synthetic feasibility predictions briefly described above, a few studies have attempted to suggest synthesis routes for inorganic materials. For example, by constructing the plausible reaction space for a target compound, a favorable synthetic pathway has been suggested based mainly on thermodynamic parameters and some kinetic heuristics.24 Nucleation barriers and phase competition metrics have also been used as another approach to identify favorable paths for inorganic materials.25 In addition to these thermodynamics-based approaches, several data-driven models have been proposed to generate precursors and synthetic conditions (e.g. heating temperature and time) for the target materials using text-mined meta-datasets.26–29 However, outcomes from the generative models used in the latter studies29 can contain thermodynamically unstable precursors and generally do not prioritize the results, leaving open the question of which results should be tried experimentally first. In the same vein, they provide no measure of confidence for the predicted reactions, so domain experts must additionally screen or rank the generated reaction recipes. While the previous study24 has shown the capability of suggesting intermediate reaction steps with priority for a net (overall) reaction (where both target and starting materials are given), a prioritized retrosynthetic prediction of starting materials (precursors) is still under-studied.
This status of inorganic retrosynthetic reaction prediction can be contrasted with molecular synthesis planning, where there are a number of models with promising prediction accuracy. Broadly, molecular retrosynthesis prediction models can be divided into two approaches, template-free30–32 and template-based33,34 methods. Using sequence-based or graph-based molecular representations, template-free methods have shown good performance without requiring a vast number of chemical reaction rules.30–32 Template-based models, on the other hand, use manually or automatically crafted reaction templates and rules extracted from reaction datasets.33,34 Since subtle changes in chemical environments can lead to very different reactivities and reaction outcomes, template-based approaches typically require a large number of reaction templates, of the order of 10000 to 20000, which still cannot cover and describe all the reactions. In contrast, most solid-state inorganic syntheses are performed using a finite list of commercial precursors, and in that sense, synthesis planning in solid-state chemistry may be considered much simpler, since the aim is then to select appropriate precursors from commercially available compounds, only of the order of 100 precursors. This difference suggests that the concepts used in successful organic retrosynthesis models may be borrowed and adapted to address the inorganic retrosynthesis problem. In particular, we envision that the set of popular inorganic precursors used in the literature can be seen as a “template” for inorganic solid-state synthesis, and that a probability-based template selection model similar to those used in organic retrosynthesis can be used in inorganic synthesis planning.
This template-based recommendation would remove the unwanted possibility of yielding unrealistic precursor chemicals that do not satisfy charge neutrality or have no CAS registry number, as in some of the existing template-free generative models for inorganic retrosynthesis predictions.29
In this work, we introduce a template-based graph neural network for inorganic synthesis recipe prediction. The model is trained to predict a set of precursors for inorganic crystals by ranking candidate precursor sets according to their probability scores. The temperature of the solid-state reaction is another important parameter in actual experimental synthesis, and it is affected by both the target crystal and the precursors chosen. Thus, we additionally constructed a temperature prediction model that is sequentially connected to the precursor set prediction model. Combined, these two models generate a set of precursors and a temperature to produce a target solid compound. Due to the high correlation between the probability score and the prediction accuracy, the proposed model has the key advantage of quantifying the confidence of its predictions, which could reduce the experimental cost of failed syntheses.
Fig. 1 Overview of (a) the formulation of source elements and the precursor template library for each and (b) the inorganic retrosynthetic model architecture (see also Fig. 5 for more details on ElemwiseRetro) for top-score-based synthesis recipe prediction.
To categorize the source and non-source elements, we examined the text-mined inorganic reaction database.35 From the inorganic retrosynthetic point of view, we assigned the metal groups (alkali, alkaline earth, transition, lanthanide, actinide, and post-transition), metalloids, phosphorus, selenium, and sulfur as source elements and all other elements as environmental elements. Based on this formulation, we constructed a total of 60 precursor templates from the 13477 entries of the curated inorganic retrosynthetic dataset. The detailed procedures for the dataset selection and curation and the precursor template extraction are described in the Computational methods section.
Based on this formulation, for a given target composition, the compound is encoded as a graph whose node features are obtained from a separate pretrained representation of inorganic compounds. Once the representation is fed into the model, the inorganic retrosynthetic model predicts the precursors that can provide all source elements contained in the given target composition using the source element mask, as shown in Fig. 1b. The formulated source element mask enables the model to discriminate the source element (Li, La, and Zr) information from the given compositions (Li7La3Zr2O12). Each source element is separately used in the following precursor classifier which predicts the precursor in the formulated template library. By calculating the joint probability of a set of precursors determined for each source element, the precursor-sets (synthesis “recipe”) are finally predicted as a probability score which can be ranked. The brief and detailed architecture of the proposed model, ElemwiseRetro, is described in Fig. 1b and 5, respectively.
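The source element mask described above can be illustrated with a minimal sketch. The set `ENVIRONMENTAL` is an assumption made for illustration (non-metals other than P, S, and Se), not the exact list used in the paper, and the helper names `parse_formula` and `source_element_mask` are hypothetical.

```python
import re

# Assumption: environmental (non-source) elements are the non-metals other
# than P, S and Se. This set is illustrative, not the paper's exact list.
ENVIRONMENTAL = {"H", "C", "N", "O", "F", "Cl", "Br", "I",
                 "He", "Ne", "Ar", "Kr", "Xe", "Rn"}


def parse_formula(formula):
    """Parse a flat formula such as 'Li7La3Zr2O12' into {element: amount}.

    Parentheses and hydrates are not handled in this sketch.
    """
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def source_element_mask(formula):
    """Return the elements of the formula and a boolean mask (True = source)."""
    elements = list(parse_formula(formula))
    mask = [element not in ENVIRONMENTAL for element in elements]
    return elements, mask


elements, mask = source_element_mask("Li7La3Zr2O12")
print([e for e, is_source in zip(elements, mask) if is_source])
```

For Li7La3Zr2O12 the mask selects Li, La, and Zr and leaves out O, in line with the example in the text; each selected element would then be passed to the precursor classifier separately.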
Top-k accuracy (%) | ElemwiseRetro (RandSplit) | ElemwiseRetro (TimeSplit) | Baseline |
---|---|---|---|
k = 1 | 78.6 | 80.4 | 50.4 |
k = 2 | 87.7 | 89.4 | 70.5 |
k = 3 | 92.9 | 92.9 | 75.1 |
k = 4 | 94.6 | 94.3 | 77.6 |
k = 5 | 96.1 | 95.8 | 79.2 |
a The accuracies were obtained in a single run, but multiple runs yield similar results.
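As an illustration of the metric in the table, the following sketch computes a top-k exact match accuracy. It assumes that a prediction counts as correct only when the whole precursor set matches the recorded one, irrespective of order; the function name and the toy data are hypothetical and are not taken from the dataset.

```python
def top_k_exact_match(ranked_predictions, true_sets, k):
    """Percentage of targets whose recorded precursor set is among the top-k.

    ranked_predictions: for each target, candidate precursor sets ordered
                        from most to least probable.
    true_sets:          for each target, the recorded precursor set.
    """
    hits = 0
    for ranked, truth in zip(ranked_predictions, true_sets):
        truth = frozenset(truth)
        # Exact match: every precursor must agree, order is irrelevant.
        if any(frozenset(candidate) == truth for candidate in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(true_sets)


# Toy data for illustration only.
ranked = [
    [{"Li2CO3", "La2O3", "ZrO2"}, {"LiOH", "La2O3", "ZrO2"}],
    [{"BaCO3", "TiO2"}, {"BaO", "TiO2"}],
]
truth = [{"LiOH", "La2O3", "ZrO2"}, {"BaCO3", "TiO2"}]
print(top_k_exact_match(ranked, truth, k=1), top_k_exact_match(ranked, truth, k=2))
```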
When several experimental values were reported for a synthesis step, the synthesis temperature was taken as their average. Data with a synthesis temperature below 300 °C or above 1600 °C were removed as outliers. For multi-step reactions with more than one heating step, we took the average temperature to represent the overall reaction, and data with a high standard deviation were removed. We note that several other reaction conditions (e.g. sequence of operations, type of mixing device, heating atmosphere, etc.), which may depend on each laboratory's standard procedures, were not considered; incorporating these conditions is a topic of future work.
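The temperature preprocessing can be sketched as follows. The 300 °C and 1600 °C bounds come from the text; the standard deviation cutoff `MAX_STD` is a hypothetical value, since no threshold is stated, and the order in which the filters are applied is likewise an assumption.

```python
from statistics import mean, pstdev

T_MIN, T_MAX = 300.0, 1600.0  # °C, bounds stated in the text
MAX_STD = 100.0               # °C, hypothetical cutoff (not given in the text)


def representative_temperature(step_temperatures):
    """Return the mean heating temperature, or None if the record is dropped."""
    if not step_temperatures:
        return None
    # Drop records whose heating steps differ too much to be averaged.
    if len(step_temperatures) > 1 and pstdev(step_temperatures) > MAX_STD:
        return None
    average = mean(step_temperatures)
    # Drop outliers outside the accepted temperature window.
    if average < T_MIN or average > T_MAX:
        return None
    return average


print(representative_temperature([950, 1050]))
```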
After the aforementioned preprocessing, the final dataset contains 13477 entries for precursor set prediction from the targets and 9163 entries for synthesis temperature prediction. The whole dataset was divided into training, validation, and test sets (8:1:1) so that the test data were kept separate from the training process. Fig. 4 shows the coverage of the inorganic reaction data, i.e. the fraction of reactions retained in the final dataset after applying the source-element and precursor-template formulation, broken down by target type in the inorganic reaction dataset. The most frequent types (oxides, composites, alloys, and phosphates) are highly covered and represent our inorganic reaction domain. The total reaction coverage of our template-based approach is 91.8%, which could be improved further in the future. Our formulation can handle reactions involving most elements and a broad range of popular inorganic materials (e.g. oxides, composites, alloys, phosphates, etc.).
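A minimal sketch of the two data splits used in this work is given below: the random 8:1:1 split and the publication-year split with a 2016 cutoff. The function names, the fixed seed, and the `"year"` record field are assumptions made for illustration.

```python
import random


def split_8_1_1(records, seed=0):
    """Shuffle records and split them into training, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def split_by_year(records, cutoff=2016):
    """Train on records published up to the cutoff year, test on later ones.

    Each record is assumed to be a dict with a hypothetical "year" field.
    """
    train = [r for r in records if r["year"] <= cutoff]
    test = [r for r in records if r["year"] > cutoff]
    return train, test


train, val, test = split_8_1_1(range(13477))
print(len(train), len(val), len(test))
```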
To predict the retrosynthetic precursors for a given inorganic material, we used source element-wise precursor templates to determine the type of each precursor compound. After thoroughly investigating the whole dataset of 13477 inorganic syntheses curated by the abovementioned preprocessing, we obtained a list of 60 precursor templates (e.g. –CO3 in Li2CO3 and Na2CO3, and –OH in LiOH and Al(OH)3). Based on these precursor templates, our retrosynthetic model predicts one precursor per source element within the pre-defined space of 60 templates.
After training with the precursor templates, the retrosynthetic model classifies each source element to infer its precursor template class. At the end of the classifier, the probability score distribution over the precursor templates is obtained for each source element using the SoftMax layer. From these individual probabilities, we compute the joint probability of each combination of templates, which yields the candidate precursor sets. This probabilistic formulation enables the model to derive the most synthetically probable precursors for inorganic retrosynthesis by ranking the precursor sets in descending order of probability score.
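The ranking step can be sketched as follows. This is an illustration under the assumption that the joint probability of a precursor set is the product of the per-element template probabilities; the function name, the template labels, and the probability values are hypothetical.

```python
from itertools import product
from math import prod


def rank_precursor_sets(per_element_probs, top_k=5):
    """Rank template combinations by joint probability.

    per_element_probs: {source_element: {template: probability}}, one
                       softmax distribution per source element.
    Returns the top_k (assignment, joint_probability) pairs.
    """
    elements = list(per_element_probs)
    candidates = []
    # Enumerate one template choice per source element.
    for combo in product(*(per_element_probs[el].items() for el in elements)):
        templates = [template for template, _ in combo]
        # Assumption: per-element choices are treated as independent.
        score = prod(p for _, p in combo)
        candidates.append((dict(zip(elements, templates)), score))
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:top_k]


# Toy distributions for illustration only.
probs = {
    "Li": {"CO3": 0.7, "OH": 0.3},
    "La": {"O": 0.9, "OH": 0.1},
    "Zr": {"O": 1.0},
}
for assignment, score in rank_precursor_sets(probs, top_k=3):
    print(assignment, round(score, 3))
```

Exhaustive enumeration grows exponentially with the number of source elements, so a practical implementation might expand only the most probable templates per element; that choice is not specified in the text.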
After the set of precursors is predicted by ElemwiseRetro, both the target and the precursors are fed into the second model to predict the synthesis temperature (Fig. 5b). The compositions of the target and precursors are converted to inorganic graphs by the aforementioned Roost. To keep the target and precursor information distinct, the atomic nodes in the inorganic graph are connected only within the target set and within the precursor set, separately. Therefore, the target (or precursor) atomic features are updated only from the surrounding target (or precursor) information. After the MPNN, the attention pooling layer is applied to extract the target and precursor descriptors from the updated target and precursor graphs, respectively. The two descriptors are then concatenated and fed into the regressor network to predict the synthesis temperature.
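The intra-connected graph can be sketched as an edge list. The sketch assumes dense (all-to-all) connectivity inside each group and treats all precursor atoms as a single group; both points are assumptions, and `build_edges` is a hypothetical helper rather than part of Roost.

```python
def build_edges(target_elements, precursor_elements):
    """Build directed edges that stay inside the target or precursor group.

    Nodes are indexed with target atoms first and precursor atoms after.
    """
    n_target = len(target_elements)
    n_precursor = len(precursor_elements)
    groups = [range(0, n_target), range(n_target, n_target + n_precursor)]
    edges = []
    for group in groups:
        for i in group:
            for j in group:
                if i != j:  # no self-loops in this sketch
                    edges.append((i, j))
    return edges


target = ["Li", "La", "Zr", "O"]
precursors = ["Li", "C", "O"]  # atoms of one illustrative precursor, Li2CO3
edges = build_edges(target, precursors)
print(len(edges))
```

Because no edge crosses the two groups, message passing updates target nodes only from other target nodes, and likewise for precursor nodes, before the two pooled descriptors are concatenated.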
The atomic feature vectors learned from ElemNet38 were embedded as the initial node vectors of the inorganic graph. The atomic embedding dimension is 136, which is mapped to 63 dimensions by one linear layer. The stoichiometric weight is concatenated to each mapped atomic vector, resulting in an initial node dimension of 64. We used three MPNN layers to update the node features. The three hidden layers of the prediction network have 512/512/512 nodes. At the end of the prediction network, the SoftMax layer is used in ElemwiseRetro.
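The dimension bookkeeping for the initial node vectors can be checked with a small sketch. Random weights stand in for a trained linear layer, the bias term is an assumption, and the stoichiometric weight is taken here as the fractional amount of the element, which is also an assumption.

```python
import random

EMBED_DIM, MAPPED_DIM = 136, 63  # dimensions stated in the text


def make_linear(in_dim, out_dim, rng):
    """Random weights standing in for a trained linear layer (illustrative)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def initial_node_vector(embedding, stoich_weight, linear):
    """Map a 136-d embedding to 63-d and append the stoichiometric weight."""
    weight, bias = linear
    mapped = [sum(w * x for w, x in zip(row, embedding)) + b
              for row, b in zip(weight, bias)]
    return mapped + [stoich_weight]


rng = random.Random(0)
linear = make_linear(EMBED_DIM, MAPPED_DIM, rng)
li_embedding = [rng.random() for _ in range(EMBED_DIM)]  # placeholder values
node = initial_node_vector(li_embedding, 7 / 24, linear)  # Li in Li7La3Zr2O12
print(len(node))
```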
Given that the field of inorganic retrosynthesis predictions is still in its early stage, it is important to acknowledge the limitations of our approach and identify opportunities for further improvement. Firstly, our model does not encompass inorganic reactions that involve catalytic compounds, as catalysts cannot be considered both as a source element and a non-source element within the framework of our model. This limitation restricts our ability to predict reactions that include catalytic processes. Secondly, our current model does not predict multi-step reaction pathways for inorganic crystals. This is due to the complexity of precursors that contain more than one source element, such as BaTi2O5, where two source elements exist within a single precursor. Our formulated approach is designed to handle one-step reactions, which aligns with the majority of the inorganic reaction data used for training. Thirdly, our model does not account for metastable and unstable precursors. For practical purposes, our study focuses only on commercially available compounds as starting materials, thereby excluding less stable compounds from consideration. Fourthly, certain types of inorganic crystal compounds, such as oxyhalides, sulfides, carbides, nitrides, and hydrides, are not well covered by our model, as indicated in Fig. 4. This limitation arises from the definition of the source element space, which excludes carbon and nitrogen due to their overlapping presence in numerous precursor templates. In addition, even though predicted recipes that do not exist in the dataset were evaluated as inaccurate in this work, they could still be valid. To deal with this, positive and unlabeled (PU) learning could be one solution in future work, treating all recipes that do not appear in the dataset as unlabeled training data.
Furthermore, our model takes only the chemical formula of a target compound as input and therefore cannot distinguish different structures of the same composition, i.e., polymorphs. Since the same composition can have diverse polymorphs in inorganic chemistry, a structure-based retrosynthetic model, as well as a polymorph prediction model, should also be studied. Moreover, since the current model is template-based and cannot predict or suggest precursors outside the domain of the formulated templates, more developed formulations or template-free methods should be studied.
Nevertheless, we expect that the core concept of element-wise, template-based prediction, together with the probabilistic estimate of prediction confidence, will be a promising approach to several future problems in inorganic materials retrosynthesis. In this work, only two element types, source elements and non-source elements, were considered. For further development, more specific element types (more than three) defined from the inorganic retrosynthetic view could be one direction to cover an enlarged reaction space that includes catalytic compounds and complex types of precursors. To cover the multi-step reaction space, one option would be to predict the number of reaction and heating steps before predicting the precursors and synthesis temperature.
Footnote
† Electronic supplementary information (ESI) available: Data distribution, ablation study, and the details of the computational methods. See DOI: https://doi.org/10.1039/d3sc03538g
This journal is © The Royal Society of Chemistry 2024