Daniel Schwalbe-Koda *a, Daniel E. Widdowson b, Tuan Anh Pham a and Vitaliy A. Kurlin *b
a Lawrence Livermore National Laboratory, Livermore, CA, USA. E-mail: dskoda@llnl.gov; vitaliy.kurlin@gmail.com
b University of Liverpool, Liverpool, UK
First published on 27th October 2023
Zeolites are inorganic materials known for their diversity of applications, synthesis conditions, and resulting polymorphs. Although their synthesis is controlled both by inorganic and organic synthesis conditions, computational studies of zeolite synthesis have focused mostly on the design of organic structure-directing agents (OSDAs). In this work, we combine distances between crystal structures and machine learning (ML) to create inorganic synthesis maps in zeolites. Starting with 253 known zeolites, we show how the continuous distances between frameworks reproduce inorganic synthesis conditions from the literature without using labels such as building units. An unsupervised learning analysis shows that neighboring zeolites according to two different representations often share similar inorganic synthesis conditions, even in OSDA-based routes. In combination with ML classifiers, we find synthesis-structure relationships for 14 common inorganic conditions in zeolites, namely Al, B, Be, Ca, Co, F, Ga, Ge, K, Mg, Na, P, Si, and Zn. By explaining the model predictions, we demonstrate how (dis)similarities towards known structures can be used as features for the synthesis space, thus quantifying the intuition that similar structures often share inorganic synthesis routes. Finally, we show how these methods can be used to predict inorganic synthesis conditions for unrealized frameworks in hypothetical databases and interpret the outcomes by extracting local structural patterns from zeolites. In combination with OSDA design, this work can accelerate the exploration of the space of synthesis conditions for zeolites.
On the other hand, computational design of inorganic synthesis conditions for zeolites has not yet achieved the same impact as OSDA design. Despite their promise in controlling active site distribution,20 phase selectivity,21 Si/Al ratio,22 morphology,23 or lowering the cost of syntheses,24 selection of inorganic conditions capable of synthesizing existing and novel zeolites is not easily modeled.25 Recent progress in quantifying the role of inorganic synthesis conditions in zeolites includes: coupling machine learning (ML) and literature extraction;26,27 obtaining structure-synthesis correlations from synthesis routes;21,28 predicting effects of inorganic cations in heteroatom distributions;17,20 or using ML to control composition and particle sizes from OSDA-free syntheses.22 Nevertheless, their reliance on reported data prevents them from proposing inorganic conditions for the synthesis of novel or hypothetical frameworks. Whereas some inorganic synthesis-structure relationships can be derived from building units27,29,30 or alternative structural descriptors,28 automatically screening for new structures in hypothetical zeolite databases requires bypassing human-crafted labels such as building units. Furthermore, although graph-theoretical methods can detect composite building units (CBUs) in arbitrary structures, their computational cost may be prohibitive when exploring large datasets. Data-driven methods based on the topology of the structure also provide information on key factors that govern the kinetics of zeolite crystallization,31,32 but do not immediately inform their synthesis conditions.
Finally, aggregate framework information such as density-energy plots33,34 or local interatomic distances35 provides few correlations between different inorganic synthesis conditions and targeted frameworks, which motivates new data-driven approaches to synthesizability prediction.36 Thus, advancing towards a priori discovery of novel zeolite frameworks requires developing methods to: (1) uncover new synthesis-structure relationships in zeolites; (2) efficiently explore the inorganic synthesis space of zeolites; and (3) bypass the absence of labeled data in hypothetical zeolite databases.
In this work, we correlate inorganic synthesis conditions to zeolites using structural invariants that are independent of a unit cell and preserved under translations and rotations of a structure. In particular, two invariants are used to quantify distances between zeolites: the well-known Smooth Overlap of Atomic Positions (SOAP),37 and a newer method for comparing periodic crystals, the Average Minimum Distance (AMD)38 derived from the Pointwise Distance Distribution (PDD).39 The PDD is independent of a unit cell, continuous under small perturbations, and theoretically complete for generic crystals, and it distinguished all periodic crystals in the Cambridge Structural Database. Importantly, it only requires a fast nearest neighbor search,40 and thus can be computed with low computational cost compared to graph-based approaches or more expensive representations. We show that, for both invariants, structural distances between zeolites can be used to predict inorganic synthesis conditions and recall a comprehensive dataset of synthesis conditions from the literature. Then, we demonstrate that unsupervised and supervised machine learning (ML) methods can be used to create structure-synthesis relationships for zeolites independently of OSDA design (see Fig. 1). Finally, we propose inorganic synthesis conditions to realize hypothetical frameworks based on distances toward structures whose synthesis is known, thus proposing interpretable synthesis-structure models to guide the synthesis of new zeolites.
To construct synthesis-structure maps in zeolites, we first calculated the distance between two idealized frameworks, as extracted from the IZA database, by comparing their AMD vectors (see Methods). Later, to demonstrate that the synthesis similarity hypothesis is not specific to the AMD, we show that structural similarity computed using SOAP also predicts inorganic synthesis conditions in these materials. To test the synthesis similarity hypothesis, we first computed the distance matrix between 253 known frameworks (denoted using their three-letter code, see Methods) in the International Zeolite Association (IZA) database and then performed a qualitative analysis of the results. We found that the AMD values correlated weakly with differences in density and with the SOAP distance between structures, but showed almost no correlation with graph-based distances from previous work41 (Fig. S1†). This can be understood by distorting a given framework without breaking covalent bonds, which leads to different structural fingerprints but equal connectivity. As such, graph and structural distances may be complementary in nature and can be used to model different phenomena.41 Moreover, we noted that some zeolites sharing the lowest distances according to the AMD have been synthesized together, as competing phases, intergrowths, or belonging to the same zeolite families (see Table S1 in the ESI†). Recovering pairs of structurally similar frameworks such as ITH-ITR,42 ITG-IWW,43 SBS-SBT,24 MEL-*SFV,44 or MWF-PAU45 at low distance already suggests that the similarity values qualitatively reproduce some intuitive patterns observed in zeolite synthesis. To generalize this finding, we charted a map of zeolite structures based on their distances. Fig. 2 shows the minimum spanning tree created by converting the AMD distance matrix into a graph with weighted edges.
Although the tree shows discrete connections between first-nearest neighbors and may not offer a complete picture with respect to outliers (see Methods), it facilitates the visualization of the results and may provide insights about synthesis-structure maps. Even without considering synthesis labels of the data in Fig. 2, known relationships between zeolites emerge naturally from the structural tree map. Several zeolites known for their similar building patterns are clustered together in the minimum spanning tree, demonstrating that their AMD values capture the space of zeolites without learnable features. Examples of such clusters include the ABC-6 zeolites, structures containing lov building units, six-membered ring frameworks (e.g., GIU cluster), Ge- or boron-containing zeolites (e.g., BEC or IRR and SFN-SSF branch, respectively), to name a few (see also Fig. S3 for a visual guide†). Similarly, structural outliers within the IZA dataset such as the low-density RWY or JSR, or interrupted frameworks such as -CLO, -SYT, and -ITV tend to cluster together, as their distances to all other zeolites are high (Fig. S2†). At the same time, other interrupted frameworks produced in different synthesis conditions, such as -IRY, -IFU, or -IFT synthesized with germanium, appear along with other germanosilicates in the synthesis map, not only with other interrupted frameworks. This ability to traverse the structural space in a continuous way allows drawing non-obvious connections that may be overlooked by building units alone. For example, the SOD and LTA zeolites may be intuitively regarded as similar because of their co-appearance in some synthesis routes and shared building unit sod, but they are not closely located in the graph from Fig. 2. Further analysis of the neighborhood of SOD indicates that this zeolite is instead closer to the FRA, PTT, DOH, and LOS frameworks.
The first two explicitly contain the sod building unit, and the latter two are 0-dimensional frameworks analogous to SOD. All have higher framework density (∼17 T/1000 Å3) compared to LTA (∼14 T/1000 Å3), and can be accessed in synthesis routes similar to those typical of SOD. This observation illustrates how the data-driven analysis can be effective in drawing correlations between structural pairs that would not be otherwise obvious, given the diversity of zeolite structures. Beyond the SOD-LTA pair, several other similar zeolites, which are non-neighbors in the tree from Fig. 2, can be rationalized with the use of our distance, including OFF-LTL, GME-AFI, and UTL-PCR-OKO (see ESI for an extended discussion†). Outliers can also propose new correlations not previously observed in the synthesis of zeolites. For example, the MEI framework lies within the SBS, SBT, SAO, SBE, and AFS cluster despite not having common CBUs with any of these zeolites. MEI also exhibits unusual 3- and 7-membered rings not seen in any of the zeolites in this group. However, the MEI structure was resolved by realizing the connection between its structure and its AFS counterpart, particularly in the presence of secondary building units with 3-fold point group symmetry connected either directly to each other (AFS building scheme) or through a 3-membered ring (MEI building scheme).46 A similar observation was also the key to characterizing the STA-1 (SAO) zeolite47 and rationalizing the selection of inorganic synthesis conditions for PST-32 (SBT) and PST-2 (SBS/SBT) as aluminosilicate zeolites.24 With this correlation, future computational investigations can help determine whether inorganic synthesis conditions play a role in directing these specific building patterns48 and inform the synthesis of structures such as SAO and SBE as aluminosilicates.
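The tree map of Fig. 2 is built from a pairwise distance matrix. A minimal sketch of that conversion is given below, assuming SciPy's `minimum_spanning_tree` as the tree routine; the source does not state which implementation was used. The framework names and distance values are made up for illustration, and `mst_edges` is a hypothetical helper.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree


def mst_edges(dist, names):
    """Return the minimum-spanning-tree edges of a dense, symmetric distance matrix."""
    dist = np.asarray(dist, dtype=float)
    # csgraph treats zero entries as missing edges. This is harmless for the
    # zero diagonal, but exactly zero off-diagonal distances would be dropped too.
    tree = minimum_spanning_tree(dist).tocoo()
    edges = [(names[i], names[j], float(w))
             for i, j, w in zip(tree.row, tree.col, tree.data)]
    return sorted(edges, key=lambda edge: edge[2])


# Toy distances between four placeholder frameworks (not real AMD values).
names = ["AAA", "BBB", "CCC", "DDD"]
toy = np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.3, 0.7],
    [0.8, 0.3, 0.0, 0.2],
    [0.9, 0.7, 0.2, 0.0],
])
edges = mst_edges(toy, names)
for a, b, w in edges:
    print(f"{a} - {b}: {w:.2f}")
```

Because the tree keeps only n − 1 edges, pairs that are close but not first-nearest neighbors (such as the OFF-LTL case discussed above) do not appear as direct links and have to be read from the full distance matrix.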
Despite the usefulness of the tree map in connecting zeolites with similar synthesis conditions, the visual analysis cannot determine whether the map consistently provides new insights on the synthesis of zeolites. To improve this qualitative analysis, we performed a hierarchical clustering of the data to quantify whether the structural distances cluster the data according to the literature synthesis conditions (see Methods). The dendrogram of AMD values (Fig. S5 and S6†) shows how zeolites are related to each other based on distances, thus providing a more quantitative view of the minimum spanning tree in Fig. 2. Then, to create labels for synthesis conditions, we started with a dataset of extensive synthesis conditions extracted from the zeolite literature from Jensen et al.49 After augmenting the data with frameworks not typically reported in publications, such as those found as minerals, we analyzed the frequency of occurrence of each synthesis condition for each framework. Although the initial dataset had information on both organic and inorganic conditions, we disregarded the OSDAs when labeling the data, thus assuming that inorganic and organic conditions can, to an extent, be predicted independently of each other. Furthermore, given the scarcity of data for some synthesis conditions, we focused only on the 14 inorganic conditions that have been used to synthesize at least 10 zeolites, namely Al, B, Be, Ca, Co, F, Ga, Ge, K, Mg, Na, P, Si, and Zn. Finally, we verify whether flat clusters formed by points with a maximum distance of each other share the same positive labels. This intuition is quantified by computing the homogeneity between data points given clusters formed by a given distance threshold50 (see Methods). If all clusters had only positive labels, their homogeneity would be 1, whereas zero homogeneity indicates perfect mixing of positive and negative labels. Fig. 
3a shows that clusters with at least one positive data point tend to become more homogeneous as the distance threshold decreases. This supports the qualitative view that structures considered similar according to the AMD values also share similar synthesis conditions more often than not. On the other hand, as clusters become larger and increasingly dissimilar structures are grouped together, the homogeneity decreases. Whereas the distribution of labels for some inorganic agents such as Al, Si, Be, F, or Na exhibits higher homogeneity at low distances (see Fig. S7†), others such as Co, Mg, or Zn show little predictive power. While this could be partly due to a lower number of data points for these synthesis conditions (see Table S5 for the total number of data points per element†), this could also be a consequence of weaker structure-synthesis correlations. Zeolites synthesized with beryllium, for example, are as scarce as those synthesized with Mg and have fewer data points than Co or Zn, but can be recalled correctly by the structural similarity tests. This suggests that structural distances computed with the AMD have stronger correlations with certain synthesis conditions than with others.
Fig. 3 Unsupervised learning for inorganic synthesis of known zeolites. (a) Cluster homogeneity of zeolites for selected elements (see also Fig. S7†). (b) Minimum spanning tree of zeolites (Fig. 2) labeled according to frequency of Ge or Na in the synthesis of each zeolite (see also Fig. S4†). Darker (lighter) colors indicate that the inorganic synthesis condition is more (less) frequent in the synthesis of each zeolite. (c) Subset of the zeolite dendrogram for selected regions of the minimum spanning tree.
To demonstrate that these findings are not limited to the AMD invariant and may be intrinsic to zeolite synthesis, we repeated the experiment by computing the distance between frameworks using SOAP (see Methods). Then, based on this new distance matrix, we repeated the analysis of the cluster homogeneity using the same data. Although the distance values from SOAP vectors differ from those from AMD (Fig. S1†), there is good agreement between the cluster homogeneities obtained with SOAP and those obtained with AMD (see Fig. S8†). Interestingly, the SOAP vectors provide slightly better recall, as quantified by the higher homogeneity, for Co, Zn, and Mg, but slightly worse homogeneity scores for Al, Si, Na, and other conditions. Although the exact value of the homogeneity depends on the choice of threshold, these results demonstrate quantitatively that some synthesis-structure relationships can be established in zeolites using structural distances beyond a single choice of invariant.
Additional investigations of the data explain the patterns in homogeneity obtained above. Fig. 3b shows how the minimum spanning tree can be visualized according to the frequency of certain inorganic conditions in zeolite synthesis (see Fig. S4 for complete results using the AMD distance†). For example, some frameworks realizable with Ge or Na form their own groups in the tree, as also illustrated by the subclusters in dendrograms (Fig. 3c). Indeed, zeolites such as BEC, ISV, ITT or IWR are typical examples of large- and extra-large pore structures synthesized using germanium. Similarly, denser phases such as GIU, FAR, LOS, etc. are often obtained in sodium-mediated syntheses. For common synthesis conditions such as silicon, trends can be derived from the visualization of silicon-free routes. The labeled trees from Fig. 2 and S4† show that non-silica zeolites are often located in similar regions of the structure space. Groups formed by zeolites such as NAT, EDI, and THO, or AFO, AEL, AHT show that non-silica zeolites also share structural patterns that may be harder to obtain in silica-based structures.
This unsupervised analysis demonstrates that structurally similar zeolites, according to invariants such as AMD and SOAP, share similar inorganic synthesis conditions. Although zeolite structures contain several outliers and lack true negative data, the structural patterns still provide a strong prior for exploring the synthesis conditions. In particular, as inorganic synthesis conditions can be inferred from the similarity between crystal structures, they can also help downselect structures for zeolites yet to be realized. Finally, although we mainly employed the AMD due to its computational efficiency and well-studied mathematical properties, other strategies could also recover this result from zeolite synthesis, as showcased by the example with SOAP.
The results of the hyperparameter search are summarized in Fig. S9† for classifiers trained with AMD distances, and in Fig. S10† for those trained with SOAP distances. Whereas no classifier outperforms the others in all tasks, XGBoost models often show the best values of ROC and PR areas under the curve (AUC) for a variety of synthesis conditions. When the results for the XGBoost classifier are visualized according to all figures of merit at once (Fig. 4a), they demonstrate how the best hyperparameters lead to adequate figures of merit based on results from the validation set (see also Fig. S11, S12, and S16, and Tables S5 and S6†). When evaluated against a held-out test set, the model with the best set of hyperparameters still exhibits high ROC and PR AUCs for a variety of synthesis conditions (Fig. S13 for the model with the AMD distance†). Nevertheless, this set of hyperparameters is far from being the only one that performs well in these conditions (Fig. S14 and S15†). As discussed in the analysis using unsupervised learning, the ability to correctly label zeolites whose synthesis contains Co or Zn is lower than for other labels, as indicated by the worse performance of all classifiers in labeling these conditions. However, some synthesis conditions that were not well-predicted by the unsupervised learning method, such as Mg, can now be predicted using XGBoost models, despite their low recall (Fig. S13†). Nevertheless, similar trends were found between classifiers trained with the AMD and SOAP invariants (Tables S5–S7†). Whereas classifiers trained on AMD values show a slightly better performance in predicting synthesizability with Al and Si, most of the other performance differences lie within the error bars. These results show that ML classifiers can predict inorganic synthesis conditions using distances between known zeolites as features, and that these structural similarities can be captured by different feature spaces.
This has useful implications, as it bypasses the need to create representations specific for zeolites, and instead uses a set of points in the known zeolite space as references for new synthesis conditions.
To interpret the outcomes of the classifiers, explainability techniques can be used to probe what features most affect a positive (or negative) classification of a zeolite within certain synthesis conditions. Given that the input features are distances between known frameworks, a trained classifier decides how to assign a label to an input structure based on its similarity values. Using the Shapley value method (SHAP) and the classifiers trained on AMD values, we analyze what distances most affect the classification of a zeolite into a given class. As each SHAP value indicates how much each feature affects the probability of classifying a framework into a given class, we compute the Pearson correlation coefficient between the actual feature value and the SHAP value for each one of the inorganic synthesis conditions. This quantifies whether being close to a particular framework (feature) increases or decreases the likelihood of being assigned a positive label. The results for the interpretability of XGBoost classifiers are shown in Fig. 4b (see also Fig. S17 and S18†). As the correlation coefficients between AMD and SHAP values are computed on a per-feature (thus per-zeolite) basis, the nodes from the tree map in Fig. 2 are colored according to these coefficients. In this plot, a negative correlation (in green color) indicates that low distances increase the SHAP value and thus the likelihood of being classified as a positive label. Conversely, a positive correlation (in pink color) with a feature indicates that a given zeolite is more likely to be synthesized with a given synthesis condition if it is distant from the featurizing structure. The results not only support the observations highlighted by the unsupervised learning methods, but also complement them with new insights. For instance, zeolites synthesized with Ca and K have a wide overlap of positive and negative correlations (Fig. S18†), possibly due to the clustering of minerals in the tree map.
There is also an overlap between boron-containing zeolites and germanium-containing zeolites, especially in the ISV branch. This result could be interesting if validated in practice, especially if the use of boron could help with the removal of Ge from the synthesis of certain zeolites. The central branch characterized by Ge-containing zeolites (such as BEC, ISV, IRR, ITT, see Fig. 2) also has features that correlate with F or Mg, but not Al or Ca. On the other hand, Be-containing zeolites are often complementary to Si-containing zeolites, as the former are found only in specific clusters or outliers in the tree map. This further supports the fact that the classifiers are able to obtain correlations beyond existing heuristics, thus providing data-driven ways to guide inorganic synthesis in zeolites.
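The per-feature correlation used to color the tree map can be sketched as follows. This is a minimal illustration that assumes the SHAP values have already been computed by an explainer and are supplied as an array with the same shape as the feature matrix; `feature_shap_correlation` is a hypothetical helper, and the data are synthetic.

```python
import numpy as np


def feature_shap_correlation(features, shap_values):
    """Pearson r between each feature column and the matching SHAP-value column."""
    X = np.asarray(features, dtype=float)
    S = np.asarray(shap_values, dtype=float)
    Xc = X - X.mean(axis=0)
    Sc = S - S.mean(axis=0)
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (Sc ** 2).sum(axis=0))
    with np.errstate(invalid="ignore", divide="ignore"):
        # Constant columns have no defined correlation and are returned as NaN.
        return np.where(denom > 0, (Xc * Sc).sum(axis=0) / denom, np.nan)


# Synthetic example: rows are zeolites, columns are distances to two references.
rng = np.random.default_rng(0)
distances = rng.uniform(0.0, 1.0, size=(50, 2))
# Made-up SHAP values: proximity to reference 0 raises the prediction,
# whereas proximity to reference 1 lowers it.
shap_like = np.column_stack([-2.0 * distances[:, 0], 0.5 * distances[:, 1]])
r = feature_shap_correlation(distances, shap_like)
```

In this toy case the first coefficient is negative (the "green" case, where low distance favors a positive label) and the second is positive (the "pink" case).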
As demonstrated in this work, zeolites in the neighborhood of known frameworks are likely to share synthesis conditions with those known structures. Thus, downselecting frameworks for given synthesis conditions can benefit from the unsupervised and supervised methods developed here. This approach can be used in combination with previous “synthesizability descriptors” of zeolites, such as local interatomic distances35 or other data-driven predictions.53 However, we chose to evaluate them independently, as these synthesizability predictions do not take into account that certain known frameworks may be considered “unfeasible” depending on the synthesis conditions.34,54,55 For instance, structures containing three-connected rings, such as those with building units lov or vsv, could be ranked as “unsynthesizable,” despite being achieved with beryllium or borogermanate conditions. Thus, to propose synthesis conditions for zeolites, we evaluated all hypothetical frameworks for all synthesis conditions using an ensemble of 100 binary classifiers per inorganic condition (see Methods). As each classifier is trained on different negative data splits, the resulting classification varies for each model, allowing us to assess the degree of agreement between the models. By taking the average of the predictions, we obtain the agreement of the classifiers regarding the feasibility of the given pair of zeolite and synthesis condition.
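The ensemble agreement described above can be illustrated with a short sketch. It assumes scikit-learn's `GradientBoostingClassifier` as a stand-in for the XGBoost models, treats unlabeled points as candidate negatives, and uses synthetic feature vectors; `ensemble_agreement` and its parameters are hypothetical and do not reproduce the settings used in this work.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def ensemble_agreement(X_pos, X_unlabeled, X_query, n_models=100, seed=0):
    """Average predicted label over models trained on different negative subsamples."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_query))
    for _ in range(n_models):
        # Each model sees all positives and a fresh, equally sized draw of negatives.
        idx = rng.choice(len(X_unlabeled), size=len(X_pos), replace=False)
        X = np.vstack([X_pos, X_unlabeled[idx]])
        y = np.r_[np.ones(len(X_pos)), np.zeros(len(idx))]
        model = GradientBoostingClassifier(
            n_estimators=20, random_state=int(rng.integers(1 << 31))
        )
        votes += model.fit(X, y).predict(X_query)
    return votes / n_models


# Synthetic stand-ins for distance-based features (not real zeolite data).
rng = np.random.default_rng(1)
X_pos = rng.normal(0.0, 0.3, size=(20, 3))
X_unlabeled = rng.normal(3.0, 0.3, size=(60, 3))
X_query = np.array([[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]])
agreement = ensemble_agreement(X_pos, X_unlabeled, X_query, n_models=10)
```

An agreement of 1 means that every model in the ensemble assigns a positive label to the query structure, which is the sense in which a "100% probability" is used when filtering candidates.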
Fig. 5a depicts the distribution of hypothetical zeolites with Si-based recipes in the neighborhood of LTA zeolite. As all distances between known and hypothetical zeolites had been already computed, we can use both the distances and the class probabilities as criteria for navigating the space of hypothetical structures. This navigation using reference materials instead of features facilitates the selection process and can also inform their synthesis. For example, Fig. 5a and b illustrate two different hypothetical zeolites in the neighborhood of LTA. Although both have low distance towards LTA (compare with dendrogram in Fig. S6†), structure #308,105 is predicted to be more likely to be synthesized as a silicate than #313,030. Both contain the lta and sod cages characteristic of the LTA zeolite, but differ by the presence of a second cage similar to sod, shown in Fig. 5b. Whereas this new building unit resembles an expanded sod cage with distorted six-membered rings in #308,105, hypothetical framework #313,030 shows a new cage, formed by the merging of two sod cages, not seen in known zeolites. This increased distance towards known structural patterns drives the prediction of feasible synthesis using Si as unlikely, even when the distance towards the LTA zeolite is lower. This example shows how the combination of structural distances and classifier predictions facilitates the exploration of hypothetical zeolites using reference structures.
Beyond the exploration of the zeolite space, the models also uncover existing and new synthesis-structure relationships. Fig. 5c shows three examples of hypothetical frameworks predicted to be synthesized using three different elements: Be, Ge, and K. To obtain these frameworks, we filtered only frameworks within densities of 14 and 17 T/1000 Å3 that are predicted to have 100% probability of synthesis with the given element. Then, we ranked the frameworks by their relative energy. Despite not using explicit labels on the CBUs, the supervised learning models recovered the known heuristics of building units and inorganic synthesis conditions. For instance, framework #261,338, predicted to be synthesized in presence of Be, is formed mostly by lov building units, as found in other Be-zeolites such as RSN, LOV, or NAB. This same framework is predicted to be unlikely as a silicate, possibly following the trends seen in JSR or NPT structures. Hypothetical zeolite #64,550, predicted to be synthesized with germanium, also shows features similar to known ones. In addition to its three-dimensional pore structure, with 12 × 12 × 10 intersecting pores, the structure shows the d4r CBU typical of other structurally similar germanosilicates, such as POS or UOV, but with 7 symmetrically inequivalent T sites. Finally, one unrealized framework predicted to be synthesized with potassium is structure #303,768. Although this hypothetical structure does not exhibit typical CBUs, the local structures similar to d8r CBUs are predicted to be favored by K, in analogy with similar relationships in known zeolites. This demonstrates how data-driven models can not only recover known relationships between CBUs and inorganic conditions, but also propose new synthesis-structure relationships in zeolites based on distance patterns between known structures. 
When used to analyze the entire space of hypothetical frameworks, the models show that the distribution of predicted inorganic synthesis conditions is uneven across the space of zeolites (Fig. S24†). For instance, whereas about 27% of all known zeolites can be synthesized with germanium, according to the literature dataset we used in this work, only 8% of the hypothetical zeolites are predicted to be synthesizable under Ge conditions with an agreement of at least 80%. Similarly, the space of hypothetical structures is surprisingly lacking in structures predicted to be synthesizable with sodium. While about 45% of all known structures have at least one sodium-based synthesis, only 17% of hypothetical structures are predicted to be realizable with Na given the 80% threshold probability. As most enumerated datasets are often created without considering synthesis conditions,56 comprehensive enumerations may introduce biases in structures that do not reflect the space of zeolite synthesis typically considered in practice. Thus, in combination with OSDA design17 and property screening,57,58 our methods to predict inorganic synthesis conditions in zeolites may help in synthesizing unrealized frameworks with targeted properties or formulating additional databases of hypothetical structures.
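The selection behind Fig. 5c (density window, unanimous classifier agreement, ranking by relative energy) and the 80% agreement statistic can be written as a short filter. The sketch below uses hypothetical record fields (`density`, `energy`, `probability`) and made-up values; it only illustrates the logic, not the actual database schema.

```python
def shortlist(frameworks, element, density_range=(14.0, 17.0), min_probability=1.0):
    """Filter by framework density and predicted probability, then rank by energy."""
    lo, hi = density_range
    hits = [
        fw for fw in frameworks
        if lo <= fw["density"] <= hi and fw["probability"][element] >= min_probability
    ]
    return sorted(hits, key=lambda fw: fw["energy"])


def fraction_above(probabilities, threshold=0.8):
    """Fraction of structures whose ensemble agreement reaches the threshold."""
    return sum(p >= threshold for p in probabilities) / len(probabilities)


# Made-up hypothetical frameworks: density in T/1000 A^3, relative energy in
# arbitrary units, probability = ensemble agreement for one element.
frameworks = [
    {"id": "h1", "density": 15.0, "energy": 12.0, "probability": {"Ge": 1.0}},
    {"id": "h2", "density": 16.5, "energy": 8.0, "probability": {"Ge": 1.0}},
    {"id": "h3", "density": 18.0, "energy": 5.0, "probability": {"Ge": 1.0}},
    {"id": "h4", "density": 15.5, "energy": 3.0, "probability": {"Ge": 0.6}},
]
ranked = [fw["id"] for fw in shortlist(frameworks, "Ge")]
share = fraction_above([fw["probability"]["Ge"] for fw in frameworks])
```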
The ambiguity of crystal representations was theoretically resolved for all periodic point sets in any dimension by the complete isoset60 invariant. We define below the computationally faster Pointwise Distance Distribution (PDD) invariant, which distinguished all (more than 670000) periodic crystals in the Cambridge Structural Database (CSD) through more than 200 billion pairwise comparisons within two days on a typical desktop computer.
Fix a number k ≥ 1 of atomic neighbors. Our experiments on zeolites and the CSD used k = 100. Let S be a periodic set with a motif M of points p1, …, pm. For each point pi, write down the sequence of increasing distances di1 ≤ ⋯ ≤ dik to its k nearest neighbors in the full infinite set S without considering any extended cell or cut-off radius. Collect these sequences of distances into an m × k matrix and lexicographically order the rows. If any l of the rows coincide (usually due to extra symmetries), collapse them into a single row with the weight l/m and put these weights into an additional first column (unique rows have weight 1/m). The resulting m × (k + 1) matrix PDD(S; k) is called the Pointwise Distance Distribution, a statistical distribution of rows with weights describing each point's environment. As an example, Fig. 6 shows the computation for a point in the square lattice S whose first k = 8 neighbours have distances 1, 1, 1, 1 (in green) and √2, √2, √2, √2 (in blue).
The Average Minimum Distance AMD(S; k) is the vector obtained by taking the weighted average of the last k columns in PDD(S; k), so AMD is a single vector of k average distances. To compare two AMD vectors of the same length, our experiments used the L∞ (Chebyshev) metric equal to the maximum absolute difference of corresponding coordinates. For a metric on PDDs, we use the Earth Mover's Distance (EMD)61 with the L∞ metric on rows. If any point of S is perturbed in its ε-neighborhood, then PDD(S; k) changes by at most 2ε in the EMD metric. If a periodic set S is generic, which is achievable by almost any perturbation of atoms, then S can be reconstructed from the number m of motif points, a (basis of a) lattice Λ and PDD(S; k) with a known upper bound on k. For the details on these results, see Definition C5 and proofs of Theorems 4.3 and 4.4 in the extended version of ref. 62.
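The definitions above can be illustrated on the square-lattice example. The sketch below is not the implementation used in this work: it approximates the infinite periodic set by a finite block of unit cells (an assumption controlled by the hypothetical `reps` parameter) and skips the collapsing of identical rows, which does not change the averaged AMD vector.

```python
import numpy as np
from scipy.spatial import cKDTree


def pdd_rows(motif, cell, k, reps=4):
    """Distances from each motif point to its k nearest neighbours.

    ASSUMPTION: the infinite set is replaced by (2*reps+1)^d copies of the
    unit cell, which is valid only while the k-th neighbour lies inside it.
    """
    motif = np.asarray(motif, dtype=float)
    cell = np.asarray(cell, dtype=float)   # rows are lattice vectors
    dim = cell.shape[0]
    grid = np.meshgrid(*[np.arange(-reps, reps + 1)] * dim, indexing="ij")
    shifts = np.stack(grid, axis=-1).reshape(-1, dim) @ cell
    cloud = (motif[None, :, :] + shifts[:, None, :]).reshape(-1, dim)
    dists, _ = cKDTree(cloud).query(motif, k=k + 1)
    return dists[:, 1:]                    # drop the zero distance to itself


def amd(motif, cell, k, reps=4):
    """AMD vector: column-wise mean of the (uncollapsed) PDD rows."""
    return pdd_rows(motif, cell, k, reps).mean(axis=0)


def chebyshev(u, v):
    """L-infinity distance between two AMD vectors of equal length."""
    return float(np.max(np.abs(np.asarray(u) - np.asarray(v))))


square = amd([[0.0, 0.0]], np.eye(2), k=8)
stretched = amd([[0.0, 0.0]], np.diag([1.0, 1.1]), k=8)
print(square)                        # four distances of 1, then four of sqrt(2)
print(chebyshev(square, stretched))
```

Stretching one lattice vector by 10% changes the AMD vector by at most 0.1 in the L∞ metric, which illustrates the continuity of the invariant under small perturbations.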
Within this work, a metric is defined as a distance function d: χ × χ → [0,+∞) that satisfies the following axioms: (1) d(x, x) = 0 and d(x, y) > 0, ∀x ≠ y; (2) d(x, y) = d(y, x); and (3) d(x, z) ≤ d(x, y) + d(y, z), ∀x, y, z ∈ χ. Generally, weaker concepts of distance relax the first constraint to d(x, y) ≥ 0. “Distances” between crystal structures, therefore, refer to distances between structural invariants (e.g., vectors) that are independent of a unit cell and are preserved under rotations and translations of the crystal structure.
It has been reported that local structural fingerprints such as SOAP may be unable to distinguish between certain atomic environments.64 While this may not be a problem when comparing zeolites, especially as cutoffs become larger,53 degeneracies in the descriptor space can limit the accuracy with which local environments — and, therefore, structures — can be distinguished. Specifically, as Pozdnyakov et al.64 demonstrate examples of degenerate manifolds in systems with typical tetrahedral coordination such as methane or silicon, it could be possible to find similar degeneracies in four-connected zeolite networks. Nevertheless, this method has also been widely successful in charting the space of materials, including other four-connected networks,65 and is not expected to interfere substantially with the ML results in this manuscript.
Following the notation from the IZA, known zeolites are named in this manuscript according to their three-letter code in bold typeface. Known CBUs are represented with their three-letter code in lowercase and italic typeface.
In the literature analysis, a zeolite is classified as having a certain synthesis condition when at least 25% of its synthesis recipes exhibit that condition (excluding OSDAs). This label is used as a categorical variable when performing the classification task.
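The labeling rule can be sketched as follows, assuming that each synthesis recipe of a zeolite is represented as a set of inorganic conditions with OSDAs already removed. The function name `label_conditions` and the example recipes are hypothetical.

```python
def label_conditions(recipes, threshold=0.25):
    """Return the conditions present in at least `threshold` of the recipes."""
    if not recipes:
        return set()
    counts = {}
    for recipe in recipes:
        for condition in set(recipe):
            counts[condition] = counts.get(condition, 0) + 1
    return {c for c, n in counts.items() if n / len(recipes) >= threshold}

# Five hypothetical recipes for one framework (OSDAs excluded beforehand).
recipes = [
    {"Si", "Al", "Na"},
    {"Si", "Al"},
    {"Si", "Ge"},
    {"Si", "Al", "K"},
    {"Si"},
]
labels = label_conditions(recipes)
```

In this toy case, Na, Ge, and K each appear in only one of five recipes (20%) and therefore do not become positive labels.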
The dendrogram of known zeolites was produced by creating a linkage matrix from the distance matrix using the Ward algorithm as implemented in SciPy (v. 1.10.0).68 The resulting clusters in Fig. S6† were obtained by forming flat clusters using a given threshold as the maximum AMD distance.
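A minimal sketch of this clustering step with SciPy is given below. The toy distance matrix and the threshold are hypothetical, and the use of `criterion="distance"` in `fcluster` is an assumption about how the threshold was applied.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical symmetric matrix of pairwise distances between four frameworks.
dist = np.array([
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [1.0, 0.9, 0.2, 0.0],
])

# linkage expects a condensed (upper-triangular) distance vector.
condensed = squareform(dist)
Z = linkage(condensed, method="ward")

# Flat clusters: cut the dendrogram at an assumed threshold of 0.5.
clusters = fcluster(Z, t=0.5, criterion="distance")
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` for plotting.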
The homogeneity of the clustering was computed by calculating the Shannon entropy of flat clusters created with a given threshold,50 as implemented in scikit-learn (v. 1.2.0).69 As the literature dataset is not balanced and lacks true negative points, the homogeneity was only computed for clusters containing at least one positive data point. This ensures that a large homogeneity corresponds to recall of positive data points, which prevents biasing this metric in imbalanced datasets.
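The restriction to clusters containing at least one positive data point can be sketched as below, using `homogeneity_score` from scikit-learn. The helper name `positive_cluster_homogeneity` and the toy labels are hypothetical, and the exact filtering used in this work may differ.

```python
import numpy as np
from sklearn.metrics import homogeneity_score

def positive_cluster_homogeneity(labels, clusters):
    """Homogeneity restricted to clusters with at least one positive label."""
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    positive_clusters = np.unique(clusters[labels == 1])
    mask = np.isin(clusters, positive_clusters)
    return homogeneity_score(labels[mask], clusters[mask])

# Toy example: cluster 2 has no positive points and is ignored.
labels = [1, 1, 0, 0, 1, 0]
clusters = [1, 1, 2, 2, 3, 3]
h = positive_cluster_homogeneity(labels, clusters)
```

Here cluster 1 is pure and cluster 3 is mixed, so the score lies strictly between 0 and 1.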
Dimensionality reduction was performed using UMAP,70 as implemented in the umap-learn package in Python (v. 0.5.3). The 2D UMAP plot was produced by comparing hypothetical frameworks using the cosine distance of their normalized distances to IZA structures, with the number of neighbors set to 10.
Train-validation-test sets were created starting with a 60-20-20 ratio, respectively, then subsampling the training set to have an equal number of points with positive and negative labels. Although techniques such as reweighting or resampling could have been employed to obtain balanced training sets, removing data points is a simple approach that prevents classifiers from treating negative data as “true negative”, resembling positive-unlabeled learning strategies.
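One possible implementation of this splitting strategy is sketched below, assuming a random (non-stratified) 60-20-20 split followed by undersampling of the majority class in the training set only. The function name `split_and_balance`, the seed, and the toy labels are hypothetical.

```python
import numpy as np

def split_and_balance(y, seed=0):
    """Random 60-20-20 split, then undersample the training set to balance it."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    idx = rng.permutation(n)
    n_train, n_val = (6 * n) // 10, (2 * n) // 10
    train, val, test = np.split(idx, [n_train, n_train + n_val])

    # Keep the same number of positive and negative points in the training set.
    pos, neg = train[y[train] == 1], train[y[train] == 0]
    m = min(len(pos), len(neg))
    balanced = np.concatenate([
        rng.choice(pos, size=m, replace=False),
        rng.choice(neg, size=m, replace=False),
    ])
    return balanced, val, test

# Toy labels: 20 positive and 80 negative (unlabeled) frameworks.
y = np.array([1] * 20 + [0] * 80)
train_idx, val_idx, test_idx = split_and_balance(y)
```

Varying `seed` changes which negative points are retained, which is one way to generate the different training splits of an ensemble.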
Hyperparameter optimization of synthesis classifiers was performed using a grid search over relevant hyperparameter spaces for logistic regression, random forest, and XGBoost71 methods. The full range of hyperparameters investigated in this search is shown in Tables S2–S4,† following the notation in the scikit-learn (v. 1.2.0) and xgboost (v. 1.7.5) Python packages. Model performances were compared using the same dataset splits, and the best model was selected according to its validation performance. The results in the paper showcase the performance on held-out test data. While training errors are typically smaller than errors on held-out data, the good performance of the models on the validation and test sets suggests that their generalization power is not being degraded by overfitting.
One of the best models to classify synthesis conditions of zeolites was the XGBoost model with the following hyperparameters: colsample_bytree = 0.5, learning_rate = 0.1, max_depth = 6, min_child_weight = 1, n_estimators = 200, subsample = 0.5. This model and set of hyperparameters showed good performance across a range of inorganic synthesis conditions, as evaluated by the accuracy, precision, recall, F1 score, area under the receiver operating characteristic curve (ROC AUC), and area under the precision-recall curve (PR AUC). In particular, the best model was selected to maximize the ROC AUC and PR AUC for the balanced classifiers. As a comparison, the performance metrics of unbalanced classifiers (i.e., those trained on dataset splits with an uneven number of positive/negative labels) and their baselines are shown in Fig. S13.†
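The six evaluation metrics can be computed with scikit-learn as sketched below. The sketch assumes that PR AUC is estimated with `average_precision_score` and that class predictions are obtained with a 0.5 threshold on the predicted scores; both choices, the function name `evaluate`, and the toy data are illustrative assumptions. The hyperparameters are those quoted in the text and are only stored in a dictionary here, without training a model.

```python
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hyperparameters quoted in the text. They could be passed as keyword
# arguments to xgboost.XGBClassifier (not imported in this sketch).
xgb_params = {
    "colsample_bytree": 0.5,
    "learning_rate": 0.1,
    "max_depth": 6,
    "min_child_weight": 1,
    "n_estimators": 200,
    "subsample": 0.5,
}

def evaluate(y_true, y_score, threshold=0.5):
    """Compute classification metrics from true labels and predicted scores."""
    y_pred = [int(s >= threshold) for s in y_score]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
        # Average precision is used here as an estimate of the PR AUC.
        "pr_auc": average_precision_score(y_true, y_score),
    }

# Toy labels and scores standing in for held-out predictions.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
metrics = evaluate(y_true, y_score)
```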
The predictions of the classifiers were explained using the Shapley value method (SHAP)72 under the TreeExplainer framework,73 as implemented in the shap Python package (v. 0.41.0). The interventional feature perturbation method was used without limit for the tree explainer. Then, correlations between the SHAP values and the distance features were computed for each synthesis condition. To ensure that the correlations are not artifacts of particular train splits, we report the average correlation obtained from an ensemble of 100 XGBoost classifiers trained on splits with different negative data points.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00134b
This journal is © The Royal Society of Chemistry 2023