Anthony Onwuli,a Ashish V. Hegde,a Kevin V. T. Nguyen,a Keith T. Butler*b and Aron Walsh*ac
aDepartment of Materials, Imperial College London, London SW7 2AZ, UK. E-mail: a.walsh@imperial.ac.uk
bDepartment of Chemistry, University College London, London WC1H 0AJ, UK. E-mail: k.t.butler@ucl.ac.uk
cDepartment of Physics, Ewha Womans University, Seoul 03760, Korea
First published on 12th September 2023
The traditional display of elements in the periodic table is convenient for the study of chemistry and physics. However, the atomic number alone is insufficient for training statistical machine learning models to describe and extract composition–structure–property relationships. Here, we assess the similarity and correlations contained within high-dimensional local and distributed representations of the chemical elements, as implemented in an open-source Python package ElementEmbeddings. These include element vectors of up to 200 dimensions derived from known physical properties, crystal structure analysis, natural language processing, and deep learning models. A range of distance measures are compared and a clustering of elements into familiar groups is found using dimensionality reduction techniques. The cosine similarity is used to assess the utility of these metrics for crystal structure prediction, showing that they can outperform the traditional radius ratio rules for the structural classification of AB binary solids.
Significant advances have been made in the statistical description of chemical systems with the application of supervised, unsupervised and generative machine learning (ML) techniques.5–7 A critical factor in the performance of such ML models for chemical systems is the representation of the constituent elements. The atomic number of an element can be augmented or replaced by a vector that may be built directly from standard data tables, trained from chemical datasets using a machine learning model, or even generated from random numbers. Such representations can be categorised as local (vector components with specific meaning) or distributed (vector components learned from training data). These have been used to build powerful ML models for property prediction based on composition alone.8–11
Perhaps the simplest local representation is one-hot encoding where a binary n-dimensional vector v is used to categorise the atomic number of the element, e.g. H can be represented as [1000…] and He as [0100…]. A single component is ‘hot’ for each element, thus providing an orthogonal and sparse description. A selection of other common representations from the literature is given in Table 1.
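As a minimal sketch, one-hot encoding can be reproduced in a few lines of Python (the element list and function below are illustrative, not part of ElementEmbeddings):

```python
import numpy as np

# Illustrative vocabulary: the first five elements by atomic number.
elements = ["H", "He", "Li", "Be", "B"]

def one_hot(symbol: str, vocabulary: list) -> np.ndarray:
    """Return a binary vector with a single 'hot' component for the element."""
    vector = np.zeros(len(vocabulary), dtype=int)
    vector[vocabulary.index(symbol)] = 1
    return vector

print(one_hot("H", elements))   # [1 0 0 0 0]
print(one_hot("He", elements))  # [0 1 0 0 0]
```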
In this study, we are interested in the latent chemical information that can be distilled from such high-dimensional element representations. We consider the fundamental concept of element similarity, which can be defined here as the distance or correlation between elemental vectors. We explore various metrics and then apply them to data-driven structure classification for the case of binary solids. The underlying tools have been combined into an open-source and modular Python package ElementEmbeddings to support future investigations.
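Loading a pre-packaged representation is intended to be a one-liner; the snippet below follows the ElementEmbeddings README at the time of writing, but the exact API (the `load_data` loader and `embeddings` attribute) should be checked against the package documentation:

```python
from elementembeddings.core import Embedding

# Load the 22-dimensional Magpie vectors bundled with the package.
magpie = Embedding.load_data("magpie")
print(magpie.embeddings["Li"])  # vector representation of lithium
```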
The Magpie12 representation is a 22-dimensional vector. It is a local representation where the vector components have specific meaning, as they are built from elemental properties including the atomic number, effective radii, and the row of the periodic table. The Mat2Vec14 representation is a 200-dimensional distributed representation built from unsupervised word embeddings18 of over 3 million abstracts of publications between 1922 and 2018. In contrast, the atomic weights from a crystal graph convolutional neural network trained to predict the formation energies of crystalline materials are used to generate the 16-dimensional MEGNet15 representation. The Random_200 representation is simply a 200-dimensional vector generated randomly for each element, employed here as a control measure. Each vector component is drawn from the standard normal distribution, $\mathcal{N}(0, 1)$.
The actual vectors were collected from various sources: the Magpie, Oliynyk and Mat2Vec representations were obtained as csv files from the CBFV repository;19 the MatScholar and MEGNet16 vectors were obtained from the lrcfmd/ElMD repository;20 the SkipAtom embeddings were obtained from the lantunes/SkipAtom repository; NumPy21 was used to generate the Random_200 vectors. We found that the original Oliynyk csv file had four columns with missing values: Miracle_Radius_[pm], crystal_radius, MB_electronegativty and Mulliken_EN. For Miracle_Radius_[pm], we used the mode to impute the missing values, and for the other three columns we used k-nearest-neighbour (kNN) imputation with the default parameters in scikit-learn.22 The imputation methods were chosen such that the overall distributions were preserved. All embedding vectors used in this work were standardised prior to analysis.
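A hedged sketch of this preprocessing, assuming the Oliynyk data have been read into a pandas DataFrame (the file name is hypothetical; the column names follow the text above):

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("oliynyk.csv")  # hypothetical file name

# Mode imputation for the Miracle radius column.
mode_value = df["Miracle_Radius_[pm]"].mode()[0]
df["Miracle_Radius_[pm]"] = df["Miracle_Radius_[pm]"].fillna(mode_value)

# kNN imputation (scikit-learn defaults) for the remaining missing values.
numeric = df.select_dtypes("number")
df[numeric.columns] = KNNImputer().fit_transform(numeric)

# Standardise the embedding vectors prior to analysis.
X = StandardScaler().fit_transform(df[numeric.columns])
```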
The distance between two n-dimensional element vectors A and B can be quantified by the Minkowski (Lp) family of metrics:

$$d_p(A, B) = \left( \sum_{i=1}^{n} |A_i - B_i|^p \right)^{1/p} \qquad (1)$$
For p = 2, we obtain the Euclidean (or L2) distance which is the length of a line segment connecting A and B:
$$d_2(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2} \qquad (2)$$
For p = 1, we obtain the Manhattan (or L1) distance, which sums the absolute differences along each dimension:

$$d_1(A, B) = \sum_{i=1}^{n} |A_i - B_i| \qquad (3)$$
In the limit p → ∞, we obtain the Chebyshev (or L∞) distance, the largest separation along any single dimension:

$$d_\infty(A, B) = \max_{i} |A_i - B_i| \qquad (4)$$
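For reference, eqn (1)–(4) map directly onto SciPy's distance functions; `vec_a` and `vec_b` below stand in for two standardised element vectors:

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
vec_a, vec_b = rng.standard_normal(22), rng.standard_normal(22)

print(distance.minkowski(vec_a, vec_b, p=3))  # eqn (1), general p
print(distance.euclidean(vec_a, vec_b))       # eqn (2), p = 2
print(distance.cityblock(vec_a, vec_b))       # eqn (3), p = 1 (Manhattan)
print(distance.chebyshev(vec_a, vec_b))       # eqn (4), p -> infinity
```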
Element separations are plotted for Euclidean and Manhattan distance in Fig. 1 and 2, with other measures shown in the ESI.† The elements are ordered in increasing atomic number along the x-axis and decreasing atomic number along the y-axis. This cuts across the groups in the periodic table. The leading diagonals in the distance plots are zero-valued as they correspond to d(A,A). The lighter blues correspond to elements whose vector representations are close to each other within the chosen metric space. These elements can be interpreted as similar to each other. Stripes are seen for the noble gas elements, such as Kr and Xe, which are very different from the neighbouring halogens and alkali metals. On a visual basis, the global structure of the heatmaps appears similar for the Euclidean and Manhattan distances, with the main difference being the absolute scale of the distances. Less structure is seen for the Random_200 vectors, as expected for this control representation.
Fig. 1 Map of the pairwise Euclidean distance between element vectors for four representation schemes. The elements are ordered in increasing atomic number along the axes from 1 (H) to 83 (Bi).
Fig. 2 Map of the pairwise Manhattan distance between element vectors for four representation schemes.
Alternatively, we can consider the angle between vectors using the cosine similarity based on the dot product:
$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \qquad (5)$$
The Pearson correlation coefficient provides a measure of the linear correlation:
$$\rho_{A,B} = \frac{\operatorname{cov}(A, B)}{\sigma_A \sigma_B} \qquad (6)$$
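Both measures are short NumPy/SciPy expressions; the vectors below are placeholders for two standardised element representations:

```python
import numpy as np
from scipy.stats import pearsonr

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Eqn (5): scale-invariant cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
vec_a, vec_b = rng.standard_normal(200), rng.standard_normal(200)

print(cosine_similarity(vec_a, vec_b))
print(pearsonr(vec_a, vec_b)[0])  # eqn (6), linear correlation in [-1, 1]
```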
Fig. 3 Map of the cosine similarity between element vectors for four representation schemes.
Fig. 4 Map of the Pearson correlation coefficient between element vectors for four representation schemes.
The cosine similarity and Pearson correlation are convenient metrics as both cos(θ) and ρ ∈ [−1, 1]. The resulting heat maps are visually similar, with comparable structure to the distance metrics. Histograms of the values are shown in Fig. S3 and S4.† A skewed distribution is found in each case with the exception of Random_200, which follows a normal distribution by construction. We note that the cosine similarity is scale-invariant as it depends only on the angle between vectors. Some elemental representation schemes may be sensitive to bias in the training data, such as an abundance of certain metal oxides, which can produce outliers in the vector components. We therefore use the cosine similarity in later sections.
The first method is principal component analysis (PCA). Here, two principal component axes are defined by a linear transformation of the original features that captures the greatest variance in the vector components. The PCA, generated using scikit-learn,22 is shown in Fig. 5 with each data point coloured by its group in the periodic table.
Fig. 5 Two-dimensional projection of four element representations using principal component analysis.
The second approach is t-distributed stochastic neighbour embedding (t-SNE). Unlike PCA, this algorithm is a nonlinear dimensionality reduction technique that can better separate data that are not linearly separable. Here a probability distribution is generated to represent the similarities between neighbouring points in the original high-dimensional space and a similar distribution with the same number of points is found in a lower-dimensional space. The t-SNE, also generated using scikit-learn,22 is shown in Fig. 6 with each data point coloured by its group in the periodic table.
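A minimal sketch of both projections with scikit-learn, where `X` stands in for the standardised (elements × dimensions) embedding matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.standard_normal((83, 200))  # placeholder for 83 element vectors

X_pca = PCA(n_components=2).fit_transform(X)

# The perplexity must be smaller than the number of samples (83 elements).
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(X_pca.shape, X_tsne.shape)  # (83, 2) (83, 2)
```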
We observe that the element representations, with the exception of the random vectors, possess an insightful structure in the reduced dimensions, Fig. 5 and 6. The lanthanoid elements cluster together in the non-random representations, independent of the choice of dimension reduction technique. In most of the representations, Sr, Ba and Ca tend to group closely together, which reflects their common application in substitutional mixtures, for example in tuning ferroelectric solid solutions. Interestingly, the learned, distributed representations pick up some similarities that are obvious to a trained chemist but are not captured in the local Magpie representation, such as the similarity between Bi and Sb. In the Magpie representation, H tends to be an odd-one-out element at the periphery of the distributions, whereas in the distributed representations it tends to be clustered with other elements, reflecting how it appears in training data from crystals such as HF and LiH.
The radius ratio rules were developed to rationalise the local coordination and crystal structure preferences of ionic solids.27 In this model, the coordination number of a cation is determined by the balance between the electrostatic attraction (cation–anion interactions) and repulsion (anion–anion interactions). A geometric analysis predicts that 8-fold (cubic) coordination should be obtained when the radius ratio ρ = rcation/ranion falls in the range 0.732–1.000. A 6-fold coordination environment is predicted for 0.414 < ρ < 0.732, while 4-fold coordination is predicted for 0.225 < ρ < 0.414. For binary AB solids, these regimes are typified by the CsCl (8-fold), rocksalt (6-fold), or zinc blende/wurtzite (4-fold) structures. While it is accepted that there are many cases where these rules fail, especially in the lower radius ratio regime,28 they are still commonly taught in undergraduate programs due to their instructive nature.
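The geometric ranges translate into a simple classifier; the worked example below uses the six-coordinate Shannon radii of Na+ (1.02 Å) and Cl− (1.81 Å), giving ρ ≈ 0.56 and a (correct) rocksalt prediction:

```python
def radius_ratio_structure(r_cation: float, r_anion: float) -> str:
    """Predict the AB structure type from the cation/anion radius ratio."""
    rho = r_cation / r_anion
    if 0.732 <= rho <= 1.0:
        return "CsCl (8-fold)"
    if 0.414 <= rho < 0.732:
        return "rocksalt (6-fold)"
    if 0.225 <= rho < 0.414:
        return "zinc blende/wurtzite (4-fold)"
    return "outside tabulated range"

print(radius_ratio_structure(1.02, 1.81))  # rocksalt (6-fold)
```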
To assess the utility of the various element embeddings for this problem, we follow the structure substitution procedure proposed by Hautier et al.29 as implemented in the Python package SMACT (version ≥ 2.3).30–32 In this approach, the likelihood that a new chemical composition (X) will adopt the crystal structure of a known chemical composition (X′) depends on the substitution probability function p(X, X′). The original pairwise substitution weights were learned from a training set of inorganic materials from the Inorganic Crystal Structure Database.33 Here, we instead use the cosine similarity between element representations, i.e. we assume that the preferred crystal structure is the one that maximises cos(X, X′).
Unary substitutions are considered here, i.e. where two compositions differ by one element. This allows us to approximate the probability function as $p(X, X') \approx e^{\lambda}/Z$, where Z is the partition function and λ is the metric for chemical similarity. The λ values are the pairwise substitution weights in the original model.29 In the SMACT implementation, they can be a user-defined pairwise similarity metric, which here is defined as cos(X, X′). A related procedure has been employed by Wang et al. to predict new stable compounds,34,35 and an extension based on metric learning has been reported by Kusaba et al.36
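A minimal sketch of this cosine-driven assignment is given below; it is not the SMACT implementation, and `embeddings` and `known_ab_solids` are hypothetical containers for element vectors and known AB structure types:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_structure(a_new, anion, known_ab_solids, embeddings):
    """Assign the structure of the known AB solid whose cation is most
    similar (by cosine) to the new cation, for a fixed anion
    (unary substitution on the A site only)."""
    best_score, best_structure = -np.inf, None
    for (a_known, b_known), structure in known_ab_solids.items():
        if b_known != anion or a_known == a_new:
            continue
        score = cosine(embeddings[a_new], embeddings[a_known])
        if score > best_score:
            best_score, best_structure = score, structure
    return best_structure, best_score
```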
To obtain a set of binary AB solids that adopt one of the four structure types as their ground-state structure, we queried the Materials Project (version: 2022.10.28)37 using pymatgen.38 The query was carried out using the parameters: formula=*1*1; theoretical=False; is_metal=False. This query returned 494 binary AB solids. We chose to exclude metallic materials to focus on compositions where the bonding should be heteropolar. Some of the materials in this dataset were polymorphs of the same composition; for example, 83 ZnS entries were returned. The data were filtered by keeping only the polymorph of each composition with the lowest energy above the convex hull, as an approximation for relative stability. This filter reduced the dataset from 494 materials to 233. The data were further filtered by matching the structures to one of the four aforementioned structure types using the structure_matcher module in pymatgen38 with the default parameters.
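A hedged sketch of the query and filtering steps, assuming the mp-api client (the API key is a placeholder, and field names may vary between Materials Project releases):

```python
from mp_api.client import MPRester
from pymatgen.analysis.structure_matcher import StructureMatcher

with MPRester("YOUR_API_KEY") as mpr:  # placeholder API key
    docs = mpr.materials.summary.search(
        formula="*1*1", theoretical=False, is_metal=False,
        fields=["formula_pretty", "energy_above_hull", "structure"],
    )

# Keep only the lowest-energy polymorph of each composition.
lowest = {}
for doc in docs:
    key = doc.formula_pretty
    if key not in lowest or doc.energy_above_hull < lowest[key].energy_above_hull:
        lowest[key] = doc

# The surviving structures are then matched against the four prototype
# structure types with pymatgen's StructureMatcher (default parameters).
matcher = StructureMatcher()
```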
Our process led to a dataset of 101 unique compounds. The final filter was to check that the species in the remaining compounds could be assigned oxidation states, which led to a final dataset of 100 compounds. Taking the empirical Shannon radii39 for each ion, averaged over coordination environments, the radius ratio rules are found to correctly predict the ground-state crystal structures in 54% of cases. This assessment was performed on 81 of the 100 compounds, as Shannon radii are not available for all ions. For instance, oxygen is assigned a −1 oxidation state in AgO (mp-1079720), which has no available radius. The performance is lower than the 66% reported in a recent study that assessed the predictive power of Pauling's rules, using Pauling's univalent radii, for assigning the coordination preferences of metals in a dataset of around 5000 metal oxides.40 The differences likely arise from the use of averaged Shannon radii and sensitivity to the chosen dataset (Table 2).
Table 2 Accuracy of binary AB crystal structure classification for each element substitution (similarity) measure

Element embedding | Accuracy
---|---
Random_200 | 58.0% |
Hautier et al.29 | 54.0% |
SkipAtom | 68.0% |
Oliynyk | 75.0% |
MEGNet16 | 73.0% |
Magpie | 78.0% |
MatScholar | 81.0% |
Mat2Vec | 80.0% |
The measure of performance defined here is classification accuracy. It is determined by the number of compositions with correctly predicted ground state structure, via the most probable substitution, over the total number of compositions in the dataset:
$$\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}} \qquad (7)$$
Fig. 8 Confusion matrices for the classification of binary AB crystal structures for 8 element substitution (similarity) measures.
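Both the accuracy of eqn (7) and the confusion matrices are one-liners in scikit-learn; `y_true` and `y_pred` below are toy stand-ins for the observed and predicted structure labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["CsCl", "rocksalt", "zinc blende", "wurtzite"]
y_true = ["rocksalt", "rocksalt", "CsCl", "zinc blende"]  # toy data
y_pred = ["rocksalt", "CsCl", "CsCl", "zinc blende"]      # toy data

print(accuracy_score(y_true, y_pred))                  # eqn (7)
print(confusion_matrix(y_true, y_pred, labels=labels))
```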
We find that representations derived from literature word embeddings (MatScholar and Mat2Vec) have comparable performance, with their confusion matrices being almost identical. Both capture similar correlations from the dataset of abstracts on which they were trained. The poorer performance of the original weights from Hautier et al.29 can be attributed to the absence of particular oxidation states, which led to some compositions not being assigned to a structure. This is a limitation of species-based measures compared to those based on the element identity alone. As materials databases have grown over the past decade, a greater diversity of compounds is now available beyond the original training set, and retraining these weights could extend their functionality.
Finally, we note that while we cannot exclude data leakage, as structure environments may be present in the training data for some of the chosen element vectors, this particular use case was not explicitly targeted in the training of the distributed representations.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00121k