A. Gilad Kusne *ab, Austin McDannald a and Brian DeCost a
a Materials Measurement Science Division, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA. E-mail: aaron.kusne@nist.gov; austin.mcdannald@nist.gov; brian.decost@nist.gov
b Department of Materials Science and Engineering, University of Maryland, College Park, MD 20742, USA
First published on 27th September 2024
Autonomous materials research labs require the ability to combine and learn from diverse data streams. This is especially true for learning material synthesis–process–structure–property relationships, key to accelerating materials optimization and discovery as well as accelerating mechanistic understanding. We present the Synthesis–process–structure–property relAtionship coreGionalized lEarner (SAGE) algorithm. A fully Bayesian algorithm that uses multimodal coregionalization and probability to merge knowledge across data sources into a unified model of synthesis–process–structure–property relationships. SAGE outputs a probabilistic posterior including the most likely relationship given the data along with proper uncertainty quantification. Beyond autonomous systems, SAGE will allow materials researchers to unify knowledge across their lab toward making better experiment design decisions.
Knowledge of the synthesis–process–structure–property relationship (SPSPR) plays a fundamental role across materials research, whether the research is performed by hand or through an automated or autonomous system. Researchers use knowledge of the SPSPR as a blueprint to navigate the high-dimensional complex search space toward novel and optimized materials and to explore the underlying mechanistic origins of material properties. As a result, an algorithm that properly unifies diverse materials data into SPSPR models may accelerate all these activities, impacting much of materials research. For example, such an algorithm can exploit the SPSPR to dramatically improve prediction accuracy of a target functional property, despite sparsity of data. This improved prediction would then better guide subsequent research, which would in turn boost SPSPR knowledge.
Building the SPSPR blueprint involves combining knowledge of material synthesis and process conditions, lattice structure (and potentially microstructure), as well as the diverse set of functional properties required to meet the technological requirements. This requires integrating data across different instruments and measurement modalities, each dependent on differing physical principles. Additionally, measurements can vary based on instrument calibration, measurement parameter settings, environmental conditions such as temperature and humidity, and each instrument user's measurement process. Even instruments of the same make and model differ based on unique biases, uncertainties, and data artifacts.
As a very common example, researchers often start their search for improved materials with the phase map of the target material system. A phase map (or ‘phase diagram’ for equilibrium materials) visualizes the synthesis–structure relationship. An example phase map is shown in Fig. 1a for the (Bi,Sm)(Sc,Fe)O3 material system.1 Here the phase map relates material composition (the target synthesis conditions) to resulting lattice structure, described in terms of phases, i.e., composition-structure prototypes. The phase map is divided into phase regions – contiguous regions of synthesis-process space (experiments of varying synthesis and process conditions) that result in materials of the same set of phases. The regions are separated by phase boundaries (dashed black lines). Material phase information is predictive of many functional properties. Materials with property extrema tend to occur either within specific phase regions (e.g., magnetism and superconductivity) or along phase boundaries (e.g., caloric-cooling materials). Thus, a materials researcher can use phase maps to guide their studies toward synthesis and process conditions that are expected to produce materials with more promising properties.
Fig. 1b visualizes a (Bi,Sm)(Sc,Fe)O3 SPSPR by combining the phase map with the functional property of coercive electric field magnitude (CEFM).2,3 Circles indicate experimentally characterized materials and circle color indicates measured CEFM. The CEFM is highly dependent on both synthesis conditions and phase, with the highest values occurring with ‘open’ hysteresis loops in the rhombohedral R3c phase region. Additionally, the composition dependence of CEFM differs significantly between phase regions, with greater variation occurring in the R3c and Pnma phase regions than in the intermediary region. In general, discontinuities in functional property values may also occur at phase boundaries. Thus, functional properties can be represented as piecewise functions of the synthesis parameters (in this case composition), with each ‘piece’ of the piecewise function associated with a phase region. This representation allows function behavior to change significantly from region to region and discontinuities to occur at phase boundaries.
For this example, data for materials synthesis and structure are used to build a phase map, and that phase map is then used to guide understanding of target property data. Knowledge flows in one direction only, from structure to functional property. With a proper SPSPR learning algorithm, these diverse data could be combined in a unified model where knowledge of the phase map would improve analysis and prediction of functional properties and vice versa. For example, significant changes in functional properties may indicate a phase boundary and thus improve analysis and prediction of materials structure. Such an algorithm would boost prediction accuracy across materials research and better guide subsequent experiments, but until now such an algorithm has been lacking.
To overcome the dual challenges of a complex, high-dimensional search space and data fusion in SPSPR learning, we present the synthesis–process–structure–property relAtionship coreGionalized lEarner (SAGE). The SAGE algorithm is available as part of the Hermes library, https://github.com/usnistgov/hermes, and as a standalone library, https://github.com/KusneNIST/SAGE. SAGE is a Bayesian machine learning (ML) algorithm that combines three features: (1) ML-based segmentation of the synthesis–process space using material synthesis, process, and phase data, where the segments are phase regions, the collection of phase regions forms the synthesis–process–phase map, and the synthesis–process–phase map is used to extrapolate the synthesis–process–structure relationship to new materials; (2) piecewise regression to fit and extrapolate synthesis–process–property relationships; and (3) coregionalization, which allows multimodal, disparate knowledge of structure1 and property,2 both gathered across the shared domain of synthesis and processing conditions, to be combined to exploit shared trends. Here the term multimodal refers to learning from disparate data sources (similar to its use in the common machine learning challenge of learning from text, audio, and image data). SAGE combines these three features to learn the most likely SPSPR model given material synthesis, process, structure, and property data. Multimodal learning arises from exploiting both structure and functional property data to improve phase mapping (rather than using the structure data alone) and both structure and functional property data to improve functional property regression (rather than using the functional property data alone). The language of probability is used to unify knowledge across multimodal data, with assumptions represented as priors and data combined through likelihoods. Additionally, SAGE's Bayesian framework allows for full uncertainty quantification and propagation.
Much of machine learning focuses on algorithms that provide “point estimate” outputs, i.e., they provide analysis or prediction without uncertainty. Proper uncertainty quantification and propagation requires explicitly expressing uncertainties in all variables and data, and then propagating these uncertainties through all computations to provide the uncertainty in the algorithm's outputs. A set of statistical learning algorithms such as Gaussian process regression were analytically developed to explicitly and properly manage uncertainties.4 Due to the complexity (and widespread use) of many algorithms, computational methods are often employed to approximate output uncertainties without significantly changing the main algorithm.5 Alternatively, statistical methods such as Bayesian inference can be used to build probabilistic models. With Bayesian inference, uncertainties are explicitly expressed and combined with Bayes' rule to output the posterior probability (a probabilistic representation of uncertainty) – the probability distribution of the model given the data.6 When analytically intractable, sampling methods such as Markov chain Monte Carlo (MCMC) can be used to estimate these posterior probabilities.6 Implementing these techniques is made easier through probabilistic programming languages such as Pyro and Turing.7,8 SAGE employs Bayesian inference and MCMC for uncertainty quantification and propagation.
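As a minimal sketch of this workflow (not SAGE itself), consider inferring the posterior over an unknown mean and noise level with the numpyro probabilistic programming language; the data and priors below are hypothetical.

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

# Hypothetical noisy observations of an unknown quantity.
observations = jnp.array([0.9, 1.1, 1.05, 0.95, 1.2])

def model(obs):
    # Priors explicitly express assumptions about plausible parameter values.
    mu = numpyro.sample("mu", dist.Normal(0.0, 10.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    # The likelihood combines the data with the priors via Bayes' rule.
    numpyro.sample("y", dist.Normal(mu, sigma), obs=obs)

# NUTS is an MCMC sampler; its samples approximate the posterior distribution.
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), observations)
mcmc.print_summary()  # posterior means, standard deviations, credible intervals
```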
A schematic of SAGE is provided in Fig. 1c. Here SAGE takes in data streams from the material synthesis and processing systems, structure characterization instruments, as well as functional property characterization instruments. Each structure data stream is first processed using a phase analysis algorithm as described below. SAGE then learns the SPSPR from the combined phase analysis data streams and the functional property data streams. SAGE's output SPSPR posterior can be broken down into posteriors over the synthesis–process–structure phase map and the functional properties. These posteriors can then be integrated into either an experiment recommendation engine or a closed-loop autonomous materials laboratory,9 which can guide subsequent experiments and measurements in structure and functional property. For example, an autonomous system could target maximizing knowledge of the SPSPR or optimizing a material for a set of target functional properties.
Each of SAGE's features has a diverse history. The first feature, ML-based phase mapping, has seen the development of an array of algorithms over the last few decades.3,10–14 These algorithms combine two tasks: (1) data analysis – analyzing structure data to identify phase abundances or phase regions – and (2) extrapolation – extrapolating phase knowledge from measured materials to unmeasured materials. Data analysis techniques (i.e., phase or phase region identification) include matrix factorization, peak detection, graphical model segmentation, constraint programming, mixed integer programming, and deep learning, among others.3,15–23 For an example of such an algorithm applied to the provided datasets, including a thorough description of these datasets, we refer the reader to ref. 3. Extrapolation algorithms have focused primarily on the use of graph-based models or Gaussian processes (GP).16,17,24,25 (For a brief overview of Gaussian processes, see Section 4.2.6.) For the present work, we assume the task of structure data analysis is addressed with one of the many available algorithms. We indicate the use of one of these algorithms with the function m(Ds) applied to structure dataset Ds, as described below. SAGE therefore begins with knowledge of phase and focuses on the task of extrapolating phase map knowledge through Bayesian coregionalized synthesis and process space segmentation.
Piecewise function regression algorithms have a much longer history. This includes the common challenge of detecting data discontinuities – also known as jumps or changepoints, which can be generalized to higher dimensions as edges,26 change-boundaries, and change-surfaces. Changepoint detection algorithms are quite diverse, using function derivatives, filter convolution, Bayesian inference, and more recently, adaptive design.27 Common methods for piecewise regression include linear piecewise algorithms and splines. We point the reader to review articles in these fields.28,29 Specifically for GPs, multiple piecewise modeling methods exist30 including the use of the changepoint kernel (below called GP-CP).31
The field of coregionalization developed from geospatial science to learn functions with shared trends over the same physical domain.32,33 Data for each target function need not be collected at the same set of points in the input domain.33 For example, if one seeks to learn f1: x → y and f2: x → s from data D1 = {(xk, yk)}k=1…N and D2 = {(xl, sl)}l=1…M, the sets of input locations {xk} and {xl} are not required to coincide. Alternative methods for jointly learning related functions include multi-task learning and co-kriging, including multi-task Gaussian processes,5,34 as well as constraint programming methods and Bayesian methods.33,35,36 These algorithms focus on exploiting similarities between functions over the full underlying shared domain, assume the set of output functions are similar (e.g., all continuous), and assume that each experiment is characterized similarly. Recent work tackles learning heterogeneous sets of functions, such as a mix of continuous, categorical, and binary outputs.37 These algorithms assume a correlation between a set of latent functions that contribute to the observed output functions.
Our challenge is unique. While we seek to jointly learn the synthesis–process–structure relationship and synthesis–process–property relationships, the correlation of interest between these relationships is purely that of discontinuities, rather than correlations over the full synthesis–process domain. We assume that phase boundaries indicate potential change surfaces in functional properties, and vice versa. We wish to jointly learn these phase boundaries and utilize them to define piecewise functions for the functional properties, allowing for different property behavior in different phase regions. Prior algorithms fail for this challenge because the synthesis–process–structure and synthesis–process–property relationships are not correlated over the full synthesis–process domain (this is also true for latent function representations). Additionally, SAGE utilizes coregionalization to allow different measurements to be performed at different locations in the shared synthesis and process domain. This is commonly the case when materials synthesis and processing experiments take equal or less time than the measurements, or when combining data collected at different times or by different labs.
To the authors' knowledge, the only algorithm that addresses the same challenge is the closed-loop autonomous materials exploration and optimization (CAMEO) algorithm.16 CAMEO first learns phase boundaries from synthesis, process, and structure data and then utilizes this knowledge to define the change boundaries in the piecewise function used to fit and model functional property data. This two-step approach was employed in driving an X-ray diffraction-based autonomous (robot) materials research system in the study of phase-change memory material. The study resulted in the discovery of the current best-in-class phase-change memory material – the first autonomous discovery of a best-in-class solid state material.16 SAGE improves on CAMEO by allowing full Bayesian uncertainty quantification and propagation, thus providing simultaneous information sharing between the structure and property measurements. SAGE jointly solves for the SPSPR to better exploit shared trends across structure and property data and improve SPSPR knowledge. SAGE is offered as a module of CAMEO, i.e., CAMEO-SAGE.
The present data science challenge is generalizable beyond learning SPSPR. One can use SAGE to address the more common issue of having successful and failed experiments across a shared experiment parameter domain. SAGE would then learn and exploit knowledge of the success–failure boundary to improve prediction of properties for either type of experiment. Additionally, SAGE addresses data fusion across instruments, measurement modalities, and labs. The common approach to this data fusion challenge is to map data from different sources into the same data space, allowing comparison. For example, data fusion for X-ray diffraction (XRD) measurements from two different XRD instruments requires removing source-based data artifacts, including instrument effects that are convolved into the data. To do this, the data must be mapped from the instrument-specific independent variable space (2θ) to an instrument-free independent variable space (q), while also accounting for differences in finite resolution in 2θ space, absolute intensities and counting times, beam wavelength dispersion, and background signals, amongst other considerations. In general, mapping data to an instrument-invariant (also lab-invariant, weather-invariant, etc.) space requires a significant amount of metadata that is often not available.
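For the XRD example, the instrument-specific angle 2θ maps to the instrument-free momentum transfer q through the beam wavelength λ via q = 4π sin(θ)/λ; a minimal sketch of this conversion, ignoring the resolution, intensity, and background corrections noted above:

```python
import numpy as np

def two_theta_to_q(two_theta_deg: np.ndarray, wavelength: float) -> np.ndarray:
    """Convert diffraction angle 2-theta (degrees) to momentum transfer q (1/Angstrom)."""
    theta = np.radians(two_theta_deg) / 2.0
    return 4.0 * np.pi * np.sin(theta) / wavelength

# Example: Cu K-alpha radiation, wavelength ~1.5406 Angstrom.
q = two_theta_to_q(np.array([20.0, 40.0, 60.0]), wavelength=1.5406)
```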
An alternative is to independently analyze the data from each source and then combine the derived knowledge across sources. SAGE allows such limited-metadata data fusion. The idea behind coregionalization, as implemented in SAGE, is that the boundaries identified by one measurement method are also boundaries in the other measurement methods – regardless of whether those measurement methods are all nominally the same technique (e.g., several different XRD instruments) or different techniques (e.g., an XRD instrument and electrical coercivity measurements). For example, for structure data, one performs phase mapping analysis for each data source and then SAGE coregionalization combines knowledge across sources. A similar benefit exists for functional property data by treating data from each source as a different target property, e.g., coercivity_data_source_1 and coercivity_data_source_2. Additionally, SAGE may be applied to cases where only structure data or only functional property data is obtained.
The contributions of this work are:
• Extending Bayesian coregionalization algorithms to 1-dimensional and N-dimensional joint segmentation and piecewise regression.
• Associated constraint programming algorithms for coregionalized joint segmentation and piecewise regression.
• Demonstration of Bayesian algorithms for learning SPSPR in a unified model.
SAGE is a physics-informed (also known as inductive-bias informed) machine learning algorithm.38 A wide array of methods exists for integrating prior physical knowledge into machine learning, including engineered descriptors,39,40 latent mappings,12 constrained solution spaces,41 and kernels,42 among many others. For example, a physics-informed algorithm was designed for autonomous, closed-loop control over neutron scattering to accelerate characterization of temperature-dependent magnetic structure.43 The authors represent the temperature-dependent structure as a stochastic process with neutron scattering-defined measurement uncertainties as well as a mean function prior defined by the physics of magnetism. The algorithm resulted in a fivefold acceleration in measurement efficiency. Such physics-informed methods provide greater performance and lend greater interpretability to the machine learning model, providing more physically meaningful solutions. However, no previous algorithm provides the contributions listed above.
While the provided implementation of SAGE is a surrogate model, its framework allows easy modification to embed greater prior knowledge and to increase interpretability. Target functional properties are currently defined through samples of multivariate normal distributions, similar to a Gaussian process. To increase interpretability, users can replace these samples with samples of potentially descriptive parametric models (as well as a parameter that selects between the models). SAGE will then identify the most likely model and posteriors over its parameter values. In this way a user can exploit SAGE's built-in coregionalization of functional property with phase mapping (i.e., enforced SPSPR) to boost data analysis. Additionally, one can modify parameter priors – for example, setting the prior over segmentation length scales to a Gamma distribution to increase the bias toward smaller or larger phase regions.
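A sketch of such a prior swap in numpyro; the site name and the Gamma shape/rate values are illustrative assumptions, not SAGE defaults.

```python
import numpyro
import numpyro.distributions as dist

def segmentation_length_scale():
    # Default-style prior: uniform over a plausible range of length scales.
    # ls = numpyro.sample("ls", dist.Uniform(0.1, 2.0))

    # Modified prior: a Gamma distribution concentrating mass at small length
    # scales, biasing the segmentation toward smaller phase regions
    # (shape/rate values are illustrative).
    ls = numpyro.sample("ls", dist.Gamma(concentration=2.0, rate=4.0))
    return ls
```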
As described above, SAGE jointly performs two tasks: (1) segmenting the synthesis–process domain X into phase regions, i.e., a phase map, and (2) regression of the functional properties. SAGE tackles both tasks by exploiting shared knowledge (i.e., multimodal knowledge) across the structure and functional property data. We compare SAGE to a set of common algorithms and modified SAGE algorithms as described below (see section Additional models). SAGE's phase mapping capabilities are compared to those of GPC and the modified SAGE algorithms SAGE-1D-PM and SAGE-ND-PM, which use only synthesis–process and structure data to segment the synthesis–process space. SAGE's functional property regression capability is compared to those of GPR, GP-CP, and the modified SAGE algorithms SAGE-1D-FP and SAGE-ND-FP, which use only synthesis–process and functional property data. Here the GP-CP algorithm seeks to perform functional property regression while also identifying change points (i.e., phase boundaries) without access to structure data. SAGE's capability for jointly performing phase mapping and functional property regression from synthesis, process, structure, and property data is compared to CAMEO's regression algorithm, which first uses synthesis–process and structure data to identify a phase map and then employs the phase map in piecewise regression for functional property data.
For each experiment, an algorithm is given a subset of materials data – for each challenge, all algorithms are given the same set of data (as visualized by red and black markers for 1D challenges and red markers for 2D challenges in the figures below) – and each algorithm is then used to predict one or both of the following (depending on its capabilities): (1) the phase map over the synthesis–process domain X, and (2) the functional properties over X. For 1D challenges this is over a 100-point grid and for 2D challenges a 40 × 40-point grid (see Fig. 2 and 3 for ranges). For instance, GPC takes a subset of data in X and the associated structure data and then predicts the phase map over the associated X grid; GPR takes a subset of data for X and the associated functional property data to provide functional property regression over the X grid.
Phase mapping performance is measured by comparing predicted phase map labels to ground truth (over the grid) using the micro F1 accuracy score. Ground truth for the 1D and 2D cases can be seen in the red and black dashed curves of Fig. 2 and the color-coded values of Fig. 3. Functional property regression performance is quantified by comparing predicted regression models with the ground truth using the typical coefficient of determination R2. A description of these measures can be found in the Methods section. Furthermore, the performance of MCMC-computed algorithms is based on their posterior mean and the performance of variational inference-based algorithms (e.g., GPs) is based on their maximum likelihood estimate (MLE) mean as given in Table 1.
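Both scores can be computed with scikit-learn; the label and property arrays below are placeholders.

```python
from sklearn.metrics import f1_score, r2_score

# Placeholder ground-truth vs. predicted phase-region labels over the grid.
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]
phase_map_score = f1_score(true_labels, pred_labels, average="micro")

# Placeholder ground-truth vs. predicted functional property values.
true_property = [0.10, 0.40, 0.35, 0.80]
pred_property = [0.12, 0.38, 0.40, 0.75]
regression_score = r2_score(true_property, pred_property)
```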
Phase map performance, micro F1 accuracy score [arb. units]

| 1D challenges | SAGE-1D (post. mean) | SAGE-1D-PM (post. mean) | SAGE-1D-FP (post. mean) | GP-CP (max likelihood) | GP classification (max likelihood) | CAMEO prediction |
|---|---|---|---|---|---|---|
| 1D edge case 1 | 1.00 | 1.00 | 0.89 | 0.82 | 1.00 | 1.00 |
| 1D edge case 2 | 0.99 | 0.89 | 0.99 | 0.90 | 0.90 | 0.86 |

| 2D challenges | SAGE-ND (post. mean) | SAGE-ND-PM (post. mean) | SAGE-ND-FP (post. mean) | GP classification (max likelihood) | CAMEO |
|---|---|---|---|---|---|
| 2D edge case 1 | 0.98 | 0.97 | 0.85 | 0.98 | 0.94 |
| 2D edge case 2 | 0.98 | 0.92 | 0.97 | 0.93 | 0.53 |
| (Bi,Sm)(Sc,Fe)O3 | 0.97 | 0.94 | 0.61 | 0.89 | 0.99 |
| FeGaPd | 0.95 | 0.93 | 0.13 | 0.99 | 0.96 |

Functional property performance, R2 [arb. units]

| 1D challenges | SAGE-1D (post. mean) | SAGE-1D-FP (post. mean) | GP-CP (max likelihood) | CAMEO prediction |
|---|---|---|---|---|
| 1D edge case 1 | 0.99 | 0.98 | 0.96 | 1.00 |
| 1D edge case 2 | 1.00 | 1.00 | 0.98 | 0.92 |

| 2D challenges | SAGE-ND (post. mean) | SAGE-ND-FP (post. mean) | GP regression (max likelihood) | CAMEO |
|---|---|---|---|---|
| 2D edge case 1 | 0.88 | 0.53 | 0.67 | 0.86 |
| 2D edge case 2 | 0.89 | 0.87 | 0.62 | 0.67 |
| (Bi,Sm)(Sc,Fe)O3 | 0.91 | 0.27 | 0.84 | 0.87 |
| FeGaPd | 0.91 | 0.87 | 0.90 | 0.91 |
For the first edge case, SAGE-1D MLS combines structure and functional property knowledge to outperform GP-CP in predicting both functional property and phase boundary. SAGE-1D's slanted transition at the phase boundary (Fig. 2a) indicates a range of potential phase boundary locations between the two structure data points (range is also indicated by the dotted green lines). SAGE-1D and SAGE-1D-PM have similar performance in identifying the phase boundary location, providing similar posteriors (inset). SAGE-1D employs phase boundary uncertainty to better quantify its regression uncertainty as indicated by the wider confidence intervals.
For the second edge case, SAGE-1D MLS and GP-CP have similar regression performance due to the highly informative functional property data. However, SAGE-1D outperforms SAGE-1D-PM in locating the phase boundary, as it exploits functional property data to greatly narrow the range of potential locations. A further comparison between SAGE-1D, GP-CP, SAGE-1D-PM, SAGE-1D-FP, and GP classification is presented in Table 1. Knowledge of the changepoint location is limited to the two nearest data points, whether functional property or structure data. As a result, functional property prediction performance is measured outside the range of the two nearest data points.
In Fig. 4 we demonstrate the ND algorithm for the 2D case with 2 structure data sources and 2 functional property sources. Here the first structure data source provides more information for the upper part of the phase boundary and the second source provides more information for the lower part of the boundary. SAGE-ND unifies knowledge across all four data sources to obtain good prediction of both phase map and the two functional properties.
Fig. 5 SAGE-ND applied to the (Bi,Sm)(Sc,Fe)O3 dataset, where structure data is more informative of the phase boundaries. Fig. 1 shows the ground truth. (a1) Phase map estimate indicated by color coding, with structure data indicated by red squares and phase boundaries indicated by dashed black lines. (a2) Entropy-measured uncertainty in the phase map of (a1). (b1) CEFM estimate with functional property data indicated by red squares. (b2) Variance-measured uncertainty for the CEFM estimate. SAGE-ND utilizes the more informative structure data to identify the phase regions and uses this information to better identify the varying CEFM behavior in each phase region.
For the second materials challenge demonstration, SAGE-ND is applied to learn an SPSPR for a FeGaPd3 composition spread dataset of X-ray diffraction structure measurements and remanent magnetization, as shown in Fig. 6. Ground truth phase mapping and remanent magnetization are shown in Fig. 6a1 and 6a2, respectively. SAGE prediction estimates are shown in Fig. 6b1 and 6c1, and uncertainties in Fig. 6b2 and 6c2, respectively.
For both material systems, a comparison of SAGE-ND with SAGE-ND-PM, SAGE-ND-FP, GP classification, GP regression, and CAMEO is shown in Table 1. For the (Bi,Sm)(Sc,Fe)O3 dataset, SAGE-ND provides 97% (i.e., 0.97 out of 1.00) or greater phase mapping accuracy, though not the top accuracy among algorithms. For (Bi,Sm)(Sc,Fe)O3, CAMEO outperforms SAGE-ND by 2%, and for FeGaPd, GPC outperforms SAGE-ND by 4%. For functional property predictions, SAGE-ND provides the best (or tied for best) performance.
Better performance in predicting functional properties than in phase mapping is to be expected, as the structure data carries more information about the phase boundaries than the functional property data does. Thus, SAGE exploits knowledge from the structure data (knowledge of the synthesis–process–structure relationship) to boost functional property prediction.
While SAGE-ND does not always provide the best results, it does provide proper uncertainty quantification compared to CAMEO. Additionally, if the SAGE assumption that the target functional property behavior is dependent on phase is true for the material system of interest, SAGE exploits this relationship to improve uncertainty quantification using both structure and functional property data compared to methods (e.g., GP methods) that can only utilize either structure or functional property data. SAGE is also the only algorithm which provides a single model for quantifying both prediction and uncertainty for the synthesis–process–structure–property relationship.
Model output estimates and uncertainties can be employed in active learning-driven recommendation engines or closed-loop autonomous systems to ensure optimal selection of subsequent experiments. For example, the phase map estimate and uncertainty can guide subsequent structure measurements toward improved phase map knowledge, while the paired functional property estimates and uncertainties guide materials optimization. With each experiment increasing knowledge of separate portions of the SPSPR, SAGE can play a part in unifying knowledge across a research lab toward the discovery of advanced materials.
(1)
If the loss functions are additive across datasets, we have:
(2)
One implementation has fs map each point to an integer label associated with a given phase region. The function ms is then required to map the structure data to potential phase region labels, similar to those of ref. 3, 10–13 and 15. Alternatively, one may want the overall algorithm to identify phase abundances for each material x. For this case, ms identifies phase abundances and maps x to phase region labels. Abundance regression can then be performed by including abundances in the list of target properties yp.
The Bayesian model presented below can be solved using such an objective function. Here, ds and dp are the negative log likelihood functions:
ds = −ln[p(fs(x, θs)|m(Ds))], (3)
dp = −ln[p(fp(x, fs, θp)|Dp,j)], (4)
Model 1 provides the general model. One samples the function parameter priors for θs and θp = {θp,j,r} for each j of J functional properties (or functional property data sources) and each r of R phase regions. fp,j is a piecewise random process with different behavior fp,j,r for each functional property in each phase region, i.e., different kernel hyperparameters for each phase region. fs is used to compute the categorical distribution p(r(x)) of phase region labels for each point x. p(r(x)) is used to compute the sum log likelihood Ls of structure data observations and to identify phase region label probabilities for each functional property observation data point xp. The sum log likelihood of the observed functional properties Lp is computed using these probabilities and the piecewise fp,j. The total likelihood L is then returned, guiding Bayesian inference sampling. The implementations and associated code can be used with an arbitrary number of data sources. Sampling from GPs uses the Cholesky decomposition method to improve MCMC stability.4
After Bayesian inference is run, i.e., after each sample b of B total MCMC samples is collected, the Bayesian posteriors for the phase map and the functions describing the functional properties are approximated. Here the categorical distribution describing the phase map is computed as pM = meanb[pb(r)], the posterior mean over the sampled categorical distributions. The phase map estimate r̂ and uncertainty e are then computed as r̂ = argmaxr[pM] and e = entropyr[pM]. Each functional property is described by the posterior multivariate normal distribution N(meanb[fp,j,b], stdb[fp,j,b]) with additional measurement noise meanb[np,j,b].
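A minimal sketch of this post-processing, assuming `p_samples` holds the B sampled categorical distributions (shape B × grid points × regions) and `f_samples` the B sampled property functions (shape B × grid points); both array names are hypothetical.

```python
import numpy as np
from scipy.stats import entropy

def summarize_posterior(p_samples: np.ndarray, f_samples: np.ndarray):
    p_M = p_samples.mean(axis=0)               # posterior mean categorical distribution
    phase_map = np.argmax(p_M, axis=-1)        # phase map estimate, r-hat
    phase_uncertainty = entropy(p_M, axis=-1)  # entropy-measured uncertainty, e
    property_mean = f_samples.mean(axis=0)     # posterior mean of the property
    property_std = f_samples.std(axis=0)       # posterior uncertainty
    return phase_map, phase_uncertainty, property_mean, property_std
```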
The SAGE algorithms make use of latent functions. One set of latent functions are used to identify the probabilities of each point x belonging to a specific phase region. These probabilities are then multiplied by an additional set of latent functions describing target functional properties, in effect weighting these second set of functions to bound them to target phase regions. Through this combination of latent functions, one can identify regions in X that may contain significant changes in phase and/or functional properties and may be of interest for further experiments. Statistical analysis of multiple samples of latent functions provides a posterior distribution for phase map and piecewise functional properties.
The presented implementation is developed from that of ref. 25. The functional property in each phase region is represented by an independent radial basis function kernel Gaussian process, with θp including: lr,j, the kernel length scale; sr,j, the kernel standard deviation (also known as ‘scale’); and nj, the measurement noise standard deviation. For this work, we assume that nj is the same for property j across all phase regions. For each property, the region-specific functions fp,j,r are sampled from GP(θp,j,r) and then combined using the probabilistic weights p(r) to give the piecewise functions fp,j. fp,j describes the sample mean and nj the sample noise of the multivariate distribution N(fp,j, nj) used to describe a potential generating random process. The data likelihood is then given by p(Dp,j|N(fp,j, nj)).
Example implementation:
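Below is a minimal 1D sketch in numpyro of the generative structure described above (one changepoint, two phase regions, one functional property). The sigmoid used to soften the region assignment, its steepness, and the assumption that structure and property data share the same measurement locations x are illustrative simplifications; the prior ranges follow the 1D settings listed in the implementation details below.

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def rbf_cov(x, length, scale, jitter=1e-6):
    """Radial basis function covariance matrix over 1D inputs x."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return scale**2 * jnp.exp(-0.5 * d2 / length**2) + jitter * jnp.eye(x.shape[0])

def sage_1d(x, phase_labels, y):
    # (1) Segmentation: sample a changepoint theta_s and form the soft
    # categorical distribution p(r(x)) over the two phase regions.
    cp = numpyro.sample("theta_s", dist.Uniform(0.5, 1.0))
    p_region1 = 1.0 / (1.0 + jnp.exp(-200.0 * (x - cp)))  # illustrative steepness
    p_r = jnp.stack([1.0 - p_region1, p_region1], axis=-1)

    # Structure-data likelihood L_s: observed phase-region labels.
    numpyro.sample("L_s", dist.Categorical(probs=p_r), obs=phase_labels)

    # (2) Piecewise regression: an independent latent GP sample per region.
    noise = numpyro.sample("n", dist.Uniform(0.001, 0.01))
    f_regions = []
    for r in range(2):
        length = numpyro.sample(f"l_{r}", dist.Uniform(0.2, 1.0))
        scale = numpyro.sample(f"s_{r}", dist.Uniform(0.01, 2.0))
        # Cholesky-parameterized GP sample for MCMC stability.
        z = numpyro.sample(f"z_{r}", dist.Normal(jnp.zeros(x.shape[0]), 1.0))
        f_regions.append(jnp.linalg.cholesky(rbf_cov(x, length, scale)) @ z)

    # (3) Coregionalization: weight the region functions by p(r) to form the
    # piecewise functional property f_p.
    f_p = sum(p_r[:, r] * f_regions[r] for r in range(2))

    # Functional-property likelihood L_p.
    numpyro.sample("L_p", dist.Normal(f_p, noise), obs=y)
```

Running `MCMC(NUTS(sage_1d), ...)` as in the earlier snippet yields posterior samples of the changepoint, the region-specific functions, and the noise, from which the phase map and property posteriors are summarized as described above.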
We compare SAGE to off-the-shelf GP algorithms and modified versions of SAGE. We compare SAGE's phase mapping (PM) performance with a version of SAGE which only takes structure data as input: for 1D challenges this is Model 4, ‘SAGE-1D-PM’, and for 2D challenges this is Model 6, ‘SAGE-ND-PM’. We compare SAGE's functional property (FP) prediction performance with versions that only take in functional property data, i.e., piecewise Gaussian process regression: for 1D challenges this is Model 5, ‘SAGE-1D-FP’, and for 2D challenges Model 7, ‘SAGE-ND-FP’. For these algorithms that rely on just one input data type, performance is expected to be high when data is exhaustive, while for partial data the joint SAGE model will outperform these models. These additional algorithms are available as part of the SAGE library.
• Model 4, SAGE-1D-PM: this algorithm mirrors SAGE-1D but excludes functional property regression. The algorithm is described the same as Model 2 lines 1–4 and returns Ls.
• Model 5, SAGE-1D-FP: this algorithm mirrors SAGE-1D but excludes the phase mapping loss term. It is thus a 1-dimensional piecewise GP. The algorithm is described the same as Model 2 lines 1–3 and 5–11 and returns Lp.
• Model 6, SAGE-ND-PM: this algorithm mirrors SAGE-ND but excludes the task of functional property regression. The algorithm is described the same as Model 3 lines 1–6 and returns Ls.
• Model 7, SAGE-ND-FP: this algorithm mirrors SAGE-ND but excludes the phase mapping loss term. It is thus an N-dimensional piecewise GP. The model is described the same as Model 3 lines 1–5 and 7–14 and returns Lp.
• GP-CP; GPR; GPC: the implementations use the radial basis function kernel for regression and the Matern 5/2 kernel with a MultiClass likelihood for classification. All use the truncated Newton method for optimization (see the sketch after this list).
• CAMEO – only the piecewise regression task. This model follows that of ref. 16. A Gaussian random field (GRF) is defined for the material system including both characterized and potentially characterized materials. The GRF is applied to the structure data to segment the material system, and that segmentation is then combined with off-the-shelf Gaussian process regression, using different hyperparameters for each phase region.
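A sketch of the GP-CP baseline in GPflow, with placeholder data; the initial values mirror the 1D GP-CP settings listed below, and the classifier analogously pairs gpflow.kernels.Matern52 with gpflow.likelihoods.MultiClass in a variational model.

```python
import gpflow
import numpy as np

# Placeholder 1D synthesis inputs and functional property observations.
X = np.random.rand(30, 1)
y = np.sin(6.0 * X) + 0.05 * np.random.randn(30, 1)

# GP-CP: two RBF (squared exponential) kernels joined by a changepoint kernel.
kernel = gpflow.kernels.ChangePoints(
    [gpflow.kernels.SquaredExponential(lengthscales=0.2),
     gpflow.kernels.SquaredExponential(lengthscales=0.2)],
    locations=[0.5],
    steepness=100.0,
)
model = gpflow.models.GPR((X, y), kernel=kernel, noise_variance=0.01)

# Truncated Newton optimization of hyperparameters and changepoint location.
gpflow.optimizers.Scipy().minimize(
    model.training_loss, model.trainable_variables, method="TNC"
)
```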
SAGE was run on a laptop (6 core 2.7 GHz, 32 GB memory, NVIDIA‡ Quadro P620) and runs within a few minutes, e.g., less than 2 minutes for the (Bi,Sm)(Sc,Fe)O3 material system example. All implementations are built to boost performance through parallelization across multiple CPUs by changing the “number of available cores” and “number of chains”. This dramatically accelerates computation; for example, parallelization across a 100-CPU node allows the number of MCMC samples per CPU to be reduced by a factor of 100. The choices of MCMC sample number indicated below were found to provide convergence in the posterior predictions.
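An illustrative numpyro configuration of this parallelization, using the sage_1d model sketched earlier; the core and chain counts are arbitrary.

```python
import numpyro
from numpyro.infer import MCMC, NUTS

# Make four CPU cores visible to JAX; must be called before any JAX computation.
numpyro.set_host_device_count(4)

# Spreading sampling across chains lets the per-chain sample count be reduced
# proportionally while keeping the total number of posterior samples fixed.
mcmc = MCMC(NUTS(sage_1d), num_warmup=100, num_samples=250, num_chains=4)
```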
Here we provide initial values or uniform prior ranges for the implementation. If a parameter is not mentioned, it is the default initial value or range of the library used.
• GPs
○ All GP implementations are written in gpflow.45
○ 1D GP-CP: initial length scale = 0.2; initial change point steepness = 100; noise variance = 0.01; max iterations = 10000;
○ 2D GPR: initial lengthscales = 1; noise variance = 0.005 and range [0.001, 0.01]; max iterations = 1000;
○ 1D and 2D GPC: max iterations = 1000;
• All MCMC algorithms
○ Number warmup samples = 100; number samples = 1000; target acceptance probability = 0.8; max tree depth = 5; jitter = 1 × 10−6
○ SAGE-1D, SAGE-1D-PM, SAGE-1D-FP are written in numpyro46 with parameters: sr,j = [0.01, 2.]; lr,j = [0.2, 1.]; nj = [0.001, 0.01]; change point bounds, i.e., θs = [0.5, 1.]
○ SAGE-ND, SAGE-ND-PM, SAGE-ND-FP are written in numpyro with JAX, with parameters:
■ SVI initialization of phase map: number of samples = 100000; Adam step size = 0.05;
■ MCMC algorithm: ss = [5.,10.]; ls = [0.1,2.]; sr,j = [0.1, 2.]; lr,j = [1,2.]; nj = [0.001,0.1]; br,j = [−2., 2.]
○ SAGE-ND Multiple data sources:
■ SVI initialization of phase map: number of samples: 10000; Adam step size = 0.01; ss = [5.,10.]; ls = [1.,2.];
■ MCMC algorithm: number of warmup steps: 100; number of samples: 2000; number of chains = 100; ss = [5.,10.]; ls = [0.1,2.]; sr,j = [0.1, 2.]; lr,j = [0.1,5.]; nj = [0.001,0.1]; br,j = [−2., 2.]
• CAMEO: uses the same parameters as in ref. 16.
Numerous kernel functions exist, including the radial basis function, which is commonly used for regression, and the Matern kernels, which are commonly used in classification tasks. These kernel functions have parameters (often called hyperparameters), and the GP algorithm typically uses maximum likelihood to identify the most likely values of these parameters given the data. Additionally, numerous likelihood functions exist for various data challenges based on the expected noise in the data. For the regression case of normally distributed noise, the likelihood may be a multivariate normal distribution. Specific likelihood functions exist for the challenges of binary and multi-class classification. For more information on GPs, their theory, implementation, and use, please see the excellent resource of ref. 4.
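A minimal GPflow sketch of these choices with placeholder data: a Matern 5/2 kernel paired with a multi-class likelihood, with hyperparameters fit by maximizing a likelihood-based objective.

```python
import gpflow
import numpy as np

# Placeholder 2D synthesis inputs and integer phase-region labels (3 classes).
X = np.random.rand(40, 2)
labels = np.random.randint(0, 3, size=(40, 1)).astype(float)

model = gpflow.models.VGP(
    (X, labels),
    kernel=gpflow.kernels.Matern52(),
    likelihood=gpflow.likelihoods.MultiClass(3),
    num_latent_gps=3,
)

# Fit kernel hyperparameters by maximizing the (variational) likelihood bound.
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)
```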
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00048j
‡ NIST disclaimer: certain commercial equipment, instruments, or materials are identified in this paper to foster understanding. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose.
This journal is © The Royal Society of Chemistry 2024 |