Johannes Zenn,*a Dominik Gond,b Fabian Jirasekb and Robert Bamlerc
aTübingen AI Center, University of Tübingen, IMPRS, 72076 Tübingen, Germany. E-mail: johannes.zenn@uni-tuebingen.de
bLaboratory of Engineering Thermodynamics (LTD), RPTU Kaiserslautern, 67663 Kaiserslautern, Germany
cTübingen AI Center, University of Tübingen, 72076 Tübingen, Germany
First published on 15th January 2025
Predicting the physico-chemical properties of pure substances and mixtures is a central task in thermodynamics. Established prediction methods range from fully physics-based ab initio calculations, which are only feasible for very simple systems, through descriptor-based methods that use some information on the molecules to be modeled together with fitted model parameters (e.g., quantitative-structure–property relationship methods or classical group contribution methods), to representation-learning methods, which may, in extreme cases, completely ignore molecular descriptors and extrapolate only from existing data on the property to be modeled (e.g., matrix completion methods). In this work, we propose a general method for combining molecular descriptors with representation learning using the so-called expectation maximization algorithm from the probabilistic machine-learning literature, which uses uncertainty estimates to trade off between the two approaches. The proposed hybrid model exploits chemical structure information using graph neural networks, but it automatically detects cases where structure-based predictions are unreliable, in which case it corrects them by representation-learning based predictions that can better specialize to unusual cases. The effectiveness of the proposed method is demonstrated using the prediction of activity coefficients in binary mixtures as an example. The results are compelling, as the method significantly improves predictive accuracy over the current state of the art, showcasing its potential to advance the prediction of physico-chemical properties in general.
On the one hand, descriptor-based methods correlate information on the molecules to be modeled with properties of interest. Among these, group-contribution methods, which use the composition of the components in terms of structural groups as molecular descriptors and whose underlying equations are usually derived from physical theories, have been highly successful and are still the gold standard for property prediction in many (industrial) fields.2,3 The most successful group-contribution method for predicting activity coefficients is UNIFAC,4–6 which is available in different versions and established in most process simulation software. Besides the physics-based group-contribution methods, other descriptor-based methods that rely on various descriptors, such as molecular weight, surface area, or boiling point, have also been proposed for predicting activity coefficients.7–14
In statistics parlance, such descriptor-based models are called parametric models as they fit model parameters that affect several related components (e.g., those containing a given structural group). Thus, parametric models can leverage statistical strength across chemically similar components.
On the other hand, recent idea transfer from the machine-learning community has led to an alternative approach to predicting activity coefficients and other mixture properties, which is based solely on learned representations without relying on descriptors. So-called matrix completion methods (MCMs)15–19 ignore the chemical structure of components and fit individual representation vectors for each mixture component that appears in a set of available experimental data. While this approach makes MCMs more flexible than descriptor-based methods, it prevents them from exploiting structural similarities across components, and from extrapolating to new components. In statistics parlance, one says that MCMs are “nonparametric in the components” (note that nonparametric models tend to have many parameters, similar to how a “stepless” controller has infinitely many steps).
It was shown empirically15,16,20 that purely nonparametric MCMs make more accurate predictions for activity coefficients compared to the descriptor-based (parametric) state-of-the-art UNIFAC. However, since each fitted parameter (i.e., each representation vector) in an MCM only describes a single component, MCMs can only make predictions for mixtures where each component appears in some (other) mixtures in the available experimental data (“in-domain predictions”). By contrast, descriptor-based methods can exploit the structural similarity of components to extrapolate to components that appear in no mixture in the available experimental data (“out-of-domain predictions”).
In this work, we propose a new method for predicting activity coefficients in binary mixtures that combines the strengths of both the parametric (descriptor-based) and the nonparametric (representation-based) approach while avoiding their respective weaknesses. To do this, we phrase both a parametric and a nonparametric model in a probabilistic framework, and we fit them jointly using the so-called variational expectation maximization (variational EM) algorithm. This algorithm finds an optimal compromise between the parametric and the nonparametric part of a model, taking into account how confident each part is in its fitted or predicted parameters, as discussed in the next section. Our evaluation shows that weighing off the respective confidences of the parametric and nonparametric models indeed improves the accuracy of both in-domain and out-of-domain predictions.
While this paper focuses on the concrete task of predicting activity coefficients of pure solutes at infinite dilution in pure solvents at room temperature, the proposed method can, e.g., be generalized to arbitrary temperatures and concentrations following the procedure described in Jirasek et al.,20 and to other thermodynamic properties of binary mixtures by fitting it to a corresponding dataset. More generally, we argue that the variational EM algorithm is a valuable tool in thermodynamic modeling since it allows for combining the strengths of descriptor-based (parametric) and representation-based (nonparametric) models, which is a powerful approach beyond the modeling of single thermodynamic properties of binary mixtures.
In the remaining sections of this paper, we first formalize the problem setup, present the proposed method, and discuss several variants of its concrete execution. We then empirically evaluate the accuracy of predicted activity coefficients and compare them to existing methods and simplified variants of our proposed method (ablation studies).
Fig. 1 Model and data flow for training the proposed model. (Left) graph neural networks take chemical structure information and output the parameters of conditional prior probability distributions (eqn (1)) over abstract representation vectors. (Right) the likelihood (eqn (2)) models how well given representation vectors explain experimentally measured activity coefficients γ∞i,j. We use variational EM (Sections 2.3 and 2.4) to fit the neural network weights θ (parametric, descriptor-based part), and to find variational distributions for each solute and solvent (nonparametric, representation-based part). |
A previous deep-learning-based approach14 addressed this prediction problem with a combination of three neural networks. The first two networks are so-called graph neural networks (GNNs) that take as input the molecular graph structures of the solute and solvent, respectively, i.e., the type of each atom, its hybridization and formal charge, and the type of bond between each pair of atoms. The GNNs map the molecular graphs to so-called abstract representation vectors u ∈ ℝK and v ∈ ℝK, respectively, where the dimension K is a modeling choice. The third neural network combines u and v and outputs a prediction for the activity coefficient of the respective solute at infinite dilution in the respective solvent. This existing approach can perform out-of-domain predictions because the neural networks can extrapolate to new molecular structures as long as they share some common substructures with the ones in the training data. However, this approach has the downside that it uses an entirely parametric model, i.e., it is limited by the expressiveness of the neural networks and cannot make any exceptions in case some anomalous components behave very differently than structurally similar components. Our proposed method, described below, accounts for exceptions with anomalous behavior in a nonparametric way.
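For illustration, the following is a minimal PyTorch sketch of the third, combining network in such a purely parametric pipeline. The layer sizes, activation functions, and the concatenation of u and v are illustrative assumptions, not the exact architecture of ref. 14.

```python
import torch
import torch.nn as nn

K = 32  # dimension of the abstract representation space (illustrative choice)

class CombinerMLP(nn.Module):
    """Maps a pair of representation vectors (u, v) to a prediction of ln(gamma_inf)."""
    def __init__(self, k: int = K, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * k, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # u, v: (batch, K) representation vectors of solute and solvent
        return self.net(torch.cat([u, v], dim=-1)).squeeze(-1)

# usage with random stand-ins; in the baseline, u and v would be produced by
# two graph neural networks applied to the solute and solvent molecular graphs
u, v = torch.randn(8, K), torch.randn(8, K)
print(CombinerMLP()(u, v).shape)  # torch.Size([8])
```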
The descriptor-based (parametric) part of our model uses two neural networks that map the descriptors ri of each solute i and sj of each solvent j to the parameters of conditional prior distributions over the representation vectors,

pθ(ui | ri) = 𝒩(ui; uμθ(ri), diag(uσθ2(ri)))  and  pθ(vj | sj) = 𝒩(vj; vμθ(sj), diag(vσθ2(sj))). (1)
The representation-based (nonparametric) part of our model is a probabilistic MCM.15 It represents each solute i and each solvent j that appears in the experimental data with an individual representation vector ui ∈ ℝK and vj ∈ ℝK, respectively, which it uses to predict the activity coefficients γ∞i,j. Since activity coefficients range over several orders of magnitude, we model their logarithm, ln γ∞i,j. We use a simple Gaussian likelihood,
p(ln γ∞i,j | ui, vj) = 𝒩(ln γ∞i,j; ui⊤vj, σ2). (2)
We propose to train the nonparametric and parametric parts of the model jointly using the so-called variational expectation maximization (variational EM)22,23 algorithm. Variational EM allows us to fit a model that can generalize across components with similar chemical structures while still being able to learn exceptions for individual components where the experimental data shows evidence for anomalous behavior.
The arrows in Fig. 1 show the direction of data flow in variational EM. The algorithm concurrently fits both the neural network weights θ of the conditional priors and a so-called variational distribution qϕ(ui) and qϕ(vj) for each solute i and each solvent j that appears in the experimental data. The weights θ of the conditional priors are fitted to model the data as well as one can with a parametric model. By contrast, the variational distributions are fitted in a nonparametric way. They are fitted to find a compromise between the conditional priors (which can share statistical strength across chemically similar components but cannot make exceptions for anomalous cases) and the experimental data (which may contain evidence for anomalous behaviors, but which is often scarce and generally affected by measurement errors).
In our setup, the necessary constraint on the prior comes from the finite expressiveness of the neural networks: unless the neural network for, e.g., solute representation vectors ui is exorbitantly large, it cannot output completely independent prior parameters (uμθ(ri), uσθ2(ri)) for all solutes i in the dataset. Thus, fitting the neural network weights θ cannot perfectly overfit the prior to the data. By contrast, the variational distributions qϕ(ui) and qϕ(vj) are fitted nonparametrically, i.e., with individual parameters for each solute and solvent. The reason why these do not perfectly overfit the data is that they are not fitted solely to the data but are instead obtained by (approximate) Bayesian inference with the (non-overfitting) prior, as explained in Section 2.4 below.
The positions of the ellipses in representation space are not directly interpretable, but their sizes indicate uncertainty estimates. For example, the prior predictions for methylsulfolane (left panel in Fig. 2) have low uncertainty (the turquoise ellipses are small). This is expected to happen for solvents (and equally for solutes) where the dataset contains structurally similar solvents that empirically behave similarly in mixtures, thus allowing the neural network to effectively and confidently interpolate between them. The low prior uncertainty causes variational EM to trust the prior predictions, and to fit a variational distribution qϕ(vj) (solid black ellipses) that closely follows the prior.
By contrast, the prior predictions for water (right panel in Fig. 2) have high uncertainty (the turquoise ellipses are large). This is expected to happen if the dataset contains solvents that are structurally similar to water but behave very differently in mixtures. Such anomalous cases prevent the neural network from interpolating effectively. However, as we discuss in Section 2.4 below, the neural network is at least fitted to detect such cases, and to reflect them by outputting a large uncertainty estimate vσθ2(s). As can be seen in the right panel of Fig. 2, the high prior uncertainty allows variational EM to fit the variational distribution (black ellipses) more freely, thus, in a sense, overriding the descriptor-based prior (mean) prediction vμθ(s) (turquoise cross) in this case. Note that the uncertainty of the (approximate) posterior (size of the black ellipses) for water is small despite the large prior uncertainty. This is expected since the data set contains a lot of experimental data where the solvent is water.
Together, the conditional priors (eqn (1)) and the likelihood (eqn (2)) define a joint probabilistic model over all representation vectors and all measured activity coefficients,

pθ(ln γ∞, u, v | r, s) = [∏i pθ(ui | ri)] [∏j pθ(vj | sj)] [∏(i,j)∈𝒟 p(ln γ∞i,j | ui, vj)]. (3)
A naive approach to training the neural networks would attempt to find the network weights θ that maximize the so-called marginal likelihood pθ(ln γ∞ | r, s), i.e., the probability density of predicting the experimentally measured logarithmic activity coefficients ln γ∞ for all binary systems where experimental data is available. Unfortunately, the marginal likelihood is not accessible in our model because obtaining it would require marginalizing eqn (3) over u and v,
pθ(ln γ∞ | r, s) = ∫∫ pθ(ln γ∞, u, v | r, s) du dv. (4)
Variational EM therefore maximizes a lower bound on the logarithm of the marginal likelihood, the so-called evidence lower bound (ELBO),

ELBO(θ, ϕ) ≤ ln pθ(ln γ∞ | r, s). (5)
To derive a valid expression for the ELBO, variational inference replaces the integral on the right-hand side of eqn (4) with a form of biased importance sampling.28 One first chooses a family of typically simple probability distributions qϕ(u, v) that are parameterized by ϕ and called variational distributions. For simplicity, we use the so-called Gaussian mean-field approximation, i.e., we choose a family of fully factorized normal distributions with
qϕ(u, v) = ∏i qϕ(ui) ∏j qϕ(vj),  where  qϕ(ui) = 𝒩(ui; uμϕ,i, diag(uσϕ,i2))  and  qϕ(vj) = 𝒩(vj; vμϕ,j, diag(vσϕ,j2)). (6)
The resulting ELBO reads

ELBO(θ, ϕ) = Eqϕ(u,v)[ln p(ln γ∞ | u, v)] − ∑i KL(qϕ(ui) ‖ pθ(ui | ri)) − ∑j KL(qϕ(vj) ‖ pθ(vj | sj)). (7)
Here, KL denotes the Kullback–Leibler divergence, which is available in closed form for the Gaussian distributions in eqn (1) and (6), e.g., for the solute terms,

KL(qϕ(ui) ‖ pθ(ui | ri)) = ∑k [ ([uμϕ,i]k − [uμθ(ri)]k)2 / (2[uσθ2(ri)]k) + [uσϕ,i2]k / (2[uσθ2(ri)]k) + (1/2) ln([uσθ2(ri)]k / [uσϕ,i2]k) − 1/2 ]. (8)
Maximizing the ELBO in eqn (7) over both θ and ϕ trades off between three objectives:
(i) maximizing the first term on the r.h.s. of eqn (7) over ϕ tries to fit variational distributions qϕ(ui) and qϕ(vj) such that samples from them explain the experimental data in 𝒟;
(ii) maximizing the last two terms in eqn (7) over ϕ (i.e., minimizing the KL-divergences over ϕ) regularizes the fits, i.e., it keeps the variational distributions qϕ(ui) and qϕ(vj) close to the conditional priors. Here, the first term on the r.h.s. of eqn (8) penalizes deviations between prior mean and variational mean more strongly for smaller prior variance uσθ2(ri). Thus, the (parametric) prior model has a stronger effect on the (nonparametrically fitted) variational distributions when it is confident in its prediction, as claimed in the discussion of Fig. 2;
(iii) minimizing the KL-divergences in eqn (7) also over θ fits the neural networks that define the conditional priors to the variational distributions, and thus indirectly to the data. This includes fitting the prior variances u/vσθ2(·) to model the aleatoric uncertainty observed in the data as well as any differences between the variational distributions of structurally similar components that cannot be resolved by the prior due to the finite expressiveness of the neural networks.
We maximize the ELBO over θ and ϕ with stochastic gradient descent, using reparameterization gradients29 for the first term on the right-hand side of eqn (7), and automatic differentiation provided by common software frameworks for machine learning.30 Algorithm 1 summarizes the training procedure. Our implementation is available online (see section “Data and software availability”). Training our largest model variant (GNN MCM, see below) took about four hours on a single GPU (Nvidia GeForce RTX 2080 Ti).
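The following self-contained PyTorch sketch illustrates one such ELBO-maximization loop in the spirit of the training procedure described above. The dot-product likelihood mean, the fixed likelihood variance σ2, and the free-floating stand-ins for the prior parameters (which in the actual model are produced by the neural networks from the molecular descriptors) are simplifying assumptions for brevity.

```python
import torch

K, n_solutes, n_solvents = 4, 10, 12
torch.manual_seed(0)

# toy data: indices (i, j) of measured systems and their ln(gamma_inf) values
idx_i = torch.randint(0, n_solutes, (50,))
idx_j = torch.randint(0, n_solvents, (50,))
ln_gamma = torch.randn(50)

# nonparametric part: variational parameters phi (one diagonal Gaussian per component)
q_mu_u = torch.zeros(n_solutes, K, requires_grad=True)
q_logvar_u = torch.zeros(n_solutes, K, requires_grad=True)
q_mu_v = torch.zeros(n_solvents, K, requires_grad=True)
q_logvar_v = torch.zeros(n_solvents, K, requires_grad=True)

# parametric part: stand-ins for the prior parameters; in the real model these
# are the outputs of the (graph) neural networks with weights theta
p_mu_u = torch.randn(n_solutes, K, requires_grad=True)
p_logvar_u = torch.zeros(n_solutes, K, requires_grad=True)
p_mu_v = torch.randn(n_solvents, K, requires_grad=True)
p_logvar_v = torch.zeros(n_solvents, K, requires_grad=True)

opt = torch.optim.Adam([q_mu_u, q_logvar_u, q_mu_v, q_logvar_v,
                        p_mu_u, p_logvar_u, p_mu_v, p_logvar_v], lr=1e-2)
sigma2 = 0.1  # likelihood variance (assumed fixed hyperparameter)

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between diagonal Gaussians (cf. eqn (8))."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    return 0.5 * ((mu_q - mu_p) ** 2 / var_p + var_q / var_p
                  + (logvar_p - logvar_q) - 1.0).sum()

for step in range(100):
    opt.zero_grad()
    # reparameterized samples from the variational distributions
    u = q_mu_u + (0.5 * q_logvar_u).exp() * torch.randn_like(q_mu_u)
    v = q_mu_v + (0.5 * q_logvar_v).exp() * torch.randn_like(q_mu_v)
    # Gaussian log-likelihood of the data (mean = dot product, an assumption)
    pred = (u[idx_i] * v[idx_j]).sum(-1)
    log_lik = -0.5 * ((ln_gamma - pred) ** 2 / sigma2).sum()
    # KL terms pull the variational distributions toward the conditional priors
    kl = (kl_diag_gauss(q_mu_u, q_logvar_u, p_mu_u, p_logvar_u)
          + kl_diag_gauss(q_mu_v, q_logvar_v, p_mu_v, p_logvar_v))
    loss = kl - log_lik  # negative ELBO (up to an additive constant)
    loss.backward()
    opt.step()
```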
At prediction time, we use point estimates ûi and v̂j of the representation vectors and combine them as in the likelihood model,

ln γ̂∞i,j = ûi⊤v̂j. (9)
Fig. 3 Data flow for a prediction where the solvent appears in the training set (in-domain) but the solute does not (out-of-domain). We thus predict the solute representation vector ûi from the prior, and the solvent representation vector v̂j from its fitted variational distribution.
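The prediction rule suggested by Fig. 3 can be sketched as follows; the dictionary-based lookup and the dot-product combination of the point estimates are illustrative assumptions consistent with the likelihood sketch above, not our exact implementation.

```python
import torch

def predict_ln_gamma(i, j, q_mu_u, q_mu_v, prior_mu_u, prior_mu_v):
    """Point prediction of ln(gamma_inf) for solute i in solvent j.

    q_mu_u / q_mu_v hold variational means for components seen during training;
    prior_mu_u / prior_mu_v hold the GNN prior means and cover all components.
    """
    u_hat = q_mu_u.get(i, prior_mu_u[i])  # in-domain -> variational mean, else prior mean
    v_hat = q_mu_v.get(j, prior_mu_v[j])
    return torch.dot(u_hat, v_hat)

# example: the solvent "water" is in-domain, the solute "new_solute" is not
q_mu_u, q_mu_v = {}, {"water": torch.ones(4)}
prior_mu_u, prior_mu_v = {"new_solute": torch.ones(4)}, {"water": torch.ones(4)}
print(predict_ln_gamma("new_solute", "water", q_mu_u, q_mu_v, prior_mu_u, prior_mu_v))
```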
In detail, the MoFo MCM uses two neural networks (one for solutes and one for solvents) that receive a fixed-size integer-valued vector as input. Each entry of the input vector corresponds to a given atom or bond type, and the values at these entries count the number of occurrences of the given atom or bond type in the molecule. Specifically, we use 16-dimensional input vectors for the 12 atom types O, Si, I, F, Br, P, H, S, Sn, N, C, and Cl present in the data set and the 4 bond types single, double, triple, and aromatic. The network outputs a 2K-dimensional vector that is the concatenation of the prior means u/vμθ(·) and variances u/vσθ2(·).
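A possible featurization along these lines can be computed with RDKit as sketched below; the ordering of atom and bond types and the use of explicit hydrogens are assumptions for illustration, not necessarily the exact featurization of our implementation.

```python
from rdkit import Chem

ATOMS = ["O", "Si", "I", "F", "Br", "P", "H", "S", "Sn", "N", "C", "Cl"]
BONDS = [Chem.BondType.SINGLE, Chem.BondType.DOUBLE,
         Chem.BondType.TRIPLE, Chem.BondType.AROMATIC]

def mofo_features(smiles: str) -> list:
    """16-dimensional count vector of atom and bond types (illustrative)."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))  # make hydrogens explicit so H is counted
    counts = [0] * (len(ATOMS) + len(BONDS))
    for atom in mol.GetAtoms():
        if atom.GetSymbol() in ATOMS:
            counts[ATOMS.index(atom.GetSymbol())] += 1
    for bond in mol.GetBonds():
        if bond.GetBondType() in BONDS:
            counts[len(ATOMS) + BONDS.index(bond.GetBondType())] += 1
    return counts

print(mofo_features("CCO"))  # ethanol: 1 O, 6 H, 2 C, and 8 single bonds
```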
The GNN MCM uses two graph neural networks, whose inputs are the molecular graphs of the solute and solvent, respectively. More specifically, we encode atoms and bonds from the same vocabulary as in the MoFo MCM with learnable real-valued vectors, which we use as initial node and edge features for the GNN. In general, message-passing GNNs operate on such graph-structured inputs by performing transformations of the node and edge features over multiple layers via a message-passing scheme.34 The output of a GNN is computed from all node features (and possibly edge features) at the last layer with a readout function. Generally, the message-passing scheme consists of a message step, an aggregation step, and an update step. In each layer, a message is computed for each directed edge utilizing a message function whose parameters are part of the learnable neural network parameters θ. Incoming messages are aggregated by a sum for each node. The update function produces new node features depending on the previous node features and the aggregated message, and its parameters are also part of θ.
Message, aggregation, and update steps are specific to the architecture of the GNN. In this work, we utilize the Feature-wise Linear Modulation (FiLM) GNN.35 The FiLM model uses the target features of a directed edge as input to a hyper-network that determines element-wise affine transformation parameters. Messages are computed by scaling and shifting the input features with the element-wise affine transformation parameters, where the input features result from multiplying a learnable matrix with previous features. The update function of a node sums over the aggregated messages for each edge type, where separate transformation parameters are computed for each edge type.
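A sketch of such a prior network, using the FiLMConv layer from PyTorch Geometric, is given below. Note that FiLMConv conditions on discrete bond (relation) types rather than on learnable edge-feature vectors, and all layer sizes are illustrative; this is not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import FiLMConv, global_add_pool

class PriorGNN(nn.Module):
    """Maps a molecular graph to mean and log-variance of a conditional prior (cf. eqn (1))."""
    def __init__(self, num_atom_types=12, num_bond_types=4, hidden=64, k=32):
        super().__init__()
        self.atom_emb = nn.Embedding(num_atom_types, hidden)  # learnable atom-type embeddings
        self.conv1 = FiLMConv(hidden, hidden, num_relations=num_bond_types)
        self.conv2 = FiLMConv(hidden, hidden, num_relations=num_bond_types)
        self.readout = nn.Linear(hidden, 2 * k)  # concatenated mean and log-variance

    def forward(self, atom_type, edge_index, edge_type, batch):
        x = self.atom_emb(atom_type)
        x = self.conv1(x, edge_index, edge_type)
        x = self.conv2(x, edge_index, edge_type)
        g = global_add_pool(x, batch)  # sum readout over all nodes of each graph
        mean, logvar = self.readout(g).chunk(2, dim=-1)
        return mean, logvar  # prior variance = logvar.exp()
```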
For the second ablation study, we simplify the model by removing its nonparametric part, and we train it by simple maximum likelihood estimation (MLE) rather than variational EM. Thus, in this ablation, the neural networks only output means uμθ(ri) and vμθ(sj) and no variances, and we use these means directly as representation vectors ui and vj, respectively, in the likelihood (eqn (2)), which our training objective maximizes over the neural network parameters θ (similar to Medina et al.14). Since this simplified model has no variational distributions, predictions are again done as if all mixture components were out of domain.
We implement our models in PyTorch30 and use PyTorch Geometric37 for the GNN. All models are trained for 15000 epochs using the Adam38 optimizer. In the MoFo (MLE) ablation study, we use early stopping,39,40 i.e., we compute validation errors every 10 epochs and use the model with the lowest validation mean squared error (MSE) to evaluate on the test set. This is done to be as favorable as possible to the ablation study, and because MLE training is more prone to overfitting than variational EM. When training with variational EM, we do not use the validation set.
To find well-performing hyperparameters (e.g., the learning rate schedule and the dimension K of the abstract representation space), we utilize a sparse random grid search. See ESI† for more information on this process and the hyperparameters that we search over. We choose the best model of the grid search according to its MSE on a predefined dataset split that is the same for all models and different from any other split. In order to fairly compare all models, we exclude the test data of the predefined dataset (that is used to determine the hyperparameters) from the evaluation.
Medina et al.14 use a different dataset and train an ensemble of 30 models for prediction where each model has been trained on randomized train/validation splits. For a fair comparison against this baseline, we train a separate GNN MCM for each of these train/validation splits, using again a sparse random grid search for hyperparameter tuning.
Table 1 Comparison of the proposed GNN MCM to the baseline by Medina et al.14

Model | MAE | MSE
---|---|---
GNN MCM | 0.1542 ± 0.0046 | 0.0905 ± 0.0071
Medina et al.14 | 0.1973 ± 0.0067 | 0.1196 ± 0.0074
The comparison to the fully nonparametric MCM15 (second row in Fig. 4) is only possible for in-domain predictions as this baseline cannot perform out-of-domain predictions. Here, the GNN MCM (fifth row) approximately halves MAE, and it reduces MSE (which is more sensitive to outliers) even more significantly.
The published evaluation results in Medina et al.14 do not distinguish between in-domain and out-of-domain predictions, effectively averaging over both. Using the same training data and evaluation setup, our proposed GNN MCM significantly reduces both MAE and MSE (Table 1).
The fourth and seventh rows in Fig. 4 show prediction errors of the MoFo MCM variant of our model, whose conditional priors condition only on the molecular formula of mixture components but not on their chemical structures. We find that this model variant performs worse than the GNN MCM model on both in-domain and out-of-domain prediction tasks. For in-domain predictions, we can compare again to the fully nonparametric MCM (second row in Fig. 4), which does not exploit any chemical information about the mixture components. We find, as expected, that performance improves with increasing granularity of exploited chemical information: MoFo MCM performs better than the fully nonparametric MCM but worse than GNN MCM.
The results for in-domain and out-of-domain predictions are labeled “MoFo MCM (MLE)” in Fig. 4. For in-domain predictions, we observe that models trained with MLE perform significantly worse than models trained with variational EM, but better than if we train with variational EM and then discard the nonparametric part (see Ablation 1 above). For out-of-domain predictions, the picture is less clear. Here, models trained with MLE perform slightly better than their variational EM counterparts, in particular in terms of MSE, which penalizes outliers more strongly. A possible explanation is that variational EM allows the parametric prior models to effectively ignore any mixture components that can be better modeled in a nonparametric way. This makes the priors in variational EM less regularized, so they are more susceptible to overfitting to the training data, which can result in worse generalization to unseen mixture components in out-of-domain predictions. However, this slight improvement of MoFo MCM (MLE) over MoFo MCM on out-of-domain predictions comes at the cost of significantly reduced performance on in-domain predictions, where the lack of a nonparametric model part prevents MoFo MCM (MLE) from specializing to anomalous components. Further, the proposed GNN MCM model improves performance over both MoFo MCM and MoFo MCM (MLE) significantly on both in-domain and out-of-domain predictions.
Fig. 5 Improvement (in terms of mean absolute error, MAE) of the proposed GNN MCM method over UNIFAC and MCM, grouped by chemical category of the solute (see Table 6 in the ESI† for a definition of the categories). Our method consistently improves over both UNIFAC and MCM across almost all solute categories; we see regressions (negative improvements) only on three categories with poor statistics in this evaluation due to small sample sizes (see gray numbers).
Fig. 6 Improvement of the proposed GNN MCM over UNIFAC and MCM, by chemical category of the solvent (see Table 6 in the ESI† for a definition of the categories). Our method consistently improves over both UNIFAC and MCM across all solvent categories. |
We observe that the improvements of our proposed GNN MCM are systematic across almost all categories of solutes and solvents. The only regressions occur within the solute categories “S_POL” (strongly polar sulfurous compounds), “S_NPOL” (weakly polar sulfurous compounds), and “FF” (perfluorinated compounds). The results in these three categories should be taken with a grain of salt as the dataset contains only very few mixtures that involve a solute from one of these categories (gray numbers in Fig. 5). Thus, within these categories, the MAE averages only over 6, 4, or 2 values, respectively, making it highly susceptible to outliers.
For each solute i, we set n(i) to the number of times that i appears as a solute in the data set 𝒟. The inverse of this function, i(n), maps each possible solute count n to the set of all solutes in the data set that appear n times in 𝒟. For each n where i(n) is not empty, we now calculate the average prediction error for all binary mixtures in the data set whose solute is in i(n). We proceed analogously for the solvents.
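This grouping can be expressed in a few lines of Python; the data format, a list of (solute, solvent, error) triples, is a hypothetical stand-in for our evaluation output.

```python
from collections import defaultdict
import statistics

def error_by_solute_frequency(data):
    """Average prediction error, grouped by how often each solute occurs in the data."""
    n = defaultdict(int)        # n(i): number of systems in which solute i appears
    for solute, _, _ in data:
        n[solute] += 1
    errors = defaultdict(list)  # group errors by the solute frequency n
    for solute, _, err in data:
        errors[n[solute]].append(err)
    return {freq: statistics.mean(errs) for freq, errs in sorted(errors.items())}

example = [("water", "ethanol", 0.1), ("water", "hexane", 0.3), ("benzene", "hexane", 0.2)]
print(error_by_solute_frequency(example))  # average error for solute frequencies 1 and 2
```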
Fig. 7 shows the improvement of our proposed GNN MCM over both UNIFAC (olive) and MCM (terracotta) as a function of the frequency n of solutes (top) and solvents (bottom). We observe that GNN MCM improves over both UNIFAC and MCM consistently across all solute and solvent frequencies n (i.e., almost all points in the plots lie above the dashed zero line). The improvement over MCM (terracotta lines) is most pronounced in the regime of small n, i.e., where few data points with the same solute or solvent exist. This is to be expected since this is the regime where a fully nonparametric model like MCM tends to perform poorly. The strong improvement for the solvent with the highest n (right end of the lower plot) can be explained by the fact that this solvent is water, for which a lot of data points with infrequent solutes exist in the dataset.
In summary, we find that our proposed method significantly improves predictive accuracy over both fully parametric and fully nonparametric baselines (Fig. 4 and Table 1), and that this improvement is consistent across mixture components from different chemical categories (Fig. 5 and 6) and across varying amounts of training data for involved components (Fig. 7).
Our ablation studies identify the variational EM algorithm to be crucial for the success of the prediction method. We think that variational EM can be a useful tool for many physico-chemical prediction problems since it balances structure-based and representation-learning based predictions by weighing off their respective uncertainties.
Future work should explore the application of our method to other properties like diffusion coefficients or even fundamental quantities like interaction energies, which are at the core of established physical models of mixtures and based on which diverse mixture properties can be described. In a broader context, our work provides additional evidence for the efficacy of graph neural networks for processing chemical structure information. It would be interesting to study whether activations of hidden layers of the graph neural networks can be made interpretable to human domain experts, whether correlations between the hidden activations of different atoms can be used to identify relevant substructures of molecules, and whether such substructures correspond to the structural groups considered in established group contribution methods.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00154k |
This journal is © The Royal Society of Chemistry 2025 |