Fabian
Jirasek
*a,
Robert
Bamler
b,
Sophie
Fellenz
c,
Michael
Bortz
d,
Marius
Kloft
c,
Stephan
Mandt‡
e and
Hans
Hasse‡
a
aLaboratory of Engineering Thermodynamics (LTD), TU Kaiserslautern, Erwin-Schrödinger-Str. 44, 67663 Kaiserslautern, Germany. E-mail: fabian.jirasek@mv.uni-kl.de
bData Science and Machine Learning, University of Tübingen, Maria-von-Linden-Str. 6, 72076 Tübingen, Germany
cMachine Learning Group, TU Kaiserslautern, Paul-Ehrlich-Str. 34, 67663 Kaiserslautern, Germany
dFraunhofer Institute for Industrial Mathematics (ITWM), Fraunhofer-Platz 1, 67663 Kaiserslautern, Germany
eDepartment of Computer Science, University of California, Irvine, Donald Bren Hall, 92617 Irvine, CA, USA
First published on 4th April 2022
Predictive models of thermodynamic properties of mixtures are paramount in chemical engineering and chemistry. Classical thermodynamic models are successful in generalizing over (continuous) conditions like temperature and concentration. On the other hand, matrix completion methods (MCMs) from machine learning successfully generalize over (discrete) binary systems; these MCMs can make predictions without any data for a given binary system by implicitly learning commonalities across systems. In the present work, we combine the strengths from both worlds in a hybrid approach. The underlying idea is to predict the pair-interaction energies, as they are used in basically all physical models of liquid mixtures, by an MCM. As an example, we embed an MCM into UNIQUAC, a widely-used physical model for the Gibbs excess energy. We train the resulting hybrid model in a Bayesian machine-learning framework on experimental data for activity coefficients in binary systems of 1146 components from the Dortmund Data Bank. We thereby obtain, for the first time, a complete set of UNIQUAC parameters for all binary systems of these components, which allows us to predict, in principle, activity coefficients at arbitrary temperature and composition for any combination of these components, not only for binary but also for multicomponent systems. The hybrid model even outperforms the best available physical model for predicting activity coefficients, the modified UNIFAC (Dortmund) model.
The present work is focused on thermodynamic properties of fluid mixtures. Physical models for describing these properties are generally based on the concept of pair interactions, which are commonly described by pair-interaction energies. All types of pair interactions in a multicomponent mixture can be investigated by studying the pure components (for the like interactions) and the binary subsystems (for the unlike interactions). Therefore, physical models for thermodynamic properties of fluid mixtures are generally developed using data on pure components and binary mixtures; based on this, they can also be used to predict properties of multicomponent mixtures. In the application of models of thermodynamic properties, their predictive capabilities are of prime importance. To assess them, the relation of the data used in the model development (the training data) to the data to be predicted is important. Regarding thermodynamic properties of mixtures, we distinguish two categories of predictions here:
(i) Those, in which for a given system (a fixed set of components) only the conditions are changed compared to the training data (e.g., temperature, pressure, or concentration of the components); we refer to this as generalization over conditions.
(ii) Those, in which the system itself is changed, i.e., was not included in the training data; we refer to this as generalization over systems.
While (i) involves continuous variables, (ii) is discrete. In the nomenclature of the present paper, we follow the common usage, which says that ‘mixture’ can have two meanings: in the first, it simply designates the combination of specific components (regardless of their concentrations, e.g., ‘water + ethanol’), in the second, also the concentrations are included in the term (e.g., ‘water + ethanol with xwater = 0.1 mol/mol’). In cases where this ambiguity can lead to misunderstandings, we use the term ‘system’ instead of ‘mixture’ when we refer to the first case.
The most common types of thermodynamic models of mixtures are models of the Gibbs excess energy GE, such as UNIQUAC2,3 or NRTL,4 and equations of state (EoS).5 These models excel in generalizing over conditions. By virtue of the underlying thermodynamic theory, they can also be used for generalizing over properties, i.e., if trained on data for a specific property, they can be used to predict a different but related property; this is an important issue, which, however, is not in the focus of the present work.
In their basic form, GE models and EoS are only partially useful for generalizing over systems: for the reasons given above, they are usually trained on data for binary systems, but can then be used for modeling those binary systems only for which data were available during the training. In contrast, predictions of properties of unstudied multicomponent systems with GE models and EoS often turn out to be sufficiently accurate, under the condition that the model was trained on data for all constituent binary subsystems.6
To overcome the lack of generalization over (binary) systems, group-contribution (GC) approaches like UNIFAC,7–9 modified UNIFAC (Dortmund),10,11 and the Predictive Soave–Redlich–Kwong (PSRK) EoS12,13 have been developed. They are based on the idea that components can be split into structural groups and, instead of considering pair interactions between components, pair interactions between these groups are modeled. As the number of relevant structural groups is comparatively small (in the order of 100), the combinatorial problem described above thereby becomes tractable. GC methods contain group-interaction parameters that are usually trained on data for binary systems containing the pertinent groups. They can then be used for predicting the properties of systems for which no data are available.
Unfortunately, the applicability of these GC methods is still hampered by incomplete group-interaction parameter tables. This is due to both the elaborate procedure of fitting new group-interaction parameters, and the fact that often not enough pertinent experimental data are available for a meaningful fit. The problems resulting from an inadequate database also lead to poor performances of GC methods when applied to systems with components that contain groups for which only a few training data points are available. The most prominent alternatives to GC methods are quantum-chemical approaches, namely COSMO-RS.14,15 In principle, COSMO-RS enables predictions for any system based on quantum-chemical calculations, which are, however, computationally costly and not trivial for complex components. Furthermore, also these methods are tuned on experimental data, but the number of parameters is typically low and common users do not change them.16 Compared to GC methods, quantum-chemical methods are often found to be somewhat less accurate in the prediction of thermodynamic properties.17
We have recently introduced a completely new approach for predicting thermodynamic properties of unstudied binary systems.18–21 This approach is based on employing matrix completion methods (MCMs) from machine learning (ML), where the MCMs are prominently associated with recommender systems.22,23E.g., for movie recommendations, these methods can implicitly learn and quantify similarities among users and similarities among movies by observing interaction patterns (ratings or clicks) between them, allowing to predict preference scores for unseen pairs of users and movies. In a similar spirit, our previous work18,19 employed MCMs for predicting activity coefficients γ∞ij of solutes i in pure solvents j at infinite dilution and constant temperature, where the MCM learns similarities among the solutes and among the solvents. Applying the MCM approach for predicting thermodynamic properties is based on the fact that properties of binary mixtures at constant conditions, such as isothermal γ∞ij, can be stored in matrices, where the rows and columns represent the components i and j that make up the mixtures. As these matrices are only sparsely occupied by experimental data, the prediction of the unobserved entries constitutes a matrix completion problem. In our previous work,18,19 we employed a matrix factorization following eqn (1):
γ∞ij = f(θi·βj) | (1) |
The basic idea behind this MCM approach is that information on the pure components (as quantified by θi and βj) is implicitly contained in the mixture data, i.e., in the way the components interact with each other, which manifests itself in the observable mixture property γ∞ij. This information is captured by the MCM and stored in the latent features θi and βj of the components i and j. Using eqn (1) in the trained model, the inferred latent features allow predicting γ∞ij also for previously unstudied combinations i–j. As demonstrated in our previous work,18,19 this approach even outperforms the present state-of-the-art method for predicting activity coefficients, namely the modified UNIFAC (Dortmund) model,10,11 if γ∞ij at 298 K are considered. We have also extended the approach to modeling the dependence of γ∞ij on the temperature T, by simply exploiting the fact that this dependence can be well described by γ∞ij(T) = Aij + Bij/T with system-specific but temperature-independent parameters Aij and Bij in many cases.20
In the present work, we further expand the MCM approach by combining it with the physical modeling of thermodynamic properties of mixtures. We thereby exploit the fact that all physical models of mixtures are based on the idea of pair interactions, described by pair-interaction energies as critical parameters. We propose to predict these pair-interaction energies by an MCM to obtain a new hybrid concept for the prediction of properties of mixtures. The feasibility and the merits of this widely applicable approach are demonstrated by using the well-known GE lattice model UNIQUAC2,3 as an example.
En passant, we come back to the derivation of the UNIQUAC equation, in which two asymmetric pair-interaction parameters ΔUij ≠ ΔUji were introduced instead of using one symmetric pair-interaction energy Uij = Uji as it follows from the lattice theory. The two adjustable parameters ΔUij and ΔUji were introduced as a workaround to get more flexibility for fitting binary phase equilibrium data. We show that this workaround is not necessary and base our considerations on the physical symmetric interaction energies Uij = Uji, explicitly including those for the like interactions Uii.
The resulting new model, which we call ‘MCM-UNIQUAC’ in the following, combines the capabilities of the MCM regarding the generalization over binary systems with those of the physical model UNIQUAC regarding the generalization over conditions. It is a hybrid method that takes advantage of the strengths of both worlds; in particular, it also enables predictions of properties of multicomponent systems by virtue of the physics behind UNIQUAC, a feature a data-driven MCM cannot provide.
lnγij(T,xi) = fUNIQUAC(T,xi,Pi,Pj,Uii,Ujj,Uij) | (2) |
Fig. 1 Illustration of embedding an MCM into a physical model of mixtures (here: the lattice model UNIQUAC). Blue part: application of an MCM to pair-interaction energies Uij. Red part: the physical model relates Uij to temperature- and concentration-dependent activity coefficients lnγij. Yellow part: lnγij are directly related to observable mixture properties (e.g., vapor–liquid equilibria (VLE),24 liquid–liquid equilibria (LLE),25 solid–liquid equilibria (SLE)26) by thermodynamic laws. |
Furthermore, it follows from the derivation of UNIQUAC that all model parameters are independent of temperature and concentration. We use this assumption throughout the present work but note that some authors work with temperature-dependent parameters. Further, as already mentioned in the introduction, it is common in the literature not to use the symmetric pair-interaction energies Uij = Uji as parameters, but rather the two parameters ΔUij ≠ ΔUji, which are calculated by:
ΔUij = Uij − Ujj; ΔUji = Uij − Uii | (3) |
UNIQUAC allows generalizing (i.e., interpolating and extrapolating) over temperatures and concentrations, but not over mixture components. Applying UNIQUAC to a mixture for which no parameters are available requires at least some mixture data points for determining the pair-interaction parameters. This is a severe restriction, as experimental data are often unavailable, especially for mixtures. To overcome this limitation, we introduce MCM-UNIQUAC to extend UNIQUAC so that it can generalize over temperatures and concentrations and over binary systems.
The proposed generalization over binary systems is achieved by an MCM and illustrated in the central (blue) panel in Fig. 1. We thereby model the unlike pair-interaction energies Uij as a sum of dot products:
Uij = θiβj + θjβi, i ≠ j | (4) |
MCM-UNIQUAC was trained end-to-end on a set of measured logarithmic temperature- and concentration-dependent activity coefficients in binary mixtures lnγij (red panel of Fig. 1). We used data from the DDB27 here; in specific, we used lnγij derived from binary vapor–liquid equilibrium (VLE) data using the extended Raoult's law, cf. eqn (S.14) in the ESI,† which we augmented with temperature-dependent data on binary activity coefficients at infinite dilution lnγ∞ij. In total, we obtained a set of 363181 experimental data points for lnγij for 12199 different binary systems i–j involving 1146 distinct components i, j at varying concentrations and temperatures ranging from 183 K to 638 K. The considered N = 1146 components result in N(N − 1)/2 = 656085 possible different binary systems. Experimental data are only available for 12199 of these systems, i.e., data are available for less than 2% of all systems and, consequently, only less than 2% of these systems can be modeled with UNIQUAC in the conventional way. More details on the data set are given in the ESI.†
The systems for which data are available were divided into three sets: 80% were used for training the model (training set), 10% were used for setting the model's hyperparameters (validation set), and 10% were used for testing the predictions (test set). We trained our model using the probabilistic programming language Stan28 and resorted to Variational Inference29–31 for performing approximate Bayesian inference. Details on the random data split, the model training (including the source code to run the model in Stan), and the hyperparameter selection are given in the ESI.†
After the training, MCM-UNIQUAC can predict temperature- and concentration-dependent activity coefficients in any binary and multicomponent mixture of the considered 1146 components. The activity coefficients can, in turn, be used for predicting observable mixture properties, such as VLE or other phase equilibria (yellow panel in Fig. 1). We demonstrate the predictive capacity of MCM-UNIQUAC by considering the data from the test set in the following section.
Furthermore, as baselines, the scores obtained for the different sets by directly fitting UNIQUAC parameters to all available data points are marked as lines in Fig. 2. We thereby considered two variants: first, UNIQUAC was used in the usual manner by fitting the two binary parameters ΔUij and ΔUjiindividually to the data for each system i–j; this procedure is denoted as ‘UNIQUAC (ΔU)’ in the following and the respective results are shown as dotted lines in Fig. 2. Second, we fitted the symmetric pair-interaction energies Uij = Uji to all data points; this procedure is denoted as ‘UNIQUAC (U)’ in the following and the respective results are shown as dashed lines in Fig. 2. Note that considering the full set of 12199 binary systems for which experimental data are available, there are 24398 parameters in UNIQUAC (ΔU), whereas there are only 13345 parameters in UNIQUAC (U), of which 1146 are pure-component parameters Uii.
Let us consider the scores of the baselines UNIQUAC (ΔU) and UNIQUAC (U) first. They simply indicate how well the data can, in principle, be described with the physical model, i.e., UNIQUAC here. The first astonishing message is that UNIQUAC (U) works almost as well as UNIQUAC (ΔU), even though the latter has almost twice as many parameters, cf. above. We have therefore based our hybrid model MCM-UNIQUAC on the more physical UNIQUAC (U).
Second, we observe in Fig. 2 that MCM-UNIQUAC is in general flexible enough for describing the data well, cf. the relatively small differences between the scores of MCM-UNIQUAC on the training set (blue bars) and the baseline scores (lines).
Third, by comparing the scores for the test set (red bars), we find that MCM-UNIQUAC clearly outperforms UNIFAC in accurately predicting lnγij in both MAE and MSE (cf. Fig. S.2 in the ESI†). This is particularly remarkable since UNIFAC has been fitted to most of these data points, whereas no data point from the test set was used for training or validation of MCM-UNIQUAC. An alternative representation of the results for the test set is shown in histogram plots in Fig. S.3 and S.4 in the ESI,† which underpin the superior performance of MCM-UNIQUAC compared to UNIFAC. Similar results are found by considering COSMO-SAC-dsp32,33 as baseline, as shown in Fig. S.5 in the ESI.† Additionally, we discuss the influence of the number of training data points on the performance of MCM-UNIQUAC in Fig. S.6 and Table S.1 in the ESI.†
We note that also a version of MCM-UNIQUAC based on UNIQUAC (ΔU), i.e., based on the asymmetric pair-interaction parameters, can be trained if and only if an end-to-end training on lnγij is performed; the resulting performance is, however, slightly worse than that of the model discussed above as we demonstrate in Fig. S.7 in the ESI.†
MCM-UNIQUAC can not only be used for predicting activity coefficients, but also for predictions of phase equilibria and many other thermodynamic properties of any mixture consisting of the considered 1146 components. We demonstrate this in Fig. 3, which shows the results of the prediction of isobaric vapor–liquid equilibrium (VLE) phase diagrams for eight binary systems based on the extended Raoult's law, cf. eqn (S.14) in the ESI.† All eight systems were chosen randomly from the test set, i.e., not a single data point for these systems was used for training MCM-UNIQUAC or setting its hyperparameters. However, the selection was carried out in such a way as to cover a wide range of different phase behaviors, ranging from high-boiling azeotrope (top left) to heteroazeotrope (bottom right). For details on the selection of the systems and the prediction of the VLE phase diagrams with MCM-UNIQUAC, we refer to the ESI.†
Fig. 3 Prediction of isobaric vapor–liquid phase diagrams for binary systems from the test set with MCM-UNIQUAC (lines) and comparison to experimental data from the DDB27 (symbols). No data on any of the depicted systems were used for training MCM-UNIQUAC or setting the hyperparameters. Blue: dew point curves. Red: bubble point curves. |
For all eight binary systems, excellent agreement between the predicted phase diagram and the experimental data is found, both qualitatively and quantitatively. Similarly, liquid–liquid and solid–liquid phase diagrams and other thermodynamic properties like excess enthalpies can be predicted with MCM-UNIQUAC. Moreover, the physical foundation on which the hybrid approach MCM-UNIQUAC builds, namely learning and predicting pair-interaction energies between components on the hypothetical lattice from data for binary mixtures, even allows extrapolations from binary to multicomponent mixtures; MCM-UNIQUAC thereby does not require any data of multicomponent systems for training. As an example, isobaric VLE phase diagrams for two ternary systems are shown in Fig. 4. These systems were selected such that all constituent binary subsystems were neither part of the training set nor of the validation set.
Fig. 4 Prediction of the VLE in ternary systems at constant pressure with MCM-UNIQUAC and comparison to experimental data (exp.) from the DDB.27 The pressure and the composition of the liquid phase were specified, the composition of the corresponding vapor phase was predicted (pred.). Top: acrylic acid (AcrAc) + acetic acid (AceAc) + tetrachloromethane (TCM) at 100 kPa. Bottom: toluene (Tol) + isopropylbenzene (IPB) + α-methyl styrene (α-MS) at 101 kPa. No data on any of the systems and any of the binary subsystems were part of the training set or the validation set of MCM-UNIQUAC. |
For each data point, the pressure and the composition of the liquid phase (blue symbols in Fig. 4) were specified. The associated composition of the vapor phase in equilibrium was predicted with MCM-UNIQUAC (open red symbols) and compared to the experimental composition of the vapor phase as reported in the DDB27 (filled red symbols). We observe an excellent agreement.
In the ESI,† we additionally provide a complete set of the pair-interaction energies Uij, including the respective model uncertainties in the form of standard deviations, for all 656085 binary systems of the components considered here (the pair-interaction parameters ΔUij of the commonly used version of UNIQUAC can easily be calculated from these by applying eqn (3)). This set of Uij was obtained from an end-to-end training of MCM-UNIQUAC on the entire available data set of all 12199 systems. If predictions for any system of the components considered in this work are required, this parameter set should be used.
We trained our hybrid approach, MCM-UNIQUAC, on experimental data on activity coefficients in binary mixtures of 1146 components from the Dortmund Data Bank (DDB). Out of the possible 656085 binary systems that can be formed from these components, suitable experimental data were only available for 12119 systems, corresponding to only 2%. In its basic form, UNIQUAC can only be applied to this subset, as pair-interaction parameters need to be fitted to data of the respective systems. In contrast, MCM-UNIQUAC yields, in principle, predictions for all 656085 binary systems; we demonstrate the predictive capacity of the model based on the systems for which experimental data are available by withholding the test data points during the training of the model. Moreover, by virtue of being a model based on pair interactions, MCM-UNIQUAC can be applied not only to binary systems but also to any multicomponent system that can be formed from the considered 1146 components; and it also yields predictions for any composition and temperature.
We compared the quality of these predictions to those from the best available physical model for this purpose, the group-contribution model modified UNIFAC (Dortmund). However, due to missing group-interaction parameters, the public version of UNIFAC can at present only be applied to 9502 of the 12119 binary mixtures for which data were available. We show that MCM-UNIQUAC outperforms UNIFAC, even on a test set whose data were not used for training MCM-UNIQUAC, whereas most of these data were probably used for training UNIFAC.
In its commonly used version, UNIQUAC has two binary parameters ΔUij ≠ ΔUji (with i ≠ j). It is known that they are highly correlated and hard to interpret physically. In this work, we have discovered that going back to the basic idea of the lattice model and working directly with symmetric pair-interaction energies Uij = Uij, i.e., with only a single binary parameter for each binary system, gives almost the same quality of the description of the phase equilibria in the studied binary systems as working with the common version of UNIQUAC with two binary parameters, which is highly remarkable. It can be assumed that the pair-interaction energies Uij, which are now available for 656085 binary systems and 1146 pure components, can be interpreted physically. As they are basically mean-field energies, this is probably only a first step, and deeper insights may be expected from applying the MCM in connection with more sophisticated mixture models that account for different types of interactions.
MCM-UNIQUAC combines the strengths of ML and physical modeling – the MCM enables generalization over (discrete) components, allowing predictions for unstudied binary systems, whereas UNIQUAC generalizes over (continuous) conditions and enables the extrapolation to multicomponent mixtures. As demonstrated in examples, MCM-UNIQUAC allows the direct prediction of phase diagrams as exemplified for binary and ternary vapor–liquid phase diagrams. Furthermore, our approach can be retrained easily whenever additional data points become available; this is in contrast to established group-contribution methods, which require extensive parameter tuning. Also, MCM-UNIQUAC does not rely on expensive calculations as required by established quantum-chemical prediction methods. Finally, by exploiting the fact that basically all established thermodynamic models for mixture properties rely on the description of pair interactions, our approach can be further developed in many directions, e.g., by combining MCMs with equations of state and even group-contribution methods.
Footnotes |
† Electronic supplementary information (ESI) available: UNIQUAC equations; information on data set and data preprocessing; model details incl. source codes to run the model in Stan; additional results; complete UNIQUAC parameter set. See https://doi.org/10.1039/d1sc07210b |
‡ These authors jointly supervised this work. |
§ For a set of N components, there are N(N − 1)/2 binary systems and, following the common usage, N(N − 1) binary UNIQUAC parameters ΔUij. Using the pair-interaction energies Uij directly as parameters, there are N(N + 1)/2 parameters, of which N are pure-component parameters. Hereby N(N − 1) > N(N + 1)/2 for N ≥ 3. |
This journal is © The Royal Society of Chemistry 2022 |