Edgar Ivan Sanchez Medina,a Steffen Linke,a Martin Stollb and Kai Sundmacher*ac
aChair for Process Systems Engineering, Otto-von-Guericke University, Universitätsplatz 2, Magdeburg, 39106, Germany
bChair of Scientific Computing, Department of Mathematics, Technische Universität Chemnitz, 09107 Chemnitz, Germany
cProcess Systems Engineering, Max Planck Institute for Dynamics of Complex Technical Systems, Sandtorstraße 1, Magdeburg, 39106, Germany. E-mail: sundmacher@mpi-magdeburg.mpg.de
First published on 11th February 2022
The use of predictive methods for physicochemical properties is of special interest given the difficulties involved in the experimental characterization of large chemical spaces. In this work, we focus on the prediction of infinite dilution activity coefficients γij∞ of organic systems using graph neural networks (GNNs). Our proposed method uses one GNN that extracts the relevant solvent information and a second GNN that does the same for the solute. The vectorial representations of these chemical species are then combined into a binary-system fingerprint, which is used as the input to a supervised learning framework. We compare our approach to the 8 most commonly employed phenomenological/mechanistic methods for predicting γij∞. Our method predicts γij∞ with performance competitive with the state-of-the-art mechanistic methods, achieving a lower mean absolute error (MAE) than the broadly used COSMO-RS and UNIFAC-Dortmund methods. We also present a series of parallel residual hybrid models that combine the mechanistic and GNN-based approaches. Overall, these hybrid models improve on the performance of the individual model instances.
One thermodynamic property of special interest is the so-called activity coefficient γ, which measures the degree of nonideality of a substance in a liquid mixture due to intermolecular and intramolecular forces. Naturally, γ depends on the temperature and composition of the mixture (the pressure dependence can be neglected in most cases), and for each temperature the limiting case occurs when a solute i is infinitely diluted in a solvent j. The activity coefficient at these conditions is denoted as γij∞. Its importance for chemical processes is very broad: separation processes achieving high product purities approach infinite dilution conditions. Furthermore, in distillation, γij∞ is used to predict whether an azeotrope is expected to occur.1,2 In liquid–liquid extraction, the selection of a suitable solvent is often based on γij∞ values.3,4 Moreover, γij∞ plays a major role in safety and environmental protection studies.4,5
Many models exist for the prediction of γij∞. We can divide these models into those based on mechanistic or phenomenological knowledge6–13 (referred to as mechanistic models in this work) and those constructed mainly using machine learning techniques.14–21 So far, the use of mechanistic models is more common than that of machine learning methods, since mechanistic models have been developed, studied and tested for decades, as opposed to the machine learning approaches. However, many deviations from reality are still present in mechanistic models.22 For instance, for the system with methylcyclohexane–toluene as solute and acetophenone as solvent, deviations larger than 41% are still encountered23 when using UNIFAC-Dortmund (the mechanistic model recommended in the literature for this type of compounds22). The reader is referred to the work of Brouwer et al.22 for a detailed explanation of the current limitations of mechanistic models. With the recent advances in computational power and the digitization of large databases, data-driven methods are becoming an interesting focus of research in this context.18,21,24–26
The most commonly used mechanistic models for predicting activity coefficients can be grouped into 4 categories: solvation models,6–8,27,28 group contribution methods,9–11,29–32 linear solvation relationships12,33–37 and COSMO-RS.13,38,39 A recent work22 revisited the main eight models in these categories and compared them in the context of predicting γij∞ at 298.15 K. Solvation models are based on the Flory–Huggins theory27,28 and include the (1) Hildebrand model,6 (2) Hansen Solubility Parameters (HSP),7 and the (3) Modified Separation of Cohesive Energy Density (MOSCED) model.8 Similarly, the most common group contribution methods are also built on modifications of the Flory–Huggins model, and are mostly represented by (4) UNIFAC9 and its modifications, known as (5) UNIFAC-Lyngby10,29 and (6) UNIFAC-Dortmund.11,30–32 Linear solvation energy relationships combine solute and solvent descriptors using multilinear regression; the most prominent instance of such models is the one developed by (7) Abraham.12 By contrast, (8) COSMO-RS13,38,39 is based on quantum mechanical calculations and molecular energy interaction estimations (i.e., the resulting surface charges of the individual molecules are energy-optimally paired via an interaction model).
Many of the machine learning methods that have been developed for predicting infinite dilution activity coefficients are quantitative structure–property relationship (QSPR) models.14–19 Constructing such models typically requires two steps: (1) selection of molecular descriptors and (2) training of a regression method (e.g., linear regression, support vector machines, artificial neural networks). In this approach, many quantum-chemical and/or topological descriptors are calculated and then discarded if they are not relevant enough for an accurate prediction. A notable exception is the matrix completion methodology (MCM), which bases its predictions only on an incomplete matrix of solvent–solute activity coefficients.21,40,41 This eliminates the need to compute many expensive descriptors. However, the application domain of the MCM method is limited to the systems that define the matrix. Also, the accuracy of the predictions heavily depends on the correlation and the amount of the available data.41 A common limitation of both the QSPR and MCM methods available in the literature is that no information on the uncertainty of the predictions is provided (i.e., no error bars are usually reported).
In similar contexts of molecular property prediction, graph neural networks (GNNs) have shown promising results, achieving state-of-the-art performance in several applications.42–45 In this framework, a molecule is represented as a graph with nodes and edges defined by the atoms and chemical bonds, respectively. Additionally, in the context of predicting γij∞, one molecule acts as the solute while the other acts as the solvent. The structural differences and the molecular interactions between the solute and solvent naturally affect the value of γij∞. The principal contribution of this paper is precisely to investigate the use of GNN-based models for predicting γij∞ of organic solvent/solute systems at a constant temperature, and to compare the GNN predictions to the 8 most relevant mechanistic approaches described above. By using GNNs, the system descriptors are automatically learned in an end-to-end fashion using backpropagation, and predictions can be obtained for systems that are not included in the training set. We also explore the use of ensemble learning and hybrid modeling in order to develop residual GNNs that can potentially be used for explainability. To the best of our knowledge, GNNs have never been used for this purpose.
Fig. 1 Schematic representation of all possible binary mixtures formed by the 156 solutes and 262 solvents considered in this work. The black squares indicate mixtures for which experimental data on γij∞ at 298.15 K is available within the studied database.22†
Table 1 Atom (node) features used to construct the molecular graphs

Feature | Description | Dimension
---|---|---
Atom type | (C, Br, Cl, N, O, I, S, F, P) | 9
Ring | Is the atom in a ring? | 1
Aromatic | Is the atom part of an aromatic system? | 1
Hybridization | (sp, sp2, sp3) | 3
Bonds | Number of bonds the atom is involved in | 4
Charge | Atom's formal charge | 3
Attached Hs | Number of bonded hydrogen atoms | 4
Table 2 Bond (edge) features used to construct the molecular graphs

Feature | Description | Dimension
---|---|---
Bond type | (Single, double, triple, aromatic) | 4
Conjugated | Whether the bond is conjugated | 1
Ring | Whether the bond is part of a ring | 1
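As an illustration of how such one-hot features can be computed with RDKit, the following minimal sketch follows Tables 1 and 2. The helper names and the exact bins for the bond count, formal charge and attached hydrogens are our assumptions, not taken from the authors' code.

```python
# Featurization sketch following Tables 1 and 2; bin choices for degree,
# charge and attached hydrogens are assumptions, not from the paper.
from rdkit import Chem

ATOM_TYPES = ['C', 'Br', 'Cl', 'N', 'O', 'I', 'S', 'F', 'P']
HYBRIDIZATIONS = [Chem.rdchem.HybridizationType.SP,
                  Chem.rdchem.HybridizationType.SP2,
                  Chem.rdchem.HybridizationType.SP3]
BOND_TYPES = [Chem.rdchem.BondType.SINGLE, Chem.rdchem.BondType.DOUBLE,
              Chem.rdchem.BondType.TRIPLE, Chem.rdchem.BondType.AROMATIC]

def one_hot(value, choices):
    return [1.0 if value == c else 0.0 for c in choices]

def atom_features(atom):
    # 9 + 1 + 1 + 3 + 4 + 3 + 4 = 25 binary entries (cf. Table 1)
    return (one_hot(atom.GetSymbol(), ATOM_TYPES)
            + [float(atom.IsInRing()), float(atom.GetIsAromatic())]
            + one_hot(atom.GetHybridization(), HYBRIDIZATIONS)
            + one_hot(atom.GetDegree(), [1, 2, 3, 4])        # assumed bins
            + one_hot(atom.GetFormalCharge(), [-1, 0, 1])    # assumed bins
            + one_hot(atom.GetTotalNumHs(), [0, 1, 2, 3]))   # assumed bins

def bond_features(bond):
    # 4 + 1 + 1 = 6 binary entries (cf. Table 2)
    return (one_hot(bond.GetBondType(), BOND_TYPES)
            + [float(bond.GetIsConjugated()), float(bond.IsInRing())])
```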
As an example, Fig. 2 shows the SMILES-to-graph encoding for 1-butanol (CCCCO) and acetone (CC(C)=O). Using RDKit, the one-hot encoded feature vector of each atom in the molecule is obtained and defines one row of matrix A. Similarly, matrix B is constructed by stacking the row-vectors representing each chemical bond. Notice that in the orange and blue depictions of the matrices we highlight differences in the feature values using darker colors (e.g., a darker square in the first column represents the atom “Oxygen”, while a lighter color represents “Carbon”). However, as previously mentioned, in reality these matrices only contain binary values. The connectivity matrix C indicates whether two atoms in the molecule are connected to each other (indicated by black squares). Finally, using PyTorch Geometric49 the three matrices are stored in tensor collections representing each molecular graph.
Fig. 2 Schematic illustration of the construction of initial graphs from the solvent and solute SMILES. First, RDKit47 is used to extract the atom features, bond features and connectivity. Then, PyTorch Geometric (PyG)49 is employed to embed the corresponding matrices A and B into the graph with the connectivity described by matrix C.
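A sketch of this SMILES-to-graph step, reusing the featurization helpers assumed above, could look as follows; `mol_to_graph` is a hypothetical name, and each undirected bond is stored as two directed edges, as is conventional in PyG.

```python
# SMILES-to-graph sketch; assumes atom_features/bond_features as defined in
# the previous snippet.
import torch
from rdkit import Chem
from torch_geometric.data import Data

def mol_to_graph(smiles):
    mol = Chem.MolFromSmiles(smiles)
    x = torch.tensor([atom_features(a) for a in mol.GetAtoms()],
                     dtype=torch.float)                      # node features (matrix A)
    src, dst, e = [], [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        src += [i, j]; dst += [j, i]                         # both directions per bond
        e += [bond_features(bond)] * 2                       # edge features (matrix B)
    edge_index = torch.tensor([src, dst], dtype=torch.long)  # connectivity (matrix C)
    edge_attr = torch.tensor(e, dtype=torch.float)
    return Data(x=x, edge_index=edge_index, edge_attr=edge_attr)

solvent_graph = mol_to_graph('CCCCO')    # 1-butanol
solute_graph = mol_to_graph('CC(C)=O')   # acetone
```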
This way of representing molecules as graphs can be compared to other molecular representation techniques, such as the sets of functional groups used in group contribution methods. While the latter groups are defined manually, mostly using expert chemical knowledge, the graph representation allows methods such as GNNs to automatically optimize the binary-system fingerprint according to the structural information provided and its relative importance for predicting γij∞ (according to a provided error metric).
hv(l+1) = W hv(l) + Σw∈N(v) ϕE(evw(l)) · hw(l) | (1)
Fig. 3 Schematic representation of the proposed GNN-based model for predicting ln(γij∞). (a) First, each molecular graph (i.e., solvent and solute) is passed through the corresponding graph neural network. Here, the node features of the graphs are updated at each layer l according to the message passing scheme described by eqn (1). After L convolutional layers, each node contains information on its L-level neighbourhood. Then, the final graphs are obtained by averaging the L graphs generated during the convolutions (jumping knowledge50). Afterwards, the molecular fingerprints are obtained from the global pooling layer Set2Set.51 Finally, the binary-system representation Hsys is obtained according to eqn (2), and a multi-layer perceptron maps it to the activity coefficient prediction. (b) Representation of the update performed to a node v (here, corresponding to the central carbon of 1-butanol) at convolution layer l. The neighbouring edge features evw(l) (blue vectors) are transformed by the multi-layer perceptron ϕE to perform the dot product with the embeddings of the neighbouring nodes hw(l). The previous embedding of node v is multiplied by matrix W and added to the neighbouring features transformation to obtain the new updated features vector hv(l+1) (green vector).
Each convolutional layer l and its corresponding updated graph are depicted with distinct colors in Fig. 3a. The operations involved in the message passing scheme (eqn (1)) are represented in Fig. 3b for the central node of the molecular graph. First, the vectors evw(l) representing the two chemical bonds connected to the central node are passed through the function ϕE(·) and combined with the neighbouring node vectors hw(l) using the sum function. This “neighbourhood message” is then added to a linear transformation of the vector hv(l) of the node on which the message passing is performed. The resulting vector hv(l+1) defines the representation of the central node in the updated graph of the next layer (depicted in green in Fig. 3b).
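In PyTorch Geometric, a message passing layer of exactly this form is available as the edge-conditioned convolution NNConv, where an edge network maps each bond feature vector to a matrix acting on the neighbour embedding. A minimal sketch, with assumed layer sizes, is:

```python
# One convolution implementing eqn (1) via PyG's NNConv; the hidden size and
# the architecture of phi_E are illustrative assumptions.
import torch
from torch import nn
from torch_geometric.nn import NNConv

node_dim, edge_dim, hidden = 25, 6, 64  # feature sizes from Tables 1 and 2; hidden assumed

# phi_E maps each bond feature vector to a (hidden x hidden) matrix
phi_E = nn.Sequential(nn.Linear(edge_dim, 128), nn.LeakyReLU(),
                      nn.Linear(128, hidden * hidden))

embed = nn.Linear(node_dim, hidden)                # initial node embedding
conv = NNConv(hidden, hidden, phi_E, aggr='add')   # aggr='add' is the sum in eqn (1)

# usage on a graph from the previous snippet:
# h = embed(graph.x)
# h = conv(h, graph.edge_index, graph.edge_attr)   # one application of eqn (1)
```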
After L convolutional layers, L updated graphs have been generated. While some approaches only use the last updated graph to extract the final molecular representation,42,48 it has been shown that using the information contained in all L progressively updated graphs benefits the predictions of the GNN.50 This concept is known as “jumping knowledge”. In this work, the final graph is obtained by taking the mean of the L updated graphs. Then, this final graph is passed through a Set2Set51 global pooling layer that returns a single vector representing the corresponding molecular species (i.e., the molecular fingerprint). This pooling model is invariant to permutation and possesses more expressive power than simple sum or average pooling.48 After the molecular fingerprints of the solute Hsolute and the solvent Hsolvent are obtained, these are concatenated into a single vector Hsys, which can be interpreted as the binary-system fingerprint,
Hsys = Hsolvent‖Hsolute | (2) |
The key advantage of this approach is that the complete framework can be trained as an end-to-end system using backpropagation according to a specified error metric. This allows the GNN-based model to select relevant features at the level of the solute and solvent fingerprints via the graph convolutional operations (potentially related to the molecular-shape contribution to nonideality), and to learn the important solvent–solute interactions from the binary-system fingerprint via the final multi-layer perceptron (potentially related to the contribution of intermolecular interactions to nonideality). This is in contrast to the manual selection of groups and the empirical determination of binary group-interaction parameters that characterize the most successful group contribution methods,11,30–32 a cumbersome task that has been performed by only a few experts in the area over decades.
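Putting the pieces together, the following sketch outlines the two-GNN architecture described above (message passing, jumping-knowledge averaging, Set2Set pooling, concatenation as in eqn (2), and a final multi-layer perceptron). All class names, layer counts and sizes are our assumptions, not the authors' implementation.

```python
# Architecture sketch: one GNN per species, fingerprints concatenated (eqn (2))
# and mapped to ln(gamma_ij_inf) by an MLP. Names and sizes are assumptions.
import torch
import torch.nn.functional as F
from torch import nn
from torch_geometric.nn import NNConv, Set2Set

class SpeciesGNN(nn.Module):
    def __init__(self, node_dim=25, edge_dim=6, hidden=64, n_layers=3):
        super().__init__()
        self.embed = nn.Linear(node_dim, hidden)
        def phi():  # edge network mapping bond features to a (hidden x hidden) matrix
            return nn.Sequential(nn.Linear(edge_dim, 128), nn.LeakyReLU(),
                                 nn.Linear(128, hidden * hidden))
        self.convs = nn.ModuleList(
            [NNConv(hidden, hidden, phi(), aggr='add') for _ in range(n_layers)])
        self.pool = Set2Set(hidden, processing_steps=3)  # returns a 2*hidden vector

    def forward(self, data):  # assumes batched PyG graphs carrying a .batch vector
        h = self.embed(data.x)
        layer_states = []
        for conv in self.convs:
            h = F.leaky_relu(conv(h, data.edge_index, data.edge_attr))
            layer_states.append(h)
        h = torch.stack(layer_states).mean(dim=0)   # jumping knowledge (mean)
        return self.pool(h, data.batch)             # molecular fingerprint

class GammaInfGNN(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.solvent_gnn = SpeciesGNN(hidden=hidden)
        self.solute_gnn = SpeciesGNN(hidden=hidden)
        self.mlp = nn.Sequential(nn.Linear(4 * hidden, hidden),
                                 nn.LeakyReLU(), nn.Linear(hidden, 1))

    def forward(self, solvent, solute):
        h_sys = torch.cat([self.solvent_gnn(solvent),
                           self.solute_gnn(solute)], dim=-1)  # eqn (2)
        return self.mlp(h_sys).squeeze(-1)                    # ln(gamma_ij_inf)
```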
The complete dataset involves only molecular solvents. Other chemical species, such as ionic liquids, were not included in the present study. However, extending the present methodology to ionic liquids would be straightforward if a proper database were used. We select the hyperparameters based on the model performance on the validation set, and report results on the test set unless otherwise stated. The model is implemented in Python 3.8 using PyTorch Geometric.49 To ease the training, we add a batch normalization layer53 and a dropout54 of 0.1 after each convolutional layer (eqn (1)) to prevent overfitting. The activation function Leaky ReLU was used after each layer (eqn (1)), with the exception of the last convolutional layer. The Adam optimizer55 was used to minimize the mean squared error (MSE) loss function using batches of 32 binary systems for 200 epochs. The learning rate is reduced by a factor of 0.8 if the validation error does not decrease (using a threshold of 10−4) for 3 consecutive epochs. The optimal hyperparameters are obtained using Bayesian optimization and are explained in detail in the ESI.† All the experiments were performed on a single NVIDIA Tesla P100 GPU (16 GB).
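A training-loop sketch consistent with this setup (Adam, MSE loss, batches of 32, 200 epochs, plateau-based learning-rate decay) is shown below; the data loaders and the initial learning rate are assumptions.

```python
# Training sketch matching the described setup; train_loader/valid_loader are
# hypothetical loaders yielding (solvent_batch, solute_batch, y) tuples.
import torch

model = GammaInfGNN()                                # from the previous sketch
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial LR assumed
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, factor=0.8, patience=3, threshold=1e-4)     # LR schedule from the text
loss_fn = torch.nn.MSELoss()

for epoch in range(200):
    model.train()
    for solvent, solute, y in train_loader:
        opt.zero_grad()
        loss = loss_fn(model(solvent, solute), y)
        loss.backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(sv, su), y).item()
                       for sv, su, y in valid_loader)
    sched.step(val_loss)                             # reduce LR on plateau
```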
In this work, we have also studied the performance of parallel hybrid models obtained by coupling our proposed GNN-based method to the 8 mechanistic models described in the introduction. This coupling is achieved by adding the residual predictions of the GNN to the predictions of the mechanistic model in order to obtain the final γij∞ prediction. The performance of these hybrid models has been compared to the purely mechanistic and the purely GNN predictions, achieving lower prediction errors in several cases. The architecture of the GNNs was kept the same as described in Section 2.4 for each of the 8 cases. However, this time our proposed GNN-based model, arranged as an ensemble of 30 independent GNN models, was trained on the residuals rij rather than on ln(γij∞), according to eqn (3).
rij = ln(γij∞)exp − ln(γij∞)mech | (3) |
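As a sketch, and assuming the model class and training loop from the previous snippets, the residual ensemble can be set up as follows; the mechanistic predictions ln(γij∞)mech are assumed to be precomputed, and all names are illustrative.

```python
# Residual-ensemble sketch: 30 GNNs trained on eqn (3) residuals; their mean
# serves as the data-driven correction.
import torch

# eqn (3): residual targets, given precomputed tensors of experimental and
# mechanistic ln(gamma_ij_inf) values
r = ln_gamma_exp - ln_gamma_mech

models = [GammaInfGNN() for _ in range(30)]
# ... each member is trained on (solvent, solute) -> r with the loop above ...

def ensemble_residual(solvent, solute):
    with torch.no_grad():
        preds = torch.stack([m(solvent, solute) for m in models])
    return preds.mean(dim=0)   # averaged residual prediction
```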
Table 3 Prediction performance of the mechanistic models and the proposed GNN-based models

Model | Systems covered | MAE | SDEP | MSE | RMSE | R² | MAPE
---|---|---|---|---|---|---|---
Hildebrand | 54.66% | 2.55 × 10⁵ | 9.89 × 10⁶ | 9.79 × 10¹³ | 9.90 × 10⁶ | −7.92 × 10⁹ | 4.26 × 10⁵
HSP | 56.23% | 16.03 | 122.37 | 15232.31 | 123.42 | −0.27 | 66.95
UNIFAC (Ly) | 94.52% | 10.61 | 59.28 | 3626.24 | 60.22 | 0.56 | 32.99
UNIFAC | 94.52% | 10.68 | 60.66 | 3794.03 | 61.60 | 0.54 | 32.37
COSMO-RS | 97.22% | 10.78 | 67.18 | 4628.81 | 68.04 | 0.43 | 28.43
UNIFAC (Do) | 94.91% | 8.54 | 56.82 | 3301.85 | 57.46 | 0.59 | 26.28
Abraham | 44.27% | 4.18 | 33.58 | 1144.97 | 33.84 | 0.90 | 22.05
MOSCED | 46.12% | 3.15 | 13.49 | 191.80 | 13.85 | 0.44 | 20.58
GNN single (train) | — | 4.07 ± 0.57 | 31.06 ± 9.83 | 1077.93 ± 757.29 | 31.33 ± 9.82 | 0.88 ± 0.08 | 13.80 ± 0.52
GNN single (valid) | — | 3.83 ± 1.59 | 20.31 ± 11.90 | 571.35 ± 573.56 | 20.70 ± 11.96 | 0.89 ± 0.09 | 18.65 ± 2.69
GNN single (test) | 100% | 4.26 ± 0.60 | 29.60 ± 6.22 | 933.44 ± 419.07 | 29.91 ± 6.24 | 0.78 ± 0.1 | 25.27 ± 1.14
GNN single (complete) | — | 4.09 ± 0.49 | 30.37 ± 8.32 | 1008.47 ± 610.84 | 30.65 ± 88.31 | 0.87 ± 0.08 | 16.48 ± 0.61
GNN ensemble (train/valid) | — | 3.48 | 25.79 | 677.18 | 26.02 | 0.92 | 12.19
GNN ensemble (test) | 100% | 3.91 | 26.73 | 729.69 | 27.01 | 0.82 | 22.66
GNN ensemble (complete) | — | 3.57 | 25.98 | 687.68 | 26.22 | 0.91 | 14.29
Given that the different models achieve different coverage of the whole data set (cf. Table 3), the comparison among them is difficult. While the performance results of the 8 mechanistic models are reported on their corresponding complete feasible dataset, the GNN results are reported for each of the train/validation/test splits. Also, the performance of the GNN models (trained only on the training set) on the complete dataset is reported for comparison; this is indicated with the word complete in Table 3. Considering the GNN results on the test set, the individual GNN model achieves a lower MAE than all the mechanistic models except for the specialized Abraham and MOSCED methods (cf. Fig. 4a). It is clear, however, that precisely these two mechanistic models have a low coverage of predictable systems, 44.3% and 46.1% compared to the 100% of the GNN. The same behavior can be observed by looking only at the systems that can be predicted by the corresponding mechanistic model and are contained in the GNN test set (i.e., the intersection of feasible systems in the test set), as shown in Section S4.4 of the ESI.† Therefore, considering only the models that can cover more than 90% of the binary systems, the individual GNN outperforms them according to all the reported metrics. Nevertheless, the mean absolute percentage error (MAPE) of UNIFAC-Dortmund lies within one standard deviation of the individual GNN prediction (cf. Table 3). We can then conclude that, according to the MAPE, the individual GNN performs similarly to the state-of-the-art method UNIFAC-Dortmund. However, given that the MAE of the GNN is significantly better than that of UNIFAC-Dortmund, we can say that the GNN predicts highly non-ideal systems (where the absolute error is larger) better than UNIFAC-Dortmund. By contrast, the GNN ensemble model outperforms all the mechanistic models covering more than 90% of the data in all metrics, and even the Abraham model in all metrics except R² and MAPE (cf. Fig. 4a). A robustness study of the GNN predictions using a 5-fold cross validation is available in Section S4.7 of the ESI.†
In Fig. 5, we compare the GNN ensemble method to the UNIFAC-Dortmund model in the form of a parity plot for all feasible systems. It can be seen that the GNN ensemble reduces the number of outliers considerably, and that the overall spread of the predictions around the perfect-model line (gray) is much lower than that achieved by UNIFAC-Dortmund. Comparison parity plots for all mechanistic models, showing only the feasible systems (for the corresponding mechanistic model) contained in the test set, are available in Section S4.4 of the ESI.†
Fig. 6 shows the performance of all models (except the Hildebrand model baseline) according to their MAPE. The results are shown for the corresponding test data set only. It can be seen that the GNN performs significantly better than the mechanistic models when only each model's feasible data is considered. The exception is again the specialized MOSCED method, which has a low coverage of systems, as discussed in Section 3.1. The hybrid GNN models further improve this performance in all cases except for the UNIFAC models. Similar behavior was encountered in the comparison between bagging and boosting models for UNIFAC-Dortmund using the matrix completion methodology.40 This demonstrates that, in general, combining mechanistic and data-driven models yields more reliable predictions. Another advantage of this approach is that the level of influence of the data-driven correction on the mechanistic model (mech) can be easily controlled by an applicability domain coefficient β introduced in eqn (4).
ln(γij∞)pred = ln(γij∞)mech + βrij | (4) |
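As a sketch, and reusing the hypothetical ensemble_residual from above, the β-weighted hybrid prediction of eqn (4) amounts to one line: β = 0 recovers the pure mechanistic prediction, while β = 1 applies the full data-driven correction.

```python
# eqn (4): hybrid prediction with applicability-domain weight beta in [0, 1]
def hybrid_prediction(ln_gamma_mech, solvent, solute, beta=1.0):
    return ln_gamma_mech + beta * ensemble_residual(solvent, solute)
```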
As pointed out by other works,68,69 the fact that the applicability domain is accessible during the predictions of data-driven models is of key importance for their application in the real world.
The prediction errors Δln(γij∞) = ln(γij∞)exp − ln(γij∞)pred of the mechanistic, GNN ensemble and hybrid GNN models are shown in Fig. 7 in the form of density plots. The Hildebrand and HSP models are not shown here due to their poor performance compared to the other mechanistic models and the evident superiority of the GNN-based models over them. It can be seen that the highest density of all models is centered close to zero error. The only exceptions are the UNIFAC models, which present a clear deviation towards positive errors with respect to the experimental activity coefficient values. The density of errors in the case of COSMO-RS improved considerably when using the combined hybrid GNN model. This behavior can possibly be explained by the different sources of information that this model is using. On the one hand, the predicted charge density of the molecules and the intermolecular interactions are, to a certain degree and according to theoretical foundations, embedded in the mechanistic part. On the other hand, the 2-dimensional structural information of the molecules and the interactions predicted from the binary-system fingerprint are included by the GNN model, acting as an effective data-based correction.
For illustration, the parity plot comparing COSMO-RS predictions to the GNN ensemble and the hybrid GNN is provided in Fig. 8. It can be seen that the hybrid approach has a narrower error distribution across the range of γij∞ studied. This precise combination of two clearly different information sources is not present with the same clarity in the case of the UNIFAC models. A possible reason for this is that both the UNIFAC-based and the GNN-based models rely heavily on 2-dimensional structural information, and thus do not complement each other with a clear advantage. This connection between UNIFAC groups and the GNN convolutions has already been reported,42 especially in the context of higher-order GNNs.70 Comparison parity plots for all mechanistic models, showing only systems contained in the test set, are available in Section S4.4 of the ESI.†
Footnote
† Electronic supplementary information (ESI) available: Composition density of train/test data sets; hyperparameters tuning; performance metrics; data set description; additional results. See DOI: 10.1039/d1dd00037c |