Benedikt Winter a, Clemens Winter b, Johannes Schilling a and André Bardow *a
aEnergy and Process System Engineering, ETH Zürich, Tannenstrasse 3, 8092, Zürich, Switzerland. E-mail: abardow@ethz.ch
bOpenAI, 3180 18TH St, San Francisco, CA 94110, USA
First published on 29th September 2022
Knowledge of the phase equilibria of mixtures is crucial in nature and technical chemistry. Phase equilibria calculations of mixtures require activity coefficients. However, experimental data on activity coefficients are often limited due to the high cost of experiments. For accurate and efficient prediction of activity coefficients, machine learning approaches have recently been developed. However, current machine learning approaches still extrapolate poorly to the activity coefficients of unknown molecules. In this work, we introduce a SMILES-to-properties-transformer (SPT), a natural language processing network, to predict binary limiting activity coefficients from SMILES codes. To overcome the limitations of available experimental data, we initially train our network on a large dataset of synthetic data sampled from COSMO-RS (10 million data points) and then fine-tune the model on experimental data (20870 data points). This training strategy enables the SPT to accurately predict limiting activity coefficients even for unknown molecules, cutting the mean prediction error in half compared to state-of-the-art models for activity coefficient predictions such as COSMO-RS and UNIFAC (Dortmund), and improving on recent machine learning approaches.
To overcome the inherent lack of experimental data, predictive thermodynamic property models have been developed over recent decades for many molecular properties, e.g., COSMO-RS,3 COSMO-SAC,4 SAFT-γ Mie,5 and UNIFAC.6 These models can predict thermodynamic properties with increasing accuracy and are therefore particularly beneficial for molecule mixtures with missing experimental data. However, despite the vital advantages of predictive thermodynamic models, these models come with shortcomings. For example, calculating the surface charges of molecules for COSMO models is time-consuming, whilst UNIFAC is limited to known functional groups parametrized to experimental data. Moreover, these physically based predictive models are still less accurate than experiments.7
Computationally efficient alternatives to physically based predictive models are data-driven models using machine learning. Machine learning is currently a rising topic in chemical engineering, as summarized in multiple recent reviews8–10 that identify challenges in many areas such as optimal decision making, introduction and enforcing of physics, information and knowledge representation, and safety and trust.11 The application of machine learning has also already led to recent advances in thermodynamic property prediction. Alshehri et al.12 developed a data-driven model to predict 25 pure component properties based on a Gaussian process. The developed model surpasses classical group contribution models in accuracy. Chen et al.13 use a transformer-convolutional model to predict the sigma profiles of pure components with high accuracy.
To predict activity coefficients, matrix completion methods have been recently proposed that represent the limiting activity coefficient of binary mixtures as a matrix. In matrix completion methods, all mixtures are sorted into a solvent-by-solute matrix. Known mixtures are used to learn embeddings for each solvent/solute, which then can be used to fill the matrix by interpolating towards unknown combinations. Jirasek et al.14 proposed a matrix completion method to predict the limiting activity coefficients of binary mixtures at 298.15 K that exceeded the accuracy achieved by UNIFAC. Recently, Damay et al.15 extended the method of Jirasek et al.14 to capture temperature dependencies. The proposed model has a higher accuracy for the temperature-dependent prediction of limiting activity coefficients than UNIFAC. Chen et al.16 developed an approach to extend the UNIFAC-Il model17 for predicting limiting activity coefficients in ionic liquids by combining matrix completion with convolutional networks. These proposed approaches exceed the accuracy of the widely employed UNIFAC model in predicting limiting activity coefficients. Moreover, matrix completion approaches do not require any characterization of the molecules to train the model and predict thermodynamic properties, as the model solely learns from the correlations within the matrix. However, their lack of molecular characterization prevents matrix completion methods from extrapolating beyond the space of molecules available for training. Recently, Sanchez Medina et al.18 developed a graph neural network to predict limiting activity coefficients at constant temperature. In principle, this graph neural network is capable of extrapolating to unknown solvents and solutes, but the extrapolatory capabilities of the network were not tested. Thus, it is still unclear how well machine learning methods can extrapolate out of the realm of training data onto unknown solutes and solvents.
Here, we present a SMILES-to-property-transformer (SPT), a data-driven model with high accuracy for interpolation and extrapolation that can predict temperature-dependent limiting activity coefficients from nearly arbitrary SMILES, based on natural language processing and a transformer architecture.19 Due to their ability to learn structural relationships, transformer models have recently been shown to be successful in predicting the pure component properties of various molecules and pharmaceuticals.20,21 However, transformer models require large amounts of training data, which is typically unavailable for thermodynamic properties from experiments. To overcome the lack of experimental training data, we propose a two-step approach: first, the model is trained on a large amount of synthetic data from a physically based predictive model for limiting activity coefficients to convey the grammar of SMILES and the underlying physics of activity coefficients to the model. Second, the pretrained model is fine-tuned using available experimental data to improve accuracy and reduce the systematic errors of physically based predictive models. We compare the SPT model to state-of-the-art predictive thermodynamic models and ML approaches and demonstrate its high accuracy for predicting the temperature-dependent limiting activity coefficients of unknown molecules after fine-tuning.
As the backbone of the SPT, we adopt a decoder-only transformer based on the GPT-3 architecture26 as implemented by Karpathy27 in minGPT, with changes to the input encoding and the regression head of the model (Fig. 1). The GPT-3 architecture shows higher accuracy than, e.g., the transformer implementation of PyTorch,28 most likely due to the use of a Pre-LN transformer instead of a Post-LN transformer.29
In the first step of our model, only the molecules' structural information is passed to the model by constructing an input sequence from the SMILES of the solute and the solvent. Four special tokens are used to signal (1) the start of the first molecule, <SOS>, (2) the separation between the two molecules, <MOS>, (3) the end of the second molecule, <EOS>, and (4) the padding, <PAD>, which fills the input sequence to a fixed length of nseq, e.g.:
<SOS>, SMILESsolute, <MOS>, SMILESsolvent, <EOS>, <PAD>, …   (1)
Next, the input sequence is tokenized by assigning a number to each unique character of the SMILES code. In general, each token could comprise more than a single character. However, a single character per token is used in this work for simplicity. Consequently, the vocab contains the following tokens: <SOS>, <MOS>, <EOS>, and <PAD>, the characters that can be contained in a SMILES code adapted from Kim et al.,25 and a special token for water to clearly distinguish between pure water (SMILES “O”) and oxygen groups on hydrocarbons. Not all tokens included in the vocab are part of the molecules of our training data. Thus, the embedding of some tokens remains untrained in our final model. Including these untrained tokens makes the model easily expandable to more complex structures in later fine-tuning steps. However, evaluating SMILES that contain untrained tokens leads to unreliable results. The overall vocab and a list of trained and untrained tokens are available in ESI S1.†
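As an illustration, the sequence construction of eqn (1) and the character-level tokenization can be sketched as follows in Python; the vocabulary, the name of the water token, and the sequence length nseq = 128 are illustrative placeholders and not the full vocab listed in ESI S1.†

```python
# Illustrative sketch of the input-sequence construction and character-level
# tokenization. The vocabulary below is a placeholder subset; the full vocab,
# including the special water token, is listed in ESI S1.
SPECIAL = ["<PAD>", "<SOS>", "<MOS>", "<EOS>", "<H2O>"]
SMILES_CHARS = list("CNOSPFIBrclnos()[]=#@+-\\/123456789%")  # placeholder subset
VOCAB = {tok: i for i, tok in enumerate(SPECIAL + SMILES_CHARS)}

def encode_molecule(smiles: str) -> list[int]:
    """Map a SMILES string to token IDs, treating pure water specially."""
    if smiles == "O":                          # pure water gets its own token
        return [VOCAB["<H2O>"]]
    return [VOCAB[c] for c in smiles]          # one token per character

def build_sequence(solute: str, solvent: str, n_seq: int = 128) -> list[int]:
    """<SOS> solute <MOS> solvent <EOS> <PAD> ... padded to length n_seq."""
    seq = ([VOCAB["<SOS>"]] + encode_molecule(solute)
           + [VOCAB["<MOS>"]] + encode_molecule(solvent)
           + [VOCAB["<EOS>"]])
    assert len(seq) <= n_seq, "sequence longer than n_seq"
    return seq + [VOCAB["<PAD>"]] * (n_seq - len(seq))

tokens = build_sequence("CCO", "O")            # ethanol (solute) in water (solvent)
```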
After tokenizing the input sequence characterizing the solute and solvent, the input matrix X is constructed from the input sequence and the embedding matrix E. The embedding matrix E contains a learned vector of length demb for each token. The input matrix X is constructed by concatenating the embedding vectors belonging to the input tokens, resulting in an nseq × demb matrix. Next, temperature information is incorporated into the model by projecting the temperature into the embedding space via a linear layer and concatenating the resulting vector to the right of the input matrix. Therefore, the input matrix size is expanded to (nseq + 1) × demb. The input matrix now contains information about the tokens making up the molecules of the mixture and the temperature. At this stage, all tokens of the same type, e.g., ‘C’, are represented by the same token encoding, independent of their position in the molecule. However, information about the position of each token is crucial for the prediction of molecular properties. This positional information is incorporated into the input matrix in the next step by adding the learned positional encoding matrix D of size (nseq + 1) × demb to the input matrix X. As the vector in the first position of D is always added to the first token, the second vector of D to the second token, and so forth, these vectors can learn position-specific properties during training and combine them with the token information. After the positional encoding is incorporated into the input matrix, it is passed to the transformer block, the heart of the model.
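In PyTorch, this input construction can be sketched as follows; the dimensions are illustrative placeholders and do not necessarily correspond to the hyperparameters used in this work.

```python
import torch
import torch.nn as nn

# Sketch of the input construction: token embedding, temperature projection,
# and learned positional encoding. Dimensions are illustrative placeholders.
vocab_size, n_seq, d_emb = 64, 128, 512

token_embedding = nn.Embedding(vocab_size, d_emb)              # embedding matrix E
temperature_proj = nn.Linear(1, d_emb)                         # projects T into the embedding space
positional_enc = nn.Parameter(torch.zeros(n_seq + 1, d_emb))   # learned positional matrix D

def build_input(tokens: torch.Tensor, temperature: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, n_seq) integer IDs; temperature: (batch, 1) in K.
    Returns the input matrix X of shape (batch, n_seq + 1, d_emb)."""
    x_tok = token_embedding(tokens)                        # (batch, n_seq, d_emb)
    x_temp = temperature_proj(temperature).unsqueeze(1)    # (batch, 1, d_emb)
    x = torch.cat([x_tok, x_temp], dim=1)                  # append the temperature "token"
    return x + positional_enc                              # add positional information
```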
$$Z_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\mathrm{T}}}{\sqrt{d_k}}\right) V_i \qquad (2)$$
The query, key, and value matrices are calculated by multiplying the input matrix X with the learned matrices WiQ, WiK, and WiV. The matrices Qi, Ki, and Vi have the size (nseq + 1) × dk. The product of Qi and the transpose of Ki can be interpreted as the relative importance of each token to every other token. This result is normalized by the square root of the key dimension dk and passed to a softmax function, returning the attention from each token to every other token. The value matrix is then multiplied with the attention, leading to the matrix Zi of size (nseq + 1) × dk, which contains, for each token, information from the other tokens weighted by their importance.
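A single attention head following eqn (2) can be sketched as follows; batching and any masking are omitted for brevity, and the names mirror the notation above.

```python
import torch
import torch.nn.functional as F

def attention_head(X, W_Q, W_K, W_V):
    """One attention head as in eqn (2). X has shape (n_seq + 1, d_emb);
    W_Q, W_K, W_V are learned matrices of shape (d_emb, d_k)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # each (n_seq + 1, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k**0.5      # relative importance of tokens
    A = F.softmax(scores, dim=-1)                    # attention weights
    return A @ V                                     # Z_i of shape (n_seq + 1, d_k)
```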
The attention operation is repeated for each attention head. The resulting output matrices Zi of each attention operation are concatenated and projected back to the input size of (nseq + 1) × demb via a linear layer. Finally, the data are passed through a multilayer perceptron (MLP) containing a GELU non-linearity, concluding the transformer block. In the first MLP layer, the embedding dimension is increased by a factor of four; the second linear layer of the MLP projects it back down to the input size. Residual connections connect the input and output of the attention and MLP blocks, including their respective layer norms. Multiple transformer blocks can be stacked consecutively to increase the depth of the model. In this work, we use two consecutive transformer blocks. For a more in-depth and visual explanation, the reader is referred to the blog of Alammar.33
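The complete transformer block can be sketched as follows; the hyperparameters are placeholders, and PyTorch's built-in multi-head attention is used in place of the per-head computation shown above.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Transformer block sketch: multi-head attention and an MLP with a GELU
    non-linearity, each wrapped in a residual connection with layer norm.
    Hyperparameters are illustrative placeholders."""
    def __init__(self, d_emb: int = 512, n_heads: int = 8):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_emb)
        self.attn = nn.MultiheadAttention(d_emb, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_emb)
        self.mlp = nn.Sequential(
            nn.Linear(d_emb, 4 * d_emb),   # expand the embedding dimension by four
            nn.GELU(),
            nn.Linear(4 * d_emb, d_emb),   # project back to the input size
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around attention
        x = x + self.mlp(self.ln2(x))                        # residual around MLP
        return x

# Two consecutive transformer blocks, as used in this work:
encoder = nn.Sequential(TransformerBlock(), TransformerBlock())
```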
To generate the synthetic data, we use the COSMObase 2020 database. This database contains around 10000 molecules, resulting in more than 100 million possible binary combinations of solutes and solvents. Calculating activity coefficients for all combinations is computationally intractable. Thus, for each of the 10000 solutes, 500 random solvents are sampled at a temperature of T = 298.15 K, resulting in around 5 million solvent/solute combinations. Furthermore, 100 of the 500 random solvents per solute are additionally sampled at five random temperatures between 273.15 K and 598.15 K to provide temperature-dependent data. In total, around 10 million data points are sampled for pretraining, referred to as the COSMO dataset. We use the TZVPD-FINE parametrization and a maximum of 3 conformers for calculating the limiting activity coefficient of each data point.
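The sampling strategy can be sketched as follows; the molecule lists are placeholders, and the limiting activity coefficients of the sampled points are subsequently computed with COSMO-RS (TZVPD-FINE, maximum of 3 conformers).

```python
import random

T_REF = 298.15  # reference temperature in K

def sample_cosmo_points(solutes, solvents, n_solv=500, n_temp_solv=100, n_temp=5):
    """Sample (solute, solvent, T) points for pretraining: 500 random solvents
    per solute at 298.15 K, and for 100 of these solvents five additional
    random temperatures between 273.15 K and 598.15 K."""
    samples = []
    for solute in solutes:                            # ~10 000 solutes in COSMObase
        chosen = random.sample(solvents, n_solv)
        for i, solvent in enumerate(chosen):
            samples.append((solute, solvent, T_REF))
            if i < n_temp_solv:                       # 100 of the 500 solvents
                for _ in range(n_temp):               # 5 random temperatures each
                    samples.append((solute, solvent, random.uniform(273.15, 598.15)))
    return samples              # ~1000 points per solute, ~10 million in total
```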
To validate the performance of our machine learning model during the pretraining, the COSMO dataset is split into three validation sets. For this purpose, 5% of the solvents and solutes are initially removed from the training set (see Fig. 2). Crucially, preliminary tests showed that water cannot be entirely removed from the training set without compromising the prediction accuracy for this notable molecule. Removing solvents and solutes from the training set enables the creation of two validation sets: first, a validation set containing the cross-section of the excluded solvents and solutes, where the training data contain neither the solvent nor the solute. This validation set tests the extrapolation accuracy of the model for entirely unknown solute/solvent combinations and is referred to as Valext. Second, a validation set is created where either the solvent or the solute is contained in the training set, but not both. This validation set tests the extrapolation capability of the model for one unknown molecule while the other one is known. Since this validation set tests the edge of known structures, we call it Valedge. Finally, an additional 5% of the remaining solute–solvent combinations are randomly removed from the training set. If a solute–solvent combination exists for more than one temperature, the combination is removed for all temperatures. The resulting third validation set, called Valint, tests the interpolation capabilities of the model when solvent and solute are known from other combinations but not in precisely this combination. This validation set is most comparable to the matrix completion approaches discussed earlier, where both mixture components have to be known.
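The split into Train, Valint, Valedge, and Valext can be sketched as follows; names and data structures are illustrative, and water is kept in the training set as discussed.

```python
import random

def split_pretraining(data, frac_mol=0.05, frac_int=0.05, keep=("O",)):
    """data: records (solute, solvent, T, ln_gamma). Returns Train, Val_int,
    Val_edge, and Val_ext; molecules in 'keep' (water) always remain in training."""
    solutes = sorted({d[0] for d in data if d[0] not in keep})
    solvents = sorted({d[1] for d in data if d[1] not in keep})
    held_su = set(random.sample(solutes, int(frac_mol * len(solutes))))    # 5% of solutes
    held_sv = set(random.sample(solvents, int(frac_mol * len(solvents))))  # 5% of solvents

    val_ext = [d for d in data if d[0] in held_su and d[1] in held_sv]      # both unknown
    val_edge = [d for d in data if (d[0] in held_su) != (d[1] in held_sv)]  # exactly one unknown
    rest = [d for d in data if d[0] not in held_su and d[1] not in held_sv]

    combos = sorted({(d[0], d[1]) for d in rest})
    held_combos = set(random.sample(combos, int(frac_int * len(combos))))  # 5% of combinations
    val_int = [d for d in rest if (d[0], d[1]) in held_combos]             # removed for all T
    train = [d for d in rest if (d[0], d[1]) not in held_combos]
    return train, val_int, val_edge, val_ext
```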
To our knowledge, the largest publicly available dataset on limiting activity coefficients was published recently by Brouwer et al.35 This dataset contains 77173 limiting activity coefficients for various solute/solvent combinations and temperatures gathered from the literature. However, of the 77173 data points, around 52000 use ionic liquids or deep eutectic solvents as solvents and are thus excluded. Additionally, we excluded impure substances such as sunflower oil, solvents with specific phase orientations (nematic and isotropic phases), and uranium complexes. For 10 solvents/solutes, no SMILES code could be identified. Furthermore, some errors in the data by Brouwer et al.35 were corrected, such as wrong exponents, lnγ∞ instead of γ∞, misclassifications, or data entered in the wrong row. A list of all changes and an updated data table can be found in ESI S2.† Overall, 20870 suitable data points are identified and used for the fine-tuning of our model. The resulting dataset for the fine-tuning contains 349 solvents and 373 solutes in 6416 unique combinations in a temperature range from 250 K to 555.6 K. The distribution of the data in lnγ∞ and T is shown in ESI S3.† In the following, this dataset is referred to as the Brouwer dataset.
To test the performance of the fine-tuning, three validation sets are again defined as for the pretraining. Due to the much smaller amount of experimental data, n-fold cross-validation is used to determine the accuracy of the network, since a single validation set would be expected to have a high variance due to its small sample size (Fig. 2). To construct the training and validation sets, all solute/solvent combinations without water are split into 1000 subsets, each constituting one Valext,i. The solute/solvent combinations not part of Valext,i are assigned either to the edge validation set Valedge,i or to the training set Traini, depending on whether one or none of the two components is part of Valext,i. Subsequently, 5% of the training set Traini is randomly sampled to yield the validation set Valint,i used to test the interpolation capability. Finally, all data points are reassessed to determine whether they have to be moved to another validation set due to the removal of Valint,i from Traini; a simplified sketch of this procedure is given below. A large number of splits is required because testing for extrapolation demands that both the solvent and the solute be excluded from the training set. Due to the uneven distribution of the training data, moving too many mixtures into the validation set Valext makes it very unlikely that common molecules such as ethanol or hexane ever appear in the training data, as they are nearly always moved into Valedge. If many of these common molecules are excluded from the training, the training dataset becomes prohibitively small. For example, using 5 splits leaves only 500–600 of 21000 data points in the training set (2%).
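A simplified sketch of this n-fold split construction (omitting the final reassessment step) is given below; names and data structures are illustrative.

```python
import random

def finetune_splits(combos, n_splits=1000, frac_int=0.05, water="O"):
    """Yield (Train_i, Val_int_i, Val_edge_i, Val_ext_i) over the unique
    solute/solvent combinations (tuples of SMILES). The reassessment after
    removing Val_int_i from Train_i is omitted here for brevity."""
    no_water = [c for c in combos if water not in c]
    random.shuffle(no_water)
    for i in range(n_splits):
        ext = set(no_water[i::n_splits])              # one of the 1000 Val_ext subsets
        ext_molecules = {m for c in ext for m in c}
        edge, train = [], []
        for c in combos:
            if c in ext:
                continue
            (edge if ext_molecules & set(c) else train).append(c)  # shares a molecule with Val_ext?
        val_int = set(random.sample(train, int(frac_int * len(train))))   # 5% of Train_i
        train = [c for c in train if c not in val_int]
        yield train, sorted(val_int), edge, sorted(ext)
```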
Solvent–solute combinations with water are excluded from Valext,i for two main reasons: first, the unique nature of water makes it challenging to extrapolate water properties when only organic compounds are known within the training set. Second, we believe that applications are rare where the limiting activity coefficient of the unknown and unmeasured molecule water must be predicted. While water is excluded from the validation set Valext, the validation set Valedge still contains combinations with water as a known solvent and an unknown organic solute, which we envisage as likely use-cases. The results for the validation set Valint and Valedge, including only combinations with water, are available in ESI S6.†
Due to the varying number of data points for each solute/solvent combination, the size of the training sets varies. The sizes range between 15000 and 19270 for Train, 6 and 69 for Valext, 640 and 5000 for Valedge, and 640 and 1200 for Valint.
The result highlights the high interpolation and extrapolation capabilities of our pretrained model for predicting temperature-dependent limiting activity coefficients generated from COSMO-RS. Furthermore, the machine learning model is very fast, predicting around 3000 limiting activity coefficients per second on an RTX 2080 Ti without requiring precalculation of sigma surfaces. This high speed should remove property prediction as a bottleneck and allow for the exploration of larger spaces when searching for new components.
To analyze the performance of the fine-tuned SPT model, the Brouwer dataset is first predicted using the pretrained model (Fig. 4a). The pretrained model achieves an MSE of 0.32 and MAE of 0.39, which is comparable to the accuracy of COSMO-RS for the same dataset (MSE 0.36 and MAE 0.38).
The results of the n-fold cross-validation of the fine-tuned SPT are shown in Fig. 4. For interpolation (Valint), the fine-tuned SPT achieves an MSE of 0.06 and an MAE of 0.13 (Fig. 4b), and for edge extrapolation (Valedge), an MSE of 0.08 and an MAE of 0.16 (Fig. 4c). Thus, the prediction of the fine-tuned SPT model for interpolation (Valint) and edge extrapolation (Valedge) is close to the experimental accuracy of between 0.1 and 0.2.15 However, this high accuracy is only achieved if at least one of the mixture components is included in the training set. Still, for extrapolation (Valext), the MSE and MAE are only slightly higher, with values of 0.12 and 0.20, respectively (Fig. 4d). Notably, even on Valext, the SPT outperforms COSMO-RS (MSESPT 0.12 vs. MSECOSMO-RS 0.36) (see Section 4).
The highest errors are mainly obtained for mixture components containing nitrogen and silicon. However, the training data contain only a few data points with silicon-containing molecules. Thus, the prediction might improve with more training data. Overall, the fine-tuning improves the already high accuracy of the pretrained model for all validation sets, leading to a highly accurate prediction of temperature-dependent limiting activity coefficients. Some artifacts seen in Fig. 4 might also be the result of faulty measurements, as they originate from only a few publications. More thoroughly curated training and validation data might thus further improve the prediction. The results highlight the advantages of combining synthetic and experimental data for predicting thermodynamic properties using deep learning.
Table 1 Prediction accuracy on the Brouwer and Medina datasets (MAE, MSE, and fraction of predictions with |Δlnγ∞| < 0.3)

Dataset | Brouwer | | | Medina | |
Error | MAE | MSE | |Δlnγ∞| < 0.3 | MAE | MSE | |Δlnγ∞| < 0.3
COSMO-RS | 0.36 | 0.29 | 60.6% | 0.31 | 0.23 | 64.5%
UNIFAC | 0.35 | 0.45 | 63.9% | 0.28 | 0.33 | 74.9%
Damay et al. (on DDB) | — | — | (76.6%) | — | — | —
Medinaext | — | — | — | 0.47 | 0.52 | 51.1%
Medinaedge | — | — | — | 0.28 | 0.20 | 67.7%
Medinaint | — | — | — | 0.19 | 0.10 | 82.8%
SPText | 0.17 | 0.09 | 85.8% | 0.25 | 0.17 | 74.7%
SPTedge | 0.13 | 0.06 | 92.5% | 0.16 | 0.07 | 86.1%
SPTint | 0.11 | 0.05 | 94.0% | 0.13 | 0.05 | 92.5%
Fig. 5 Cumulative distribution of the prediction error for COSMO-RS, UNIFAC, SPText, SPTedge, SPTint, Medinaext, Medinaedge, and Medinaint using a common subset of the (a) Brouwer or (b) Medina dataset. For Valedge and Valint, the mean of the n-fold cross-validation is used. Data for Damay et al.15 are approximated from their publication and were evaluated on a different dataset.
The physical models, UNIFAC and COSMO-RS, have very similar performance, with UNIFAC surpassing COSMO-RS slightly with 63.9% of data below an error of 0.3 compared to 60.5% for COSMO-RS on the common dataset. COSMO-SAC-based models perform substantially worse than COSMO-RS and UNIFAC (38% for COSMO-SAC2002 and 50% for COSMO-SACdsp, see ESI S7†). The SPT achieves higher accuracy than COSMO-RS and UNIFAC even for extrapolation: on Valext, the SPT predicts 85.8% of all data points with |Δlnγ∞| < 0.3 for the compared mixtures. On the validation sets Valint and Valedge, the SPT achieves even higher accuracies, with |Δlnγ∞| < 0.3 for 94.0% and 92.5% of all combinations, respectively. While our ML model relies on the COSMO models to generate initial data for pretraining, the fine-tuning step on experimental data allows it to surpass the accuracy of the original COSMO models.
In a further analysis, we compare the SPT to the machine learning-based model from Damay et al.15 The authors use matrix completion and train the model to predict limiting activity coefficients from the commercial database DDB. The resulting model yields higher accuracy than the reference model UNIFAC for data taken from the DDB. The authors report that 76.6% of all data points are within |Δlnγ∞| < 0.3 when using leave-one-out validation. For a qualitative comparison, the results of Damay et al. (Fig. 10 of Damay et al.15) are shown in Fig. 5a. This result is most comparable to our validation set Valint (94.0% with |Δlnγ∞| < 0.3), since matrix completion only allows interpolation when both molecules are contained within the training data. However, since the authors used a different, non-public dataset for training (DDB), these results are not directly comparable to ours. Comparing UNIFAC on both datasets, Damay et al. report a higher accuracy of UNIFAC on the DDB dataset than we obtain for UNIFAC on the Brouwer dataset (71% with |Δlnγ∞| < 0.3 for UNIFAC on DDB vs. 63% for Brouwer). This result may indicate that the data in the DDB are of better quality. Thus, SPT's performance may be improved further when fine-tuned on the DDB data.
In contrast to matrix completion, the SPT allows for extrapolation to unseen and partly unseen solute/solvent combinations. Even the extrapolation and edge extrapolation accuracy of our model, with 85.8% and 92.5% of all data points within |Δlnγ∞| < 0.3, respectively, is high compared to the interpolation accuracy of the matrix completion model proposed by Damay et al.15 While the evaluation took place on different datasets and the results are thus not directly comparable, they still strongly suggest that the SPT can achieve higher accuracy in predicting limiting activity coefficients than matrix completion, albeit at a higher computational effort.
For a consistent comparison of the SPT model and the Medina model, we split the dataset from Sanchez Medina et al.18 (Medina dataset) into 200 training and validation sets according to our validation strategy discussed in Section 2.2.2. Subsequently, we train the Medina model and our model on the resulting 200 training sets (ESI S8†). Due to the lack of a test set to stop training and adjust the learning rate, we use the performance on Valext to set the learning rate and select the epoch with the lowest mean validation MSE out of the 200 training epochs across the 200 datasets for each validation set (Valext = 117, Valedge = 135, and Valint = 163). For the SPT, we use the performance at the final epoch (50) as previously. As in Section 4.1, for Valint and Valedge, the mean of the n-fold validation is calculated and used for each unique mixture. For the Medina model, training failed on the sets 87, 115, 149 and 182 for unknown reasons, and these sets are excluded.
The MSE and MAE of the Medina model on Valint as calculated by us (MSE: 0.10 and MAE: 0.19) reproduce the MSE and MAE reported by Sanchez Medina et al.18 using random splitting (MSE: 0.10 and MAE: 0.18) (Table 1). This result indicates that random splitting results in a test set that is similar to our Valint set and random splitting is thus not suitable to assess the extrapolation capabilities of models.
Fig. 5b shows the prediction error of COSMO-RS, UNIFAC (Dortmund), the Medina model, and the SPT model fine-tuned on the Medina dataset. The Medina dataset is reduced from 2810 mixtures to the 2469 mixtures that all models can calculate.
The SPT generally outperforms the Medina model on all validation sets. For Valint, 92.5% of the data points are within |Δlnγ∞| < 0.3 for the SPT compared to 82.8% for the Medina model. For Valedge, 86.1% and 67.7%, and for Valext, 74.9% and 51.1% of the data points are within |Δlnγ∞| < 0.3 for the SPT and the Medina model, respectively. The MAE of the SPT is about half the MAE of the Medina model for each validation set. In particular, the vast difference in performance for (edge) extrapolation highlights the strength of the SPT when predicting new molecules. As for the Brouwer dataset (Section 4.1), the SPT outperforms COSMO-RS and UNIFAC on the Medina dataset even for extrapolation. The Medina model also outperforms COSMO-RS and UNIFAC for interpolation, but performs similarly to COSMO-RS and worse than UNIFAC for edge extrapolation, and is surpassed by both COSMO-RS and UNIFAC for extrapolation. Please note that UNIFAC parameters were very likely fitted to mixtures contained in the Medina dataset, likely improving the UNIFAC performance for this dataset.
The results highlight the advantage of our pretraining on synthetic data to exploit scarce experimental data and extend the extrapolative abilities of our model. The obtained data-driven model shows a good understanding of molecular properties. Overall, the SPT performs slightly worse on the Medina dataset than on the Brouwer dataset, likely due to the smaller total amount of training data (2810 vs. 20870). Therefore, we analyze the data scaling of our SPT model in more detail in Section 5.
In addition to the increased accuracy, our SPT model requires 45 s to fine-tune for 50 epochs on the Medina dataset, while the Medina model requires around 4 min for 50 epochs on an RTX 2080 Ti, even though the Medina model has far fewer parameters (21000 vs. 6.5 million). The shorter training time can be vital if no GPU is available. However, the training time of the Medina model would likely improve with the use of mixed-precision training, and the SPT requires lengthy pretraining before fine-tuning.
To determine the scaling of the fine-tuning of the SPT model, we create 200 training datasets, each containing ntrain random unique solute/solvent combinations from the Brouwer dataset, with ntrain between 2 and 5000 solute/solvent combinations and water excluded. The remaining solute/solvent combinations in the Brouwer dataset are then sorted into the validation sets Valext, Valedge, and Valint. For large ntrain, only a few solute/solvent combinations remain in Valext and Valedge, since common molecules are likely to be included in the training dataset and are thus necessarily excluded from the validation sets Valext and Valedge. For example, for 5000 training mixtures, only 17 unique solute/solvent combinations remain in the validation set Valext across all 200 training datasets. Moreover, many of the 200 training datasets do not have a single solute/solvent combination in the validation set Valext. This small number of solute/solvent combinations for the validation set Valext leads to high variance. Thus, we only consider the validation sets Valext and Valedge for training sizes where more than 17500 of the 18348 solute/solvent combinations are still present. The cutoff point is ntrain = 80 for Valext and ntrain = 500 for Valedge. For Valint, the reverse is the case: here, small ntrain leads to unreliable results, and thus no ntrain below 50 is considered.
The MAE of Valext and Valedge decreases linearly with the size of the training dataset in log–log space (Fig. 6). The MAE of Valint decreases with a steeper slope, indicating that interpolation might be easier to learn. Furthermore, there is some indication that the slope becomes steeper for even larger training sets. For the investigated training sizes, no saturation is visible in any validation set, indicating that the accuracy of the machine learning model still improves with increasing amounts of experimental data for fine-tuning. Following this trend, between 10000 and 20000 solute/solvent combinations would be needed for training to reach an average MAE of lower than 0.15 for Valext, which is within experimental accuracy. The amount of required data would thus be smaller than the 31000 unique solute/solvent combinations available in the commercial database DDB, indicating that high-quality prediction of limiting activity coefficients is within reach.
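The extrapolation underlying this estimate amounts to a straight-line fit in log–log space, i.e., a power law MAE ≈ a · ntrain^b; the sketch below illustrates the procedure with placeholder values rather than the fitted results of Fig. 6.

```python
import numpy as np

# Placeholder training sizes and Val_ext MAE values, NOT the results of Fig. 6.
n_train = np.array([100, 300, 1000, 3000])
mae = np.array([0.45, 0.35, 0.27, 0.21])

# Straight-line fit in log-log space: log10(MAE) = b * log10(n_train) + log10(a)
b, log_a = np.polyfit(np.log10(n_train), np.log10(mae), 1)
n_required = 10 ** ((np.log10(0.15) - log_a) / b)   # size needed to reach MAE = 0.15
print(f"estimated training mixtures for MAE < 0.15: {n_required:.0f}")
```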
Even small amounts of experimental data used for fine-tuning lead to substantial improvements in the validation set Valedge. The SPT should thus require only a few experimental data points for fine-tuning to make accurate predictions around specific data points. The high performance of the SPT, even with limited experimental data, originates from the pretraining on synthetic data, which enables learning the underlying grammar of the molecular representation and the physics provided by the predictive thermodynamic model used to generate the synthetic data. The capability of our model to accurately predict similar mixtures with only a few experimental data points could be used to guide experiments by measuring and predicting in tandem, narrowing down a target region.
The SPT machine learning model surpasses the accuracy of conventional predictive thermodynamic models such as COSMO-SAC, COSMO-RS, and UNIFAC and recently proposed machine learning approaches based on matrix completion and graph neural networks.
Combining synthetic data with scarce experimental data opens new possibilities for the training of deep learning models for thermodynamic property prediction. Even small amounts of experimental data already lead to significant improvements in the prediction quality of the SPT. Furthermore, the main computational effort lies in the pretraining of the model on synthetic data, while the fine-tuning is computationally efficient. The efficient fine-tuning opens up possibilities to combine deep learning with automated experiments, where a model is continuously refined with experimental data while providing predictions of new promising candidates to measure. Such workflows could generate machine learning models that are highly accurate in specific domains.
During the pretraining, the model builds an inherent understanding of molecules. Preliminary tests show that this understanding allows the model to learn molecular properties other than limiting activity coefficients with little experimental data. This flexibility could turn the SPT into a Swiss-army knife of molecular property prediction applicable to many tasks.
Datasets and trained models are available at https://www.polybox.ethz.ch/
Footnote
† Electronic supplementary information (ESI) available. See https://doi.org/10.1039/d2dd00058j
This journal is © The Royal Society of Chemistry 2022 |