Trevor N. Brown *a, James M. Armitage b, Alessandro Sangion a and Jon A. Arnot acd
a ARC Arnot Research & Consulting, Toronto, Ontario M4C 2B4, Canada. E-mail: trevor.n.brown@gmail.com; alessandro@arnotresearch.com; jon@arnotresearch.com
b AES Armitage Environmental Sciences, Ottawa, Ontario K1L 8C3, Canada. E-mail: aesenvsci@outlook.com
c Department of Physical and Environmental Sciences, University of Toronto Scarborough, Toronto, Ontario M1C 1A4, Canada
d Department of Pharmacology and Toxicology, University of Toronto, Toronto, Ontario M5S 1A8, Canada
First published on 26th September 2024
Per- and polyfluoroalkyl substances (PFAS) are chemicals of high concern and are undergoing hazard and risk assessment worldwide. Reliable physicochemical property (PCP) data are fundamental to assessments. However, experimental PCP data for PFAS are limited and property prediction tools such as quantitative structure–property relationships (QSPRs) therefore have poor predictive power for PFAS. New experimental data from Endo 2023 are used to improve QSPRs for predicting poly-parameter linear free energy relationship (PPLFER) descriptors for calculating water solubility (SW), vapor pressure (VP) and the octanol–water (KOW), octanol–air (KOA) and air–water (KAW) partition ratios. The new experimental data are only for neutral PFAS, and the QSPRs are only applicable to neutral chemicals. A key PPLFER descriptor for PFAS is the molar volume and this work compares different versions and makes recommendations for obtaining the best PCP predictions. The new models are included in the freely available IFSQSAR package (version 1.1.1), and property predictions are compared to those from the previous IFSQSAR (version 1.1.0) and from QSPRs in the US EPA's EPI Suite (version 4.11) and OPERA (version 2.9) models. The results from the new IFSQSAR models show improvements for predicting PFAS PCPs. The root mean squared error (RMSE) for predicting logKOW versus expected values from quantum chemical calculations was reduced by approximately 1 log unit whereas the RMSE for predicting logKAW and logKOA was reduced by 0.2 log units. IFSQSAR v.1.1.1 has an RMSE one or more log units lower than predictions from OPERA and EPI Suite when compared to expected values of logKOW, logKAW and logKOA for PFAS, except for EPI Suite predictions for logKOW which have a comparable RMSE. Recommendations for future experimental work for PPLFER descriptors for PFAS and future research to improve PCP predictions for PFAS are presented.
Environmental significance
QSPR predictions for partitioning properties of PFAS have been found to be inaccurate due to their unique properties and a lack of experimental data. Accurate values for partitioning are important for assessing the hazard and risk that PFAS pose to humans and ecosystems. This work finds that by leveraging recently published experimental data, QSPRs can be recalibrated to produce more accurate predictions for partitioning properties. The model validations and mechanistic insights from this work can be used to guide the improvement of other QSPRs, and to prioritize further experimental work.
Two general approaches for predicting PCPs are quantitative structure–property relationships (QSPRs) and poly-parameter linear free energy relationships (PPLFERs). QSPRs are developed from experimental datasets for a property of interest in which the molecule (structure) is expressed in terms of structural fragments, topological descriptors, or whole molecular descriptors.22,23 The fragment-based QSPR approach is used in several models within the US EPA's Estimation Program Interface Suite (EPI Suite™) program24 and for predicting PCPs and solute descriptors in the Iterative Fragment Selection Quantitative Structure–Activity Relationships (IFSQSAR) system developed by Brown and colleagues.25,26 The QSPRs in the US EPA's Open (Quantitative) Structure–activity/property Relationship App (OPERA) use a nearest neighbours approach by selecting the best descriptors for each property from a pool of fragments, topological, and other descriptors.23 PPLFERs are models calibrated with experimental data for PCPs and empirically-derived or predicted solute descriptors that correlate with molecular interactions of a solute of interest, e.g., a PFAS chemical, in a system of interest, e.g., octanol–water. There are two sets of PPLFER solute descriptors for PFAS in the literature. The original set of solute descriptors was developed by Abraham and colleagues and incrementally expanded over many years,27,28 exemplified by a recent publication of values for fluorotelomer alcohols (FTOHs).29 In the database of reliable solute descriptors developed in previous work30,31 there are 385 solutes containing fluorine atoms, 180 of which meet the definition of PFAS proposed by Gaines et al.17 Another set of solute descriptors was developed by Goss and colleagues,32–34 and is exemplified by the recent publication of Endo which presents values for PFAS, including FTOHs.35 There are 47 fully calibrated solutes in the Endo set, all of which meet the Gaines et al.17 definition of PFAS.
It should be noted that all the PFAS calibrated by Endo are neutrals, so the results of this work are only applicable to neutral PFAS. The two sets of solute descriptors are calibrated using the same types of data, i.e., gas chromatogram (GC) retention data and partition ratios in reference solvent-air or solvent-water systems.28,36 In the calibration some solute descriptors are fixed or directly measured while the others, primarily the hydrogen bonding and polarity descriptors, are simultaneously calibrated to fit the experimental data. Because of this simultaneous calibration differing assumptions and fixed values can result in different calibrated values. The primary difference between the two sets of solute descriptors is one solute descriptor with a fixed value, McGowan volume, to which Goss and colleagues made an alteration to bring it more in line with liquid molar volume, which improved predictive power for PFAS.32 However, this difference propagates through the calibration process causing the other solute descriptors to also be different between the two PFAS solute descriptor sets.
The goal of this work is to improve the IFSQSAR QSPRs to predict solute descriptors and subsequently parameterize PPLFER equations for common PCPs of PFAS by leveraging recently published PFAS solute descriptors and other data. To achieve this, we investigate the differences between the competing PFAS solute descriptor sets and make recommendations for selecting the most reliable values. This analysis leads to updated QSPRs for solute descriptors and refitted PPLFER equations which are implemented in IFSQSAR v.1.1.1 models for several PCPs. The new IFSQSAR property predictions are compared to predictions from the previous IFSQSAR version (v.1.1.0), predictions from QSPRs in EPI Suite and OPERA, COSMOtherm calculations, and to available measured data. Finally, recommendations to further improve the prediction of PCPs for PFAS using PPLFERs are provided.
logK = eE + sS + aA + bB + vV + c (1)
logK = eE + sS + aA + bB + lL + c (2)
logK = sS + aA + bB + vV + lL + c (3)
CAS | Name | Source (a) | E | S | A | B | V or VF | L | Ref. |
---|---|---|---|---|---|---|---|---|---|
2043-47-2 | 4:2 FTOH | Abraham | −0.51 | −0.43 | 0.84 | 0.41 | 1.172 | 2.520 | 29 |
2043-47-2 | 4:2 FTOH | Endo | −0.67 (b) | 0.35 | 0.60 | 0.31 | 1.352 | 2.421 | 35 |
111-27-3 | 1-hexanol | | 0.21 | 0.42 | 0.37 | 0.48 | 1.013 | 3.610 | 42 |
647-42-7 | 6:2 FTOH | Abraham | −0.83 | −0.89 | 0.79 | 0.54 | 1.525 | 2.960 | 29 |
647-42-7 | 6:2 FTOH | Endo | −1.04 (b) | 0.35 | 0.60 | 0.31 | 1.785 | 2.997 | 35 |
111-87-5 | 1-octanol | | 0.2 | 0.42 | 0.37 | 0.48 | 1.294 | 4.619 | 42 |
678-39-7 | 8:2 FTOH | Abraham | −1.19 | −1.25 | 0.82 | 0.51 | 1.875 | 3.470 | 29 |
678-39-7 | 8:2 FTOH | Endo | −1.87 (b) | 0.35 | 0.60 | 0.31 | 2.217 | 3.554 | 35 |
112-30-1 | 1-decanol | | 0.191 | 0.42 | 0.37 | 0.48 | 1.576 | 5.610 | 42 |
307-34-6 | Perfluorooctane | Abraham | −1.130 | −1.45 | 0 | 0.42 | 1.554 | 2.165 | 43 |
375-96-2 | Perfluorononane | Endo | −1.819 (b) | −0.19 | 0 | 0 | 2.131 | 1.571 | 35 |
111-65-9 | n-octane | | 0 | 0 | 0 | 0 | 1.236 | 3.677 | 44 |
111-84-2 | n-nonane | | 0 | 0 | 0 | 0 | 1.377 | 4.182 | 44 |

(a) Competing group of solute descriptors, Abraham and colleagues,29 or Goss and colleagues,32 exemplified by the recent publication of Endo.35
(b) Endo did not measure the E descriptor; E values used in this work are calculated from measured refractive indexes from other sources,32 or refractive indexes predicted using ACD Labs. The calculation of E includes the V descriptor, so an adjustment is made to the value when using VF.
Each competing set of solute descriptors can be compared to the values and trends observed for other chemicals in the database of reliable solute descriptors. This comparison shows that the competing solute descriptor sets have different merits and limitations, but this comparison does not conclusively favor one set or the other. More details about this comparison can be found in Section SI1.†
The competing solute descriptor sets can also be compared to the theoretical background and original calibration data of PPLFER solute descriptors to see which is most supported. Using gas–liquid chromatography (GLC) data for various polar stationary phases Abraham et al. recalibrated the S descriptor (then named πH2) for about 400 solutes46 from a previous version developed by Kamlet and Taft.47,48 This dataset was used to calibrate system parameters for various partitioning systems which have subsequently been used to calibrate the rest of the ca. 8000 solutes in the Abraham database. The dataset notably contains no negative values of S, and very few fluorinated chemicals. Only three solutes have a high degree of fluorination: 2,2,2-trifluoroethanol, hexafluoropropan-2-ol, and dodecafluoroheptan-1-ol. These are assigned S values of 0.60, 0.55, and 0.55 respectively, values which are more in line with the values of the Goss/Endo set. The strongly negative values of S suggested by the Abraham group are clearly outside of the calibration range of the original dataset of S values.
In early versions of the PPLFER equations Abraham used liquid molar volume (MV[l]) as a solute descriptor, but after some analysis in collaboration with McGowan it was decided to change from MV[l] to V.41 Abraham concluded that intrinsic molar volume was a better metric for the cavitation energy and found that using MV[l] instead of V while calibrating PPLFER equations would produce different system parameters s, a, and b which capture the molecular effects of dipolarity/polarizability and hydrogen bonding.41 There is more discussion on this topic in Section SI1.† When Goss et al. adjusted the V parameter for fluorine atoms they did so to make VF for fluorinated solutes fall into the general correlation between V and MV[l] for organic compounds,32 which is a major departure from the theoretical background of PPLFERs. Based on recent work26 other atoms with discrepancies between V and MV[l] have been identified, see Section SI1 and Fig. S2† for more discussion. Chlorine and bromine atoms also have a discrepancy, though smaller than fluorine, while boron and silicon have an even larger discrepancy than fluorine. The solute descriptors of organosilicon compounds have also been noted to be anomalous.34
More details of the QSPR development are available in the literature.25,26 Briefly, IFSQSAR QSPRs are group contribution models in which counts of molecular fragments are multiplied by coefficients and then summed to obtain predictions. The model development proceeds by first creating a pool of molecular fragments from the dataset to be predicted and then rational splitting into training and external validation datasets ensuring that the maximum possible number of fragments are represented in both the training and validation datasets. The training dataset is then split into internal cross-validation datasets. Model parameter selection proceeds by choosing fragments from the pool based on predictive power assessed by cross-validation, beginning with simple fragments and proceeding to more complex fragments. The model parameters are fitted by multiple linear regression (MLR) of fragment counts against the expected property values. The final steps are to define the AD and quantify the prediction uncertainty.
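The group-contribution form described above (fragment counts multiplied by coefficients, fitted by MLR) can be sketched as follows. This is a minimal illustration, not the IFSQSAR implementation: the fragment names, counts, and property values are all invented, and the fragment selection, cross-validation, and applicability domain steps are omitted.

```python
import numpy as np

# Hypothetical fragment-count matrix: rows are solutes, columns are fragments.
# Fragment names and all numbers are invented for illustration only.
fragments = ["CH3", "CH2", "CF2", "OH"]
counts = np.array([
    [1, 5, 0, 1],
    [1, 7, 0, 1],
    [2, 6, 0, 0],
    [0, 2, 4, 1],
    [0, 2, 6, 1],
    [2, 4, 0, 0],
], dtype=float)
# Hypothetical "expected" property values for the same six solutes.
y = np.array([3.6, 4.6, 3.7, 2.4, 3.0, 2.7])


def fit_mlr(X, y):
    """Ordinary least-squares fit of y = X @ coef + intercept."""
    design = np.column_stack([X, np.ones(len(X))])
    params, *_ = np.linalg.lstsq(design, y, rcond=None)
    return params[:-1], params[-1]


def predict(X, coef, intercept):
    """Sum of fragment counts times coefficients, plus the intercept."""
    return X @ coef + intercept


coef, intercept = fit_mlr(counts, y)
fitted = predict(counts, coef, intercept)
for name, value in zip(fragments, coef):
    print(f"{name}: {value:+.3f}")
```

In a real workflow the coefficients would be fitted on the training dataset only and then assessed on the external validation dataset.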
The existing IFSQSAR QSPRs can be updated with new data in two different ways. In the first method, the selected fragments are kept unchanged and new data are inserted into the existing training, validation and cross validation datasets, then MLR is applied to update the regression coefficients. In the second method the new data are again inserted into the existing training, validation and cross validation datasets, but the selected fragments are reset, and new fragments are selected from the original pool to train a new QSPR. The selected fragments would then be different because of the new data.
Partition ratios calculated with the solute descriptor sets and PPLFER equations were compared to the experimental data. More discussion of this comparison can be found in Section SI1,† but in brief the results are consistent with previous analyses by Endo and Goss,34 which found that the PPLFER equations make the most accurate predictions for PFAS when they have been included in the training set of the PPLFER equation, and when eqn (3) is used. Based on the stability of the calibration and the fit with experimental data the Endo set is recommended. Including PFAS in the calibration dataset of PPLFER system parameters is required for making accurate predictions for this chemical class, as demonstrated here and in previous work by Endo and Goss.34 While testing the calibration of the 47 PFAS solute descriptors from Endo it was confirmed that the accuracy of the partition ratio calculations was best when using VF instead of V. However, this was only true when using eqn (3). Using the Endo solute descriptors with eqn (1) or (2) resulted in a larger deviation between calculated and experimental values.
The 47 PFAS calibrated solute descriptors from Endo have been added to the database of reliable solute descriptors used to train the IFSQSAR QSPRs using eqn (3). The measured and curated partition ratio data in Endo35 have also been added to the system parameters training dataset. However, the IFSQSAR training and external validation datasets already contain 383 fluorinated solutes, some of which are from previous work of the Goss group,32,34,49 but most have been calibrated by the Abraham group. Removing so much data and adding only 47 solutes from Endo would likely degrade the predictive power of the QSPRs, so we only removed some of the fluorinated solutes calibrated by Abraham group based on reasonable selection criteria derived from the comparison between the two sets.
Some solutes fail both selection criteria, so the total number of PFAS solutes removed is 64. After applying the selection criteria there are 321 fluorine-containing solutes from the Abraham set remaining in the database of reliable solute descriptors, 116 of which meet the Gaines et al.17 definition for PFAS. The PFAS from the Abraham set are mostly small solutes that are likely refrigerants. The 47 Endo set solutes are added to these bringing the number of PFAS in the database of reliable solute descriptors to 163.
System | s | a | b | d | v | l | c | Total s.e. |
---|---|---|---|---|---|---|---|---|
logKAW | −2.127 (0.038) | −3.69 (0.038) | −4.783 (0.037) | | 2.505 (0.047) | −0.445 (0.013) | 0.504 (0.027) | 0.153 |
logKOA | 0.475 (0.05) | 3.566 (0.052) | 0.885 (0.049) | | 0.109 (0.059) | 0.892 (0.016) | −0.166 (0.028) | 0.156 |
logKOW | −1.219 (0.035) | −0.058 (0.028) | −3.579 (0.034) | | 2.702 (0.047) | 0.341 (0.012) | 0.326 (0.025) | 0.162 |
Dry logKOW (c) | −1.652 (0.063) | −0.124 (0.064) | −3.898 (0.062) | | 2.614 (0.076) | 0.447 (0.021) | 0.339 (0.039) | 0.219 |
logKO[w]O[d] (c) | 0.433 (0.072) | 0.066 (0.07) | 0.319 (0.07) | | 0.087 (0.089) | −0.106 (0.024) | −0.013 (0.046) | 0.272 |
logVP[l] (c) (Pa) | −1.309 (0.098) | −0.983 (0.217) | −0.558 (0.118) | −1.629 (0.257) | −0.779 (0.132) | −0.655 (0.035) | 7.084 (0.068) | 0.607 |
logSW[l] (mol L−1) | 0.831 (0.091) | 2.707 (0.213) | 4.218 (0.112) | −1.629 (0.257) | −3.316 (0.124) | −0.206 (0.033) | 0.194 (0.062) | 0.587 |
logSO[d][l] (c) (mol L−1) | −0.821 (0.11) | 2.583 (0.223) | 0.32 (0.128) | −1.629 (0.257) | −0.702 (0.145) | 0.241 (0.039) | 0.532 (0.073) | 0.627 |
logSO[w][l] (c) (mol L−1) | −0.388 (0.097) | 2.649 (0.215) | 0.639 (0.117) | −1.629 (0.257) | −0.615 (0.132) | 0.136 (0.035) | 0.519 (0.067) | 0.609 |

(a) The standard error (s.e.) for each system parameter is shown in parentheses.
(b) d is the system parameter corresponding to the term (A·B)^0.5.
(c) System parameters calculated by thermodynamic cycle. Total s.e. and s.e. of the coefficients are estimated by propagation of uncertainty.
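As a worked illustration of eqn (3), the sketch below evaluates logKOW, logKAW, and logKOA for 4:2 FTOH from the Endo solute descriptors listed in the solute descriptor table and the system parameters in the table above. It assumes that the six values reported for each partition ratio map to s, a, b, v, l, and c (that is, the d term is not used for these systems) and that the volume descriptor is VF. The function and variable names are illustrative and are not part of IFSQSAR, which uses QSPR-predicted descriptors rather than the calibrated values used here.

```python
# System parameters for eqn (3), transcribed from the system-parameter table.
# Assumption: the d column is blank for partition ratios, so the six reported
# values are s, a, b, v, l, c in that order.
SYSTEM = {
    "logKAW": dict(s=-2.127, a=-3.69, b=-4.783, v=2.505, l=-0.445, c=0.504),
    "logKOW": dict(s=-1.219, a=-0.058, b=-3.579, v=2.702, l=0.341, c=0.326),
    "logKOA": dict(s=0.475, a=3.566, b=0.885, v=0.109, l=0.892, c=-0.166),
}

# Endo solute descriptors for 4:2 FTOH (the volume entry is VF).
FTOH_42 = dict(S=0.35, A=0.60, B=0.31, V=1.352, L=2.421)


def pplfer_eqn3(p, d):
    """logK = sS + aA + bB + vV + lL + c (no E term in eqn (3))."""
    return (p["s"] * d["S"] + p["a"] * d["A"] + p["b"] * d["B"]
            + p["v"] * d["V"] + p["l"] * d["L"] + p["c"])


results = {name: pplfer_eqn3(p, FTOH_42) for name, p in SYSTEM.items()}
for name, value in results.items():
    print(f"{name} = {value:.2f}")
# logKAW = -1.63
# logKOW = 3.23
# logKOA = 4.72
```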
The fragment-based QSPRs for the individual solute descriptors have also been updated: the QSPR training and validation datasets in IFSQSAR were revised by removing and adding PFAS solutes as described in Section 3.1. The Endo solutes were split between the training and validation datasets in a ratio of 2:1. Solutes were sorted, and every third solute was assigned to the validation dataset. This was done in steps, sorting by partitioning properties where available and sorting by the L solute descriptors for PFAS with no measured partitioning, ensuring that the solutes with available partitioning data were present in both training and validation datasets. The regression coefficients of the MLR for all fragments in the QSPRs were then retrained with the new datasets; no fragments were added or removed in this process.
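The 2:1 split can be sketched as a sort followed by assigning every third solute to validation. This is a simplified, single-pass illustration with invented solute names and L values; which position in each group of three goes to validation is an assumption, and the stepwise sorting by partitioning properties described above is not reproduced.

```python
def split_every_third(solutes, key):
    """Sort solutes by `key`, then send every third one to validation."""
    ordered = sorted(solutes, key=key)
    training, validation = [], []
    for index, solute in enumerate(ordered):
        # Assumption: the third member of each sorted triple is held out.
        if index % 3 == 2:
            validation.append(solute)
        else:
            training.append(solute)
    return training, validation


# Hypothetical (name, L descriptor) pairs, for illustration only.
solutes = [("A", 1.6), ("B", 2.4), ("C", 3.0), ("D", 3.6), ("E", 2.0), ("F", 2.8)]
training, validation = split_every_third(solutes, key=lambda s: s[1])
print([name for name, _ in training])    # ['A', 'E', 'F', 'C']
print([name for name, _ in validation])  # ['B', 'D']
```

Sorting before the split spreads the validation solutes across the range of the sorting property rather than clustering them at one end.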
New PCP models have been implemented in IFSQSAR which combine the retrained solute descriptor QSPRs with the recalibrated PPLFER equations (system parameters) in the form of eqn (3). A new QSPR for VF has been implemented in IFSQSAR and is now used in these models instead of the V solute descriptor. These updates are part of IFSQSAR version 1.1.1. The previous versions of the partitioning models are still available in IFSQSAR and can be applied by users for comparison by specifying the version when making calculations. These new models were validated using data from Endo,35 who applied the quantum chemistry-based software COSMOtherm to predict logKOW, logKAW, and logKOA for the same solutes for which solute descriptors were calibrated, as well as a diverse dataset of other PFAS. COSMOtherm is an upper tier prediction method for PCPs and Endo found the root mean squared error (RMSE) between the predictions and PPLFER calculated values was 0.33 to 0.42 log units.
Fig. 1 Predicted vs. expected logKOW for (A) IFSQSAR old, (B) IFSQSAR new, (C) OPERA, and (D) EPI Suite.
Fig. 2 Predicted vs. expected logKAW for (A) IFSQSAR old, (B) IFSQSAR new, (C) OPERA, and (D) EPI Suite.
Fig. 3 Predicted vs. expected logKOA for (A) IFSQSAR old, (B) IFSQSAR new, (C) OPERA, and (D) EPI Suite.
Fig. 1C, 2C and 3C show OPERA predictions vs. expected values for logKOW, logKAW, and logKOA of PFAS. OPERA is a nearest neighbors QSPR, where predictions are made by comparing a chemical structure to structures in the database and calculating an average value from available experimental data. These types of models have good performance when interpolating between existing data. Other researchers have found that OPERA made accurate predictions of the partitioning of PFAS when compared to experimental data.50 However, it should be noted that the PFAS used in the previous model evaluation50 were all data rich PFAS which are most likely to have neighbors with experimental partitioning data in the internal OPERA database. In fact, when OPERA identifies an exact match for the predicted chemical within its internal database, it directly provides that value from the database without averaging with the nearest neighbors. In other words, the program is selecting the experiment value, rather than making a prediction. When OPERA predictions are compared to COSMOtherm predictions for novel PFAS from Endo, and even many of the PFAS with newly fitted PPLFER solute descriptors, the OPERA predictions are not as accurate. This is expected behavior because nearest neighbor QSPRs cannot extrapolate to new chemistries, only interpolate between existing chemistries. The horizontal lines of solutes in Fig. 2C for logKAW indicate that the OPERA predictions are using a small number of experimental PFAS data to make predictions for most novel PFAS.
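The interpolation-only behaviour described above can be illustrated with a toy nearest-neighbour predictor. This is not OPERA's algorithm or code: the one-dimensional descriptor, the stored values, the choice of k, and the unweighted average are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical database: one descriptor per chemical and a stored property value.
db_x = np.array([[1.0], [2.0], [3.0], [4.0]])
db_y = np.array([-1.0, -0.5, 0.0, 0.5])


def knn_predict(x, db_x, db_y, k=2):
    """Return the stored value on an exact match, else the mean of k neighbours."""
    dist = np.linalg.norm(db_x - x, axis=1)
    if np.any(dist == 0.0):
        return float(db_y[np.argmin(dist)])
    nearest = np.argsort(dist)[:k]
    return float(db_y[nearest].mean())


exact = knn_predict(np.array([3.0]), db_x, db_y)      # database value returned as-is
beyond_a = knn_predict(np.array([6.0]), db_x, db_y)   # outside the database range
beyond_b = knn_predict(np.array([10.0]), db_x, db_y)  # further outside, same neighbours
print(exact, beyond_a, beyond_b)  # 0.0 0.25 0.25
```

Queries outside the range of the database share the same neighbours and therefore receive the same prediction, which is one way horizontal bands can appear in a predicted vs. expected plot.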
Fig. 1D, 2D, and 3D show EPI Suite v.4.11 predictions against expected values for logKOW, logKAW, and logKOA of PFAS. Predictions for logKOW are good, comparable to IFSQSAR v.1.1.1 with high R2 (0.86–0.99) and low RMSE (0.38–1.27) values. EPI Suite predictions for logKAW and logKOA against experimental and PPLFER-calculated values have moderate to strong correlations with R2 values of 0.75 to 0.90 but they tend to over-predict volatility, with logKAW values biased high and logKOA values biased low, resulting in high RMSE values of about 2.3 log units for both. The predictions for logKAW and logKOA of novel PFAS are poor, with high RMSE values of 2.7 to 3.3 log units compared to the COSMOtherm calculated values. It should be noted that EPI Suite v.4.11 does not have an independent logKOA QSPR; the predictions are calculated by thermodynamic cycle from logKOW and logKAW, which does not account for the discrepancy between wet and dry octanol. Clearly the fragment-based QSPRs in EPI Suite can make good predictions for partitioning properties when they are calibrated with adequate data, as shown by the good predictions for logKOW, but some of the QSPRs are lacking adequate calibration data for PFAS.
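The thermodynamic cycle referred to here follows from KOA = KOW/KAW, so logKOA = logKOW − logKAW. The sketch below applies the cycle to the eqn (3) system parameters reported in this work, assuming the six values for each partition ratio are s, a, b, v, l, and c. It shows that the dry-octanol logKOW closes the cycle with logKOA to within rounding, whereas the wet-octanol logKOW leaves a residual equal to the logKO[w]O[d] parameters. It illustrates the cycle only and is not EPI Suite's implementation.

```python
# System parameters (s, a, b, v, l, c) transcribed from the system-parameter
# table in this work. Assumption: the d column is blank for partition ratios.
KAW = dict(s=-2.127, a=-3.69, b=-4.783, v=2.505, l=-0.445, c=0.504)
KOW_WET = dict(s=-1.219, a=-0.058, b=-3.579, v=2.702, l=0.341, c=0.326)
KOW_DRY = dict(s=-1.652, a=-0.124, b=-3.898, v=2.614, l=0.447, c=0.339)
KOA = dict(s=0.475, a=3.566, b=0.885, v=0.109, l=0.892, c=-0.166)
K_WET_DRY = dict(s=0.433, a=0.066, b=0.319, v=0.087, l=-0.106, c=-0.013)


def subtract(p, q):
    """Parameter-wise difference p - q, rounded to the table precision."""
    return {key: round(p[key] - q[key], 3) for key in p}


cycle_dry = subtract(KOW_DRY, KAW)    # logKOA from dry-octanol logKOW
cycle_wet = subtract(KOW_WET, KAW)    # logKOA from wet-octanol logKOW
gap = subtract(cycle_wet, cycle_dry)  # what the wet-octanol cycle misses

for key in KOA:
    print(f"{key}: tabulated {KOA[key]:+.3f}  dry cycle {cycle_dry[key]:+.3f}  "
          f"wet cycle {cycle_wet[key]:+.3f}  gap {gap[key]:+.3f}")
```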
The plot of logKOA values calculated by IFSQSAR v.1.1.1 showed an under-prediction bias with a slope of 0.623 when compared to the COSMOtherm calculated values, see Fig. S4A.† The logKOA values did not show the same under-prediction bias when compared to the experimental and PPLFER expected values. Hammer and Endo found that COSMOtherm tends to over-predict solvent-air partitioning, which would explain some of the observed bias.45 An extensive investigation of the under-prediction bias was undertaken but the cause could not be definitively identified. There is a notable discrepancy in the difference between partitioning of PFAS in wet vs. dry octanol, i.e., octanol that is saturated with water as is the case for logKOW measurements vs. octanol that contains no water as is the case for logKOA measurements. The new IFSQSAR v1.1.1 and the COSMOtherm predictions do not match the same trends observed for other chemical classes, see Section SI2† for details. During this investigation the A solute descriptor QSPR in IFSQSAR was re-created de novo to include the new PFAS data. This improved the statistics of the external validation but did not completely resolve the under-prediction bias, as seen in Fig. 3B the slope of the logKOA plot for IFSQSAR predictions against COSMOtherm predictions was increased to 0.727. IFSQSAR v.1.1.1 uses the A QSPR that was re-created de novo, and this is what is shown in the rest of the figures and statistics in Section 3.3.
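The statistics quoted for these comparisons (slope of predicted vs. expected values, and RMSE) can be computed as sketched below. The data are synthetic and chosen only to show what a slope below one looks like; regressing predicted on expected values is an assumption about how the reported slopes were obtained.

```python
import numpy as np


def rmse(predicted, expected):
    """Root mean squared error between predicted and expected values."""
    predicted, expected = np.asarray(predicted), np.asarray(expected)
    return float(np.sqrt(np.mean((predicted - expected) ** 2)))


def slope_intercept(predicted, expected):
    """Least-squares line predicted = slope * expected + intercept."""
    slope, intercept = np.polyfit(expected, predicted, 1)
    return float(slope), float(intercept)


# Synthetic example: predictions compress the range of the expected values.
expected = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
predicted = 0.7 * expected + 1.0

slope, intercept = slope_intercept(predicted, expected)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"RMSE = {rmse(predicted, expected):.3f}")
```

A slope below one with a positive intercept means that high expected values are under-predicted while low values are over-predicted, so the RMSE alone does not reveal the direction of the bias.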
Fig. S5† compares PPLFER calculations parameterized by solute descriptors predicted in ACD Labs. ACD Labs predicts the V solute descriptor rather than VF, and PPLFER equations from our previous work calibrated to use V were applied.26,30 The PPLFER predictions for logKOW made using the solute descriptors from ACD Labs show a better match with expected values than OPERA and IFSQSAR v.1.1.0, but worse than EPI Suite or IFSQSAR v.1.1.1. For logKAW and logKOA the PPLFER predictions using solute descriptors predicted by ACD Labs are a better match with expected values than predictions from EPI Suite or OPERA but are worse than predictions from IFSQSAR v.1.1.0 or v.1.1.1.
PFAS are relatively scarce in the training datasets used to develop EPI Suite v.4.11, OPERA v.2.9, and before this work also IFSQSAR. All these models are still only able to generate predictions for the neutral form of ionizable PFAS such as perfluoroalkyl acids (PFAAs). As documented in the literature for KOW,51,52 there can be substantial discrepancies in the predicted values generated by different QSARs/software packages and also in the predicted values generated by different versions of the same QSAR/software package. For example, the predicted logKOW of the neutral form of PFOS (referred to as logKOW,N) generated by EPI Suite KOWWIN v1.67 is 6.28 but is 4.49 if using EPI Suite KOWWIN v1.68 or later versions. This is due to the inclusion of the “–CF2(–CF2)(–CF2) (linear –CF2–core)” factor in KOWWIN v1.68 and later versions.
Upon inspection of some of the property data included in the US EPA CompTox dashboard for PFAAs which uses OPERA for property prediction, there are instances of pseudo-replication (same value attributed to multiple sources), mischaracterization of predicted values as “experimental”, and inclusion of anionic (i.e., charged form/salt) property values with neutral form property values. There are also instances when the OPERA v.2.9 predicted values are exactly or nearly identical to the average of “experimental” values (but which may in fact be a set of predicted values). As explained above, this is because the average “experimental” values from the US EPA CompTox database are included in OPERA and are preferentially selected when a perfect match is found. See Section S3† for additional details. While some of these issues are not uncommon when compiling large quantities of property data from numerous sources, it is important to account for them when selecting property values.
The introduction of VF and eqn (3) by Goss, Endo and colleagues has improved the predictive power of PPLFERs for PFAS, though we view VF as an interim method because it conflicts with the underlying theory as developed by Abraham and McGowan. Additionally, because we found deviations between V and MV[l] for other atoms, see Section SI1 and Fig. S2,† further adjustments may be needed or a new volume parameter selected. Another example of an adjustment made to the calculation of V is when van Noort et al. adjusted V for PCBs by fitting it along with the other solute descriptors.53 They attributed this change to steric effects when chlorine atoms are in the ortho position, but it is interesting that this adjustment was made for another class of highly halogenated solutes which have a discrepancy between V and MV[l]. Changing the V descriptor to a different characteristic volume might resolve the observed discrepancies but would require recalibrating the entire PPLFER system and the community should agree on what to use. Abraham's original criteria for selecting V are still a good guide,41 and they are as follows. (1) The descriptor should correlate with the cavitation free energy, i.e., the energy required to make space for the solute in a solvent. (2) The calculation should be trivial to make calibrating the other descriptors easier. (3) The descriptor should be largely an intrinsic solute property, independent of the partitioning system. Not explicitly stated by Abraham but also important is: (4) the descriptor should be orthogonal to the other descriptors as much as possible.
Criterion 2 is not very restrictive: a QSPR based on McGowan's method for calculating V can be fitted to any volume dataset, such as was done for MV[l] in our recent work.26 Molar volume at the critical point (critical volume, VC) is used as the characteristic volume in some equation of state models, and specific molecular interactions such as hydrogen bonding do not occur in the critical state so this would meet criterion 4. Partial molar volume in a reference solvent, such as water, supercritical water, or some other solvent may also be an option provided that the training data and model fitting are selected to reduce the effects of hydrogen bonding and dipole interactions. van Noort used the solvent accessible volume, which can be calculated by various software, to justify the adjustments made to V for PCBs.53
The work of Endo35 also includes another important departure from how the solute descriptors are typically calibrated. During the fitting procedure the PFAS were also included in the training dataset of the PPLFER system parameters, and both the solute descriptors for PFAS and system parameters of the PPLFER equations were calibrated together in an iterative procedure. Similar iterative fitting was done when Abraham was originally creating the PPLFER system, but the typical procedure now is to keep the system parameters fixed and calibrate only the solute descriptors,28 or keep the solute descriptors fixed and calibrate the system parameters. Endo kept the solute descriptors for non-PFAS solutes fixed, so the calibrated PFAS solute descriptors should still be consistent with the Abraham PPLFER system. There is a competing PPLFER system constructed by Poole54 which uses the same solute descriptors and equations as the Abraham system, but it uses the iterative refitting procedure to calibrate the entire system simultaneously, and the result is that the solute descriptors and system parameters are different enough that the two systems are no longer compatible. The Abraham system would likely also benefit from a full recalibration using this iterative procedure, or some other simultaneous fitting procedure.
Another aspect of the Abraham PPLFER system relevant for PFAS is the extension for ionized solutes because some of the most problematic PFAS are ionized at environmental pH. The system requires two additional solute descriptors to account for molecular interactions of ions, but the number of solutes with calibrated descriptors is still small. This system is a simplification, because it does not account for ion pairing and other processes which can affect ion partitioning. Properly accounting for these additional effects is still a research need. The easiest method to calibrate the new descriptors is using empirical regressions with the other solute descriptors.55 However, the number of ionizable groups with empirical regressions is small and these are unlikely to work for ionizable PFAS. In our previous work developing empirical regressions with solute descriptors30 PFAS were strong outliers that had to be excluded. Including the extension for the partitioning of ions in a full, iterative recalibration of the Abraham system may allow for expanding the applicability.
The solute descriptors and PPLFER equations calibrated by Endo, and those calibrated in this work, are only applicable to neutral PFAS and to the neutral forms of ionizable PFAS, so they will generally not be applicable to PFAAs. Caution and additional scrutiny are required when compiling PCP data for PFAS in general but PFAAs in particular because of their strong tendency to dissociate (i.e., pKa values < 2).56,57 Measured property data (e.g., water solubility, octanol–water partitioning, biopartitioning) for PFAAs should be assumed to predominantly represent the behaviour of the charged form unless it is explicitly stated that experimental conditions have been established such that the presence of the neutral form is favoured. Despite the extra steps taken in this work to improve the prediction of the A solute descriptor for PFAS with sulfonic and sulfonamide functional groups these still have some of the largest discrepancies between predicted and expected values for logKOW, logKOA, and logKAW. This may be because they are functional groups that are not well represented in the PPLFER training datasets, or it may be because they are ionizing. Sulfur-containing and ionizing PFAS are obvious candidates for future experimental work, measurements of the partitioning of neutral and ionized species would help to improve QSPRs and yield further mechanistic insights. However, for strongly ionizing PFAS such as those containing sulfonic acid functional groups measuring the partitioning of the neutral form is likely impossible, so some combination of theory, e.g., COSMOtherm calculations, and measurements will be required.
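To illustrate why measurements for strongly dissociating PFAAs mostly reflect the charged form, the neutral fraction of a monoprotic acid can be estimated from the Henderson–Hasselbalch relationship. The pKa of 2 and the pH values below are illustrative inputs only, not values reported in this work, and the sketch ignores ion pairing and other non-ideal behaviour.

```python
def neutral_fraction(pH, pKa):
    """Fraction of a monoprotic acid present in its neutral (protonated) form."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))


# Illustrative inputs: pKa = 2 is taken as an upper bound for a strong acid.
for pH in (2.0, 5.0, 7.0):
    print(f"pH {pH}: neutral fraction = {neutral_fraction(pH, 2.0):.2e}")
```

At pH equal to the pKa the two forms are present in equal amounts, and each additional pH unit above the pKa lowers the neutral fraction by roughly a factor of ten.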
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4em00485j
This journal is © The Royal Society of Chemistry 2024 |