Mengxian
Yu
a,
Qingzhu
Jia
a,
Qiang
Wang
a,
Zheng-Hong
Luo
b,
Fangyou
Yan
*a and
Yin-Ning
Zhou
*b
aSchool of Chemical Engineering and Material Science, Tianjin University of Science and Technology, Tianjin 300457, P. R. China. E-mail: yanfangyou@tust.edu.cn
bDepartment of Chemical Engineering, School of Chemistry and Chemical Engineering, State Key Laboratory of Metal Matrix Composites, Shanghai Jiao Tong University, Shanghai 200240, P. R. China. E-mail: zhouyn@sjtu.edu.cn
First published on 4th October 2024
Rapidly advancing computer technology has demonstrated great potential in recent years to assist in the generation and discovery of promising molecular structures. Herein, we present a data science-centric “Design–Discovery–Evaluation” scheme for exploring novel polyimides (PIs) with desired dielectric constants (ε). A virtual library of over 100000 synthetically accessible PIs is created by extending existing PIs. Within the framework of quantitative structure–property relationship (QSPR), a model sufficient to predict ε at multiple frequencies is developed with an R2 of 0.9768, allowing further high-throughput screening of the prior structures with desired ε. Furthermore, the structural feature representation method of atomic adjacent group (AAG) is introduced, using which the reliability of high-throughput screening results is evaluated. This workflow identifies 9 novel PIs (ε >5 at 103 Hz and glass transition temperatures between 250 °C and 350 °C) with potential applications in high-temperature capacitive energy storage, and confirms these promising findings by high-fidelity molecular dynamics (MD) simulations.
Given that the multidimensional performance objectives of novel polymers desired in various fields differ, traditional empirically oriented structural design and performance optimization are hard to manage. Emerging data science14–18 exhibits tremendous potential for generating vast amounts of hypothetical chemical structures17,19–21 and discovering promising molecules,18,22–24 among others, which is expected to accelerate the development of novel polymers while easing the burden on researchers. Data-driven molecule generation strategies are ubiquitous and promising in the fields of drugs and chemistry,20,25–28 while their application in polymer design is rare.29,30 Polymers are generally stoichiometric molecules with long chain structures in contrast to small molecules, which may increase the complexity of designing polymers to some extent. Artificial intelligence (AI) algorithms for designing small molecules combined with polymer structure approximation representations (e.g., monomer, repeating unit, BigSMILES,31 and ring repeating unit32) allow for the creation of virtual polymer libraries.33–35
When confronted with the vast structural space, a strategy to rapidly optimize multiple properties of polymers to capture the candidate structures that meet productive life demands is sorely needed. Developed with the aid of machine learning (ML)11,18,21,36,37 algorithms, quantitative structure–property relationship (QSPR)18,38 models can predict the properties of target structures in seconds, which is widely recognized. This advantage of QSPR model is recently extended by applying it to high-throughput screening of unknown structural spaces to obtain ideal candidates.7,10,12,36,39 For instance, Chen et al.40 established a QSPR model for the dielectric constant (ε) of polymers with frequency, predicted 11000 candidate polymers available for synthesis, and obtained 5 polymers with the desired ε for capacitors and microelectronic applications. In addition, Yang et al.41 developed a multi-task ML model to predict the permeability of gases such as H2, CO2, and CH4, screened more than 9 million hypothetical polymers, and identified thousands of high-performance hyper-permeable polymer membranes. In a recent eye-opening study, Gurnani et al.10 discovered a range of dielectrics in the polynorbornene and PI families using the AI scheme (including the frequency-dependent ε-QSPR model), expanding the temperature range for potential applications of electrostatic capacitors. Admittedly, the high-throughput screening paradigm based on the QSPR model greatly reduces the time required for discovering new materials,37,42 yet further validation of hundreds or thousands of promising candidate molecules still requires extensive experiments or density-functional theory (DFT) calculations.43
Considering that the vast structural screening space may exceed the predicted structural range of the QSPR model, this can lead to unreliable high-throughput screening results, which in turn increases the time for meaningless experiments or DFT calculation validation. Therefore, we introduce atomic adjacent groups (AAGs)44 to evaluate the reliability of high-throughput screening results. Both macromolecules and organic small molecules can be viewed as combinations of several groups (substructures). By confirming the distribution relationship between the molecules in the high-throughput screening structure library and the model-applicable structure space, the reliability of the high-throughput screening results will be evaluated. Furthermore, although high-throughput synthesis is a great and absolute way to extensively demonstrate the properties of target molecules, the consumption of raw materials and time remains overwhelming. If the designed monomers are hard or impossible to synthesize, it will be difficult to put the polymer candidates into application; therefore consideration of the synthetic accessibility for novel polymers is needed.45,46
A “Design–Discovery–Evaluation” scheme for exploring novel PIs with desired ε, envisioned in this contribution, is shown in Fig. 1. To ensure synthetic accessibility of the novel PI, the virtual library was created by the virtual condensation polymerization of retro-generated diamines and dianhydrides, containing over 100000 PIs. ε (ref. 47) is a key parameter for polymer dielectric materials.9,47,48 So, a frequency-dependent ε-QSPR model was developed and subjected to comprehensive validation procedures to provide performance parameters for various aspects (e.g., predictability and robustness) of the model. The ε-QSPR model was subsequently applied for high throughput predictions of polymers in the virtual library to obtain prior structures with potentially high ε. The reliability of the high-throughput screening results was evaluated on the basis of the spatial scope of the chemical structure of the polymer modelling dataset analyzed by AAG. Eventually, testable novel polymers for energy storage application were proposed, and high-fidelity molecular dynamics (MD) simulations were undertaken to verify the accuracy of the data science-centric “Design–Discovery–Evaluation” scheme.
Norm descriptors are employed for feature engineering after the PI structure of the ring repeating unit (RRU) representation, which is implemented by calculating the norm indices (see ESI eqn (S1–S6†)) of the atomic distribution matrix M (consisting of hydrogen (H)-containing and H-suppressed structures). M is obtained from eqn (1) which is regarded as a combination of the step matrix (MS) and the property matrix (MP). Furthermore, MSs (seven in total, more details are listed in ESI eqn (S7–S13†)) are derived by mapping the position of each atom in the topology and its connectivity, while MPs are obtained by mapping the fundamental properties of each atom (see ESI Table S1†). Thus, norm descriptors comprehensively reflect the features of the chemical structure in a mathematical form.
(1) |
The SHapley Additive exPlanations (SHAP)49 model interpretation technique is a tool to understand the importance of the descriptors in the QSPR models. Derived from game theory, SHAP quantitatively explains the relationship between the descriptors and the final output by calculating the Shapley value to analyse the contribution of each descriptor to the prediction.
Statistical parameters involved in model development and validation processes, such as the squared correlation coefficient (R2), are defined in ESI Table S2.†
The molecular dynamics simulation was carried out with the COMPASS52 force field using the Forcite module in Materials Studio 2019 (Dassault Systemes, France) modelling software. The structure of the RUs was constructed using the visualization module and a polymer chain consisting of 20 RUs was built using the homopolymer tool.53 Optimization of their geometry was performed using a smart algorithm. 10 PI chains were packed into a periodic box in an amorphous cell structure with an initial density (ρ) of 1.0 g cm−3. Electrostatic interactions were calculated using the Ewald summation method and van der Waals interactions were calculated using an atom-based approach. The Norse thermostat54 and the Berenson barostat55 were applied to control the temperature and pressure with a Q-ratio of 0.01 and a decay constant of 0.1 ps.
After geometrical optimization of the established periodic box, 10 NVT annealing kinetic cycles were executed at 400–800 K to explore the global energy minimum conformation. A 500 ps NVT kinetic simulation was run at 298 K for the lowest energy conformation to remove internal stresses and stabilize the system temperature within 5% of the set temperature. Finally, an 800 ps NPT kinetic simulation was run to obtain the equilibrium density curve at 298 K and 0.0001 GPa.
ε∞ = nD2 | (2) |
nD2 = 1 + 4πχ | (3) |
(4) |
Only one of the same diamines (or dianhydrides) was retained, giving unique 487 diamines and 210 dianhydrides from 1280 PIs. They were subjected to high-throughput virtual condensation subsequently and a PI virtual library was created prior to experimental synthesis. In detail, by simulating the condensation process of diamine and dianhydride, i.e., the hydrogen atoms in the amino group of the diamine are removed and the oxygen atom of ether group in the anhydride group of the dianhydride is replaced with a nitrogen atom, and then two-by-two splicing is performed, finally 102270 novel PIs are generated. A complete list presented in the SMILES form, is available in the ESI.†
The PI-ε model developed in eqn (5) involves 19 variables, one of which is a frequency-dependent descriptor, which allows the model to predict ε of PI at multiple frequencies. The reported model descriptors (I) with their corresponding coefficients (b) are shown in ESI Table S3.†
ε(freq.) = α + β × log10freq. | (5) |
(5a) |
(5b) |
ntraining = 1013; Rtraining2 = 0.9754; AAEtraining = 0.1175. |
ntesting = 243; Rtesting2 = 0.9821; AAEtesting = 0.1138. |
Fig. 3b displays the external validation results of the model. The data points are closely distributed on both sides of the diagonal line and the Rtesting2 value (i.e., 0.9821) is larger than the Rtraining2 value (i.e., 0.9754), which indicates that the model learns the underlying functional relationship between structural features and ε with high prediction accuracy. Fig. 3c–e show the validation results of LOPO-CV & LODPO-CV and their error distribution with the model. The Q2 values of 0.9654 and 0.9743 are the validation results of LOPO-CV and LODPO-CV, both of which indicate excellent stability of the model. Meanwhile, the absolute error (AE) is mostly concentrated in the range of 0–0.2 for internal validation and model results, which supports the good robustness of the established model. Furthermore, the validation results and error distribution plots show that the accuracy of LODPO-CV is higher than that of LOPO-CV, which is due to the model pre-learning the structural features, resulting in the “false high” phenomenon.
The Williams plot and the 10000 Y-random validation results are shown in ESI Fig. S3a and b.† The prediction values of the developed model for most polymers were found to be reliable, as demonstrated by the Williams plot, which supports the satisfactory predictability of the model. As shown in Fig. S3b,† the average values of RY2 (0.0151) and QY2 (0.0331) of the model are much smaller than Rtraining2 and Q2, thus ruling out chance correlation in the modelling.
To better understand the physical insight between the structure and properties captured by the model, the SHAP method was adopted to obtain the rankings of descriptor importance. As shown in Fig. 3f, the arrangement from left to right is the order of important to unimportant descriptors revealed by the SHAP method. An increase in the value of the feature (i.e., the scatter is red) results in a positive SHAP value, which usually indicates that the increase in the value of the descriptor leads to an increase in the target property value, i.e., the descriptor has a positive influence on the target properties. I1, I2 and I11 occupy the top 3 positions in the importance ranking and the number of outermost electrons (MPnoe) and ionization energy (MPion) of the atom in them may positively influence the ε of the PI. This may indicate that when designing high dielectric constant PIs, choosing atoms with larger ionization energies and a higher number of outermost electrons is more likely to be effective. Furthermore, we observe that the frequency-dependent descriptor I18 in the developed model has a negative effect on the ε of the PI, which is consistent with the experimental phenomenon that the dielectric constant decreases with frequency.
The AAG-based evaluation method workflow is shown in Fig. 4a, which evaluates the reliability of model prediction values by analysing the distribution of each AAG of the predicted polymer in the set of modelled AAGs. After distinguishing the connective atoms and the endpoint atoms in the PI-RRU structure, the connective atoms are sequentially designated as the core atoms. Centering on the core atom, the AAGs in the PI-RRU structure were divided according to the core atom, its adjacent endpoint atoms and the type of their chemical bonds, as well as its adjacent connective atoms and the type of their chemical bonds. With 55 unique AAGs for the polymers used in the modelling, Fig. S4–S6† illustrate the distribution of each AAGs, including the total number of occurrences of each AAG and the number of each AAG present in every single polymer. Benefiting from the polymer structure approximation represented by the RRU, it has connectivity and thus groups originating from the polymerization site are considered.
Based on the AAG distribution in the modelling dataset, AAG analysis was performed for 102270 PIs in the virtual library to evaluate the reliability of their predicted values. To be specific, if one AAG of the predicted polymer does not appear in the modelling process, the predicted value is considered to have “low reliability”; if at least one of the AAGs in the predicted polymers occurs in fewer modelled substances (<5) or its number is outside the upper and lower limits of the AAG counts for the modelling substances, the predicted value is deemed to have “moderate reliability”; all other cases are classified as “high reliability”.
ε = b + klog10freq. | (6) |
The variation of ε at multiple frequencies for the same PI in the modelling dataset is analysed, and the relationship between ε and frequency is approximated using eqn (6). Typically, ε decreases with the logarithmic exponential increase in frequency. Therefore, it is more realistic when the slope (k) in eqn (6) is negative. The intercept (b) in eqn (6) corresponds to ε when the frequency is 1 Hz. The dependence curves were obtained by fitting the predictedε values for each polymer against logarithmic exponential of their corresponding frequencies. As shown in Fig. 4b, the ε-frequency dependence curves of the PIs evaluated as “high reliability” have k-values less than 0, which is consistent with reality, and the b-values are between 2 and 7, which is minimal deviation from the ε-range of the modelling dataset (i.e., 1.49 to 6.53). In contrast, the ε-frequency dependence curves of the PIs evaluated as “low reliability”, and “moderate reliability” show too many outliers. Their ε-frequency dependence curves have b-values in the range of [−2, 8], which not only deviates significantly from the range of the modelling data but also appears unrealistic. This demonstrates that evaluating the reliability of QSPR model high-throughput screening results based on the AAG mitigates the issue of unreliability caused by predicted structures falling outside the judgement range of the model. The distribution of PI rated as “high reliability” at six specific frequencies of ε was analyzed to discover novel polymers with high ε potential, as seen in Fig. 4c. Even though most of the designed polymers have ε in the range of 2 to 5 at frequencies from 101 Hz to 106 Hz, there are still a few novel PIs showing high ε potential.
Fig. 5 (a) The analysis of the 9 candidate PIs with “high reliability”; (b) possible dianhydride and diamine feedstocks for the 9 candidate PIs. |
Fig. 5b summarises the possible diamine and dianhydride feedstocks for the 9 candidate PIs. Notably, the source of dianhydride for all 9 candidate PIs is likely to be 3,3′,4,4′-diphenylsulfonetetracarboxylic dianhydride. Of the 9 diamine sources, both 2,3-sulfonyldianiline and 2,4-sulfonyldianiline involve a sulfonyl group. This suggests that an increased number of sulfonyl groups in the polymerised main chain increases internal rotation, allowing easy rotation of the polymer chain, lower proximity interactions and reduced dipole moments, thus contributing to the high ε. This ensures that the phenyl groups on the main chain enhance the rigidity of the polymer, making the polymer chain segments less prone to mobility. This may explain the emergence of new PIs with high ε potential. Also, the polymerisation of the 6 potential diamine sources increases the asymmetry of the polymer chains, which enhances the polarization of the polymer and contributes to the high ε. Furthermore, these feedstocks can all be synthesized or even purchased, especially 1,2-phenylenediamine, 1,3-phenylenediamine, and 1,4-phenylenediamine, which are common dianhydride feedstocks. As a common source of dianhydride for the 9 candidate polymers, 3,3′,4,4′-diphenylsulfonetetracarboxylic dianhydride, is also an easily available feedstock. This somewhat reduces the challenge for synthesizing candidate PIs.
To further confirm the identified PIs with high ε, the RU structures of these 9 PIs were established; DFT calculations and all-atom molecular dynamics simulations were carried out for the nine novel PIs. For obtaining the values of α needed for the calculation of ε, geometrical optimization and frequency calculations were performed by the DFT method. Of note, the end saturated non-hydrogen atoms of the bonded RUs were added to the optimized RUs considering the periodicity and connectivity characteristics of the polymer (see ESI Text† for the Cartesian coordinates). The ρ values used to calculate ε were acquired by all-atom molecular dynamics simulations. The amorphous cells of 9 PI candidate materials were constructed by placing 10 polymer chains containing 20 repeating units into a periodic box with an initial ρ of 1.0 g cm−3, as shown in Fig. S7a–S15a.† Subsequently, 10 NVT annealing kinetic cycles were executed at 400–800 K to explore the global energy minimum conformation. Based on the lowest energy conformation, 500 ps NVT kinetic simulations were run to remove internal stresses and stabilize the system temperature within 5% of the set temperature. The energy stability curves of the 9 PI candidates at 298 K are shown in Fig. S7b–S15b.† After ensuring the energy stability of the systems, NPT kinetic simulations were run for 800 ps to obtain equilibrium ρ curves. As shown in Fig. S7c–S15c,† the ρ of each system increased dramatically in the first 100 ps and the ρ of each system remained essentially stable in the last 200 ps. Lastly, the average of the last 100 ps of the NPT kinetic simulation was calculated as the ρ value used to calculate ε. The molar mass of RU (Mw), α, and ρ for the derivation of ε are listed in Table 1.
PI | α | ρ | M w | α′ | N | χ | n D | ε DFT | ε QSPR | diff.g |
---|---|---|---|---|---|---|---|---|---|---|
a The unit of polarizability (α) is a.u., where 1 a.u. = 1.6487772754 × 10−41 C2 m2 J−1. The unit of density (ρ) is g cm−3, and that of Mw is g mol−1, of polarizability volume (α′) is 10−29 C2 m2 J−1, of molecular number density (N) is 1027 m−3. The rest are dimensionless quantities. b Calculated using density functional theory (DFT). c Calculated by molecular dynamics (MD) simulation. d Optical susceptibility (χ). e Refractive index (nD). f When the frequency is greater than 1014 Hz, i.e., the polarization time is 10−14 s, both orientation polarization and atomic polarization are less likely to occur, in which case ε = nD2. Thus, εQSPR is the predicted value of the model at 1014 Hz. g diff. = |εQSPR − εDFT|. | ||||||||||
PI_CD24351 | 340.2053 | 1.32 | 432.41 | 5.0439 | 1.8007 | 0.1466 | 1.6854 | 2.9035 | 3.0517 | 0.1482 |
PI_CD24416 | 437.5390 | 1.362 | 572.56 | 6.4870 | 1.4325 | 0.1521 | 1.7060 | 2.9104 | 4.0504 | 1.1400 |
PI_CD24366 | 442.9503 | 1.285 | 572.56 | 6.5672 | 1.3515 | 0.1412 | 1.6655 | 2.7740 | 3.9935 | 1.2195 |
PI_CD24719 | 336.6686 | 1.361 | 448.41 | 4.9914 | 1.8277 | 0.1476 | 1.6894 | 2.8540 | 2.9090 | 0.0550 |
PI_CD24529 | 352.9113 | 1.277 | 446.43 | 5.2322 | 1.7226 | 0.1447 | 1.6787 | 2.8181 | 2.9704 | 0.1523 |
PI_CD24352 | 341.5917 | 1.344 | 432.41 | 5.0644 | 1.8717 | 0.1572 | 1.7245 | 2.9740 | 2.7273 | 0.2467 |
PI_CD24603 | 346.5003 | 1.341 | 432.41 | 5.1372 | 1.8676 | 0.1603 | 1.7361 | 3.0139 | 2.6768 | 0.3371 |
PI_CD24495 | 423.0180 | 1.274 | 522.53 | 6.2716 | 1.4683 | 0.1499 | 1.6977 | 2.8822 | 3.5412 | 0.6590 |
PI_CD24378 | 388.1433 | 1.3 | 482.47 | 5.7546 | 1.6226 | 0.1533 | 1.7104 | 2.9255 | 3.1764 | 0.2509 |
As shown in Table 1, the molecular simulation results show that the differences (diff.) between the calculated (εDFT) and model-predicted (εQSPR) values for most of the novel PIs are within the AAE of the model, i.e., less than 0.6, indicating trustworthiness of predictions from the model. This difference may arise not only from the polymer structure, but also from errors caused by the molecular simulation method.59,60 It is noteworthy that the εDFT. of PI_CD24603 is 3.0139, which is higher than the εDFT. of the remaining 8 candidate PIs, and its potential as an energy storage material is great.
PI | Polyimide |
ε | Dielectric constant |
QSPR | Quantitative structure–property relationship |
AAG | Atomic adjacent group |
MD | Molecular dynamics |
AI | Artificial intelligence |
ML | Machine learning |
DFT | Density functional theory |
RRU | Ring repeating unit |
M | Atomic distribution matrix |
MS | Step matrix |
MP | Property matrix |
LOO-CV | Leave-out-one cross-validation |
LOPO-CV | Leave-one-polymer-out cross-validation |
LODPO-CV | Leave-one-data-point-out cross-validation |
SHAP | SHapley additive exPlanations |
RU | Repeating unit |
n D | Refractive index |
χ | Optical susceptibility |
α | Polarizability |
ρ | Density |
M w | The molar mass of RU |
ε 0 | Vacuum permittivity |
N A | Avogadro's constant |
MLR | Multiple linear regression |
R 2 | The squared correlation coefficient |
Q 2 | The squared correlation coefficient for leave-one-out cross validation (QLOO2) |
AE | Absolute error |
T g | Glass transition temperature |
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4sc05000b |
This journal is © The Royal Society of Chemistry 2024 |