Tianle Yuea,
Lei Taob,
Vikas Varshneyc and
Ying Li*a
aDepartment of Mechanical Engineering, University of Wisconsin–Madison, Madison, WI 53706, USA. E-mail: yli2562@wisc.edu
bDepartment of Mechanical Engineering, University of Connecticut, Storrs, CT 06269, USA
cMaterials and Manufacturing Directorate, Air Force Research Laboratory, Wright-Patterson Air Force Base, OH 45433, USA
First published on 28th January 2025
Molecular generative models based on deep learning have increasingly gained attention for their ability to perform de novo polymer design. However, there remains a knowledge gap in the thorough evaluation of these models. This benchmark study explores de novo polymer design using six popular deep generative models: Variational Autoencoder (VAE), Adversarial Autoencoder (AAE), Objective-Reinforced Generative Adversarial Networks (ORGAN), Character-level Recurrent Neural Network (CharRNN), REINVENT, and GraphINVENT. Various metrics highlighted the excellent performance of CharRNN, REINVENT, and GraphINVENT, particularly when applied to the real polymer dataset, while VAE and AAE showed more advantages in generating hypothetical polymers. The CharRNN, REINVENT, and GraphINVENT models were successfully further trained on real polymers using reinforcement learning methods, targeting the generation of hypothetical high-temperature polymers for extreme environments. The findings of this study provide critical insights into the capabilities and limitations of each generative model, offering valuable guidance for future endeavors in polymer design and discovery.
A large number of hypothetical polymer structures can provide a vast design space, which is crucial for the success of de novo polymer design strategies. As shown in Fig. 1(a) and (b), the PolyInfo15 database lists merely 18697 polymer structures. In comparison, around 116 million real small molecule compounds are documented in PubChem,16 while GDB-13 (ref. 17) offers more than 900 million hypothetical small molecule compounds, providing a vast chemical space for drug discovery. Fig. 1(c) illustrates that compared to the real polymer dataset (represented by orange and green lines), the ML-generated PI1M18 dataset offers researchers a large number of promising high Tg hypothetical candidates, as shown by the red line.19 Extensive research has been conducted on the de novo design of polymers, with researchers adopting various approaches, especially for proposing new hypothetical polymer structures, as summarized in Fig. 1(d).
For example, Sharma et al. employed a polymer building block approach and high-throughput density functional theory (DFT) to design organic polymers with high energy storage capabilities.20 Initially, repeat units were created using four building blocks within each unit, with each block selected from a pool comprising –CH2–, –C6H4–, –C4H2S–, –NH–, –CO–, –O–, and –CS–. These blocks were chosen due to their prevalence in polymer backbones. This was followed by a multi-stage screening process involving quantum mechanics-based searches and molecular dynamics techniques. The final phase included synthesizing and testing the most promising polymers, validating this approach for material selection. Similarly, Li et al. devised novel polysulfates by leveraging their knowledge of known polymer structures and the characteristics of functional groups.21 They then confirmed these structures' high glass transition temperature (Tg) and band gap (Eg) values through experimental synthesis and characterizations. The advantages of these two studies are that they allow for control over the structural complexity of the hypothetical polymers and enable the prediction of their overall properties based on the characteristics of functional groups or substructures. However, such a combination method of polymer building blocks becomes quite challenging when there is a desire to obtain a large number of candidates, in particular, on the order of millions.
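To give a sense of the scale such a building-block approach reaches, the sketch below chains four fragments drawn from a pool of seven into candidate repeat units and validates them with RDKit. The SMILES fragments chosen to stand in for each block, and the use of RDKit parsing as the only filter, are assumptions of this illustration; the original workflow relied on quantum mechanics-based screening rather than a script of this kind.

```python
from itertools import product
from rdkit import Chem

# Hypothetical SMILES fragments standing in for the seven building blocks, written so that
# naive left-to-right concatenation yields a linear backbone.
blocks = {
    "-CH2-":   "C",
    "-C6H4-":  "c1ccc(cc1)",   # para-phenylene linker
    "-C4H2S-": "c1ccc(s1)",    # thiophene-2,5-diyl linker
    "-NH-":    "N",
    "-CO-":    "C(=O)",
    "-O-":     "O",
    "-CS-":    "C(=S)",
}

repeat_units = set()
for combo in product(blocks.values(), repeat=4):    # four blocks per repeat unit
    p_smiles = "*" + "".join(combo) + "*"           # '*' marks the two polymerization points
    mol = Chem.MolFromSmiles(p_smiles)
    if mol is not None:                             # keep only parsable candidates
        repeat_units.add(Chem.MolToSmiles(mol))     # canonicalize to merge duplicates

print(len(repeat_units), "candidate repeat units from 7**4 = 2401 raw combinations")
```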
To obtain candidates on a larger scale, another strategy for generating hypothetical polymer structures is based on existing small molecules and known polymerization reactions (or synthetic routes). As we mentioned earlier, numerous studies have contributed various datasets, including a vast number of small molecule structures. Taking advantage of these existing small molecules, Tao et al. generated a dataset of 8 million hypothetical polyimides and uncovered polyimides that possessed a multitude of outstanding thermal and mechanical properties simultaneously.8,22 Using diamine and dianhydride monomers sourced from PubChem, hypothetical polyimides were generated following a predefined reaction route. To efficiently screen these generated compounds, a machine learning method was employed for high-throughput evaluation. In a similar vein, Wang et al. generated 110 hypothetical polyimides by utilizing diamine and dianhydride monomers, resulting in high-temperature polymer dielectrics.6 This approach can provide a large number of candidates, but its chemical space is still limited by the small molecules used.23
With the rise of deep learning, generative models, and reinforcement learning, an increasing number of researchers are utilizing deep generative methods to expand the chemical space of various materials. This trend is particularly evident in the fields of cheminformatics and drug discovery.24–47 In polymer informatics, Ma and Luo created the PI1M dataset, comprising 1 million hypothetical polymers generated using an RNN trained on actual polymers sourced from PolyInfo.18 In their study, they compiled 12000 homopolymer structures from the PolyInfo database to train an RNN model. This training enabled the generation of 1 million new polymers, collectively referred to as PI1M. It was observed that while PI1M encompasses a chemical space similar to PolyInfo, it also fills in gaps where PolyInfo data is lacking, thereby offering a more comprehensive view of the polymer landscape.
Other researchers have directly generated hypothetical polymers with tailored properties using different deep generative models. For example, Wu et al. introduced Bayesian molecular design to discover polymers with high thermal conductivity.7 Gurnani et al. employed graph-to-graph (G2G) translation, called polyG2G, which can discern subtle chemical differences (referred to as translations) leading to significant property variations in polymeric materials.48 A latent space searching strategy was employed in that study to generate hypothetical polymers with desired properties. They then used this knowledge to sample and design new polymers with high Eg and electron injection barrier. Batra et al. utilized syntax-directed VAE in conjunction with Gaussian process regression (GPR) models to identify polymers expected to exhibit robustness under extreme conditions, such as high temperatures, high electric fields, and their combination.49 Liu et al. employed an invertible graph generative model to generate hypothetical polymers with promising properties, particularly focusing on high-temperature polymer dielectrics.50 Kim et al. employed a method of searching and decoding within the latent space offered by a VAE to generate candidates with high polymer logP values.51,52 Huang et al. developed a surrogate deep neural network model to predict thermal conductivity and compiled a library of polymer units consisting of 32 sequences. They utilized two advanced multi-objective optimization algorithms, Unified Non-dominated Sorting Genetic Algorithm III (U-NSGA-III) and Q-Noisy Expected Hypervolume Improvement, for designing sequence-ordered polymers that not only exhibit high thermal conductivity but also possess feasible synthetic potential.53
When researchers intend to employ generative models in de novo polymer design, the initial step involves selecting a suitable model. However, at present, there is no work dedicated to assisting in the selection of generative models for hypothetical polymer structures. In contrast, numerous studies have been conducted to compare the performance of various models on small drug-like molecules, greatly aiding researchers in the field of drug discovery. One notable example of such a benchmarking platform is Molecular Sets (MOSES), which was developed to standardize the training and comparison of generative models for small molecules.26 Zhang et al. conducted a benchmark study with a focus on functional groups and ring systems.54 Weng et al. performed a benchmark specifically centered around biological properties.55 Recently, Nigam et al. created a set of practical benchmark tasks called “Tartarus”, which relies on physical simulations of molecular systems to emulate real-world challenges in molecular design for materials, drugs, and chemical reactions.56
Compared to small molecules, generating polymer structures involves unique complexities that demand specialized approaches and considerations. While small molecules are fully represented by their complete structures in SMILES, polymers typically consist of very large and intricate architectures. Consequently, the representation of polymers—particularly linear homopolymers—relies on identifying their repeating units and using wild cards (e.g., “*”) to denote polymerization points. On the surface, this strategy appears similar to describing small compounds. However, in practice, wild cards like “*” are not simply placeholders for arbitrary bonds. Rather, they capture specific chemical bonding patterns and the connectivity between repeating units. Therefore, models must handle the additional complexity introduced by these wild cards during the generation process. Treating “*” as a generic wild card can lead to inaccuracies in depicting polymer topologies and connectivity, resulting in invalid molecular design. As a result, conclusions drawn from studies on small molecules cannot be directly applied to the generation of hypothetical polymer structures. Therefore, there is a timely need to develop specific benchmarks and methodologies tailored to the unique challenges and requirements of the generative design of polymers.
In this study, we initially used three different polymer datasets: real polymers from PolyInfo,15 and hypothetical polyimides generated based on GDB-13 (ref. 17) and PubChem,57,58 to train six different generative models – VAE, AAE, ORGAN, CharRNN, REINVENT, and GraphINVENT. These models were trained on each dataset and generated about 10 million hypothetical polymer structures. We then evaluated these hypothetical polymer structures using the fraction of valid polymer structures fv, the fraction of unique polymer structures from a sample of 10000 f10k, the Nearest Neighbor Similarity (SNN), the Internal Diversity (IntDiv) metric, and the Fréchet ChemNet Distance (FCD). These five metrics are provided by the MOSES platform. Furthermore, the t-distributed Stochastic Neighbor Embedding (t-SNE) method was employed to visualize their chemical space distribution.
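As an illustration of the first two metrics, the snippet below computes the validity and uniqueness fractions for a list of generated p-SMILES strings. It is a minimal sketch assuming RDKit for parsing and canonicalization; the actual evaluation in this study used the MOSES implementations of all five metrics.

```python
import random
from rdkit import Chem

def fraction_valid(p_smiles_list):
    """fv: fraction of generated strings that RDKit can parse into a molecule."""
    valid = [s for s in p_smiles_list if Chem.MolFromSmiles(s) is not None]
    return len(valid) / len(p_smiles_list), valid

def fraction_unique(valid_p_smiles, k=10_000, seed=0):
    """f10k: fraction of unique canonical structures in a random sample of k valid strings."""
    sample = random.Random(seed).sample(valid_p_smiles, min(k, len(valid_p_smiles)))
    canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in sample}
    return len(canonical) / len(sample)

# Toy generated set: a styrene-like repeat unit (duplicated) plus one invalid string.
generated = ["*CC(*)c1ccccc1", "*CC*", "*CC(*)c1ccccc1", "*C(C)("]
fv, valid = fraction_valid(generated)
print(f"fv = {fv:.2f}, uniqueness on this toy sample = {fraction_unique(valid):.2f}")
```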
We further used reinforcement learning techniques, targeting the Tg, to train CharRNN, REINVENT, and GraphINVENT models to design hypothetical polymer structures with high Tg values. These three models are selected because of their outstanding performance based on the previous evaluation. All these models demonstrated success in generating hypothetical polymers with high Tg values after 1000-generation training. Overall, CharRNN provided us with the most favorable results. On the other hand, the effective hypothetical polymer structures generated by REINVENT show an outstanding distribution in the predicted values but have the lowest efficiency. The results of this study demonstrate the immense potential of generative models in the field of polymer informatics. They also provide valuable insights into the capabilities and limitations of various generative models within the realm of polymer science and engineering. This understanding is crucial for researchers when it comes to selecting the most appropriate generative model for their specific needs.
Besides the difference in the number of samples in these datasets, these three datasets also vary in molecular weight and the number of types of atoms. It is important to note that the molecular weight values mentioned refer specifically to the repeat unit molecular weight of the polymer. This distinction is crucial because the repeat units serve as the input for analysis and modeling in these studies.
Repeat units of real polymers from the PolyInfo database exhibit an average molecular weight of 443.7 and an average of 34.1 atoms per sample, encompassing 25 different types of atoms. In contrast, repeat units of hypothetical polyimides derived from PubChem show an average molecular weight of 530.4 and an average of 40.7 atoms, but with a limited variety of only 5 types of atoms. Repeat units of hypothetical polyimides created based on the GDB-13 have a higher average molecular weight of 645.8 and an average of 48.5 atoms per sample, featuring 18 different types of atoms. Tables S1–S3 in the ESI† provide a detailed count of each atom type present in these datasets. These factors could significantly impact the training and performance of generative models. Specifically, the average number of atoms directly affects the size of the strings and graphs used for network input, while the variety of atomic types influences the molecular design of polymers by different deep generative models. Utilizing these three diverse datasets enables us to better explore how different generative models perform in polymer informatics. Polymer-Simplified Molecular Input Line Entry System (p-SMILES) strings are specialized string representations used to depict the chemical structures of polymers. These strings are instrumental in data-driven tasks related to polymer discovery, design, or prediction. The format of a p-SMILES string is based on the standard SMILES syntax as defined by OpenSMILES.59 However, p-SMILES introduces a unique feature to represent polymers: it includes two stars ([*] or *) within the string. These stars signify the endpoints of the polymer's repeat unit (for linear homopolymers, there are two endpoints), effectively marking the boundaries of the repeating segment in the polymer chain.
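In practice, RDKit parses the '*' endpoints as wildcard atoms (atomic number 0), so a simple sanity check for a linear homopolymer repeat unit is that the string parses and contains exactly two of them. The helper below is a small illustrative example written for this discussion, not part of the benchmarking pipeline.

```python
from rdkit import Chem

def is_linear_repeat_unit(p_smiles: str) -> bool:
    """Return True if the p-SMILES parses and carries exactly two '*' polymerization points."""
    mol = Chem.MolFromSmiles(p_smiles)
    if mol is None:
        return False
    n_stars = sum(1 for atom in mol.GetAtoms() if atom.GetAtomicNum() == 0)
    return n_stars == 2

print(is_linear_repeat_unit("*CC(*)c1ccccc1"))  # True: styrene-like repeat unit
print(is_linear_repeat_unit("*CC"))             # False: only one endpoint
```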
At present, large-scale generative models like Generative Pre-Trained Transformers (GPT)60 have attracted widespread attention, but their scale and cost may be daunting for some researchers, particularly those who only wish to obtain some candidates in polymer design research. In these cases, smaller-scale generative models are still a more practical and accessible option. At the same time, due to the inherent differences between polymers and small molecules, such as higher complexity, larger molecular weight, and the use of p-SMILES, not all techniques applicable for generative models of small molecules are suitable for the generation of polymer structures. For example, a structural representation method like SELFIES,61 specifically designed for small compound generation, cannot represent the repeat unit structures of polymers. Additionally, models like LatentGAN,62 developed for small compound generation tasks, are not capable of processing p-SMILES strings. In this study, as shown in Fig. 2, we selected the following six networks: VAE, AAE, ORGAN, CharRNN, REINVENT, and GraphINVENT, which are briefly discussed below.
Fig. 2 Architectures of six types of deep generative models: (a) VAE, (b) AAE, (c) ORGAN, (d) CharRNN, (e) REINVENT, and (f) GraphINVENT.
Additionally, the Nearest Neighbor Similarity (SNN) was used to calculate the average similarity of the generated polymers to the closest polymer in the test set, providing an insight into how the generated polymers compared to known structures. SNN represents the average Tanimoto similarity T(mX, mY). This similarity is calculated between the fingerprints of a polymer mX in the generated set X and its closest neighboring polymer mY in the reference dataset Y:
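$$\mathrm{SNN}(X, Y) = \frac{1}{|X|} \sum_{m_X \in X} \max_{m_Y \in Y} T(m_X, m_Y)$$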
The Internal Diversity (IntDiv) metric, computed from the average pairwise similarity among generated polymers, was included to gauge the diversity within the generated polymer structures.65 IntDiv assesses the chemical diversity within the generated set of polymers X:
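$$\mathrm{IntDiv}(X) = 1 - \frac{1}{|X|^2} \sum_{m_1, m_2 \in X} T(m_1, m_2)$$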
Lastly, the Fréchet ChemNet Distance (FCD) was employed to quantify the difference in the distribution of the last layer activations of ChemNet,66 which is trained to predict bioactivities of about 6000 assays available in three major drug discovery databases (ChEMBL,67 ZINC,68 PubChem57), effectively measuring the disparity between the generated polymer distribution and a reference set.69 For two sets of polymers dataset X and dataset Y, FCD is defined as
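$$\mathrm{FCD}(X, Y) = \lVert \mu_X - \mu_Y \rVert^2 + \mathrm{Tr}\left(\Sigma_X + \Sigma_Y - 2\left(\Sigma_X \Sigma_Y\right)^{1/2}\right)$$
where μX, μY and ΣX, ΣY denote the mean vectors and covariance matrices of the ChemNet activations computed over the two sets.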
Finally, both t-SNE and Tanimoto similarity metrics were employed to assist in comparing differences between various polymer structures. t-SNE, a widely used technique for nonlinear dimensionality reduction and data visualization, effectively maintains nonlinear similarities between data points. It operates by initially determining the similarity between high-dimensional data points using a Gaussian distribution. Subsequently, it assesses the similarity among data points in a reduced, low-dimensional space based on a t-distribution. The goal of t-SNE is to minimize the disparity between these high-dimensional and low-dimensional similarities. These selected metrics will be employed in the initial phase of comparison to sieve out the generative models that demonstrate superior performance.
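A minimal sketch of such a visualization is given below, assuming Morgan fingerprints as the high-dimensional representation and the scikit-learn t-SNE implementation; the toy p-SMILES lists and the perplexity setting are placeholders rather than the data and parameters used in this work.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.manifold import TSNE

def fingerprints(p_smiles_list, n_bits=1024, radius=2):
    """Featurize parsable repeat units as Morgan fingerprint bit vectors."""
    rows = []
    for s in p_smiles_list:
        mol = Chem.MolFromSmiles(s)
        if mol is None:
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,))
        DataStructs.ConvertToNumpyArray(fp, arr)   # fill the array with the fingerprint bits
        rows.append(arr)
    return np.array(rows)

# Toy stand-ins for the training-set and generated repeat units.
training_p_smiles = ["*CC*", "*CC(*)c1ccccc1", "*Oc1ccc(cc1)C(=O)*"]
generated_p_smiles = ["*CC(*)C(=O)OC", "*c1ccc(cc1)O*", "*CC(*)C#N"]

X = np.vstack([fingerprints(training_p_smiles), fingerprints(generated_p_smiles)])

# Perplexity must stay below the number of samples; values here are illustrative only.
embedding = TSNE(n_components=2, perplexity=min(30, len(X) - 1),
                 init="pca", random_state=0).fit_transform(X)
print(embedding.shape)   # (n_samples, 2) coordinates, ready to plot and color by source set
```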
Fig. 3 A comparison of the performance of the six generative models on the real homopolymer dataset collected from the PolyInfo, as well as the chemical space distribution of the generated polymers.
In evaluating the performance of generative models using the SNN and IntDiv metrics, higher values are generally sought after. These metrics provide insights into the models' ability to generate both diverse and chemically relevant polymer structures. It can be observed that all models, except for ORGAN, exhibit results that closely resemble those in the training set.
For the FCD metric, lower values are generally preferred. This metric measures the difference in distributions between the generated polymers and a reference set, with a lower score indicating that the generated polymers are more chemically similar to real polymers. The observations indicate that, similar to the SNN metric, VAE, REINVENT, and CharRNN achieved relatively low FCD scores. AAE and GraphINVENT obtained higher scores, while ORGAN exhibited a significantly higher FCD score.
Considering all the metrics collectively, it appears that CharRNN, REINVENT, and GraphINVENT deliver the best performance, while AAE and VAE follow behind. However, ORGAN's performance leaves much to be desired. This result bears similarity to previous benchmark work based on small molecules. In MOSES, Polykovskiy et al. found that among a wide array of models, CharRNN currently outperforms others in terms of these key metrics.26 In RediscMol, Weng et al. observed that CharRNN, VAE, and REINVENT yield superior results, followed by AAE and ORGAN.55 Additionally, in studies considering ring system coverage and functional group coverage, AAE, REINVENT, VAE, CharRNN, and GraphINVENT all exhibit better performance compared to ORGAN.54
CharRNN consistently shows remarkable results in these benchmark studies, while the performance of AAE and VAE tends to be less impressive in our results. This could be attributed to the fact that the PolyInfo dataset is significantly smaller than datasets for small molecules. Additionally, the structural differences between real polymers and small molecules also play a role. The ZINC Clean Leads70 used in the MOSES project have molecular weights ranging from 250 to 350 Da.26 However, the molecular weight (of the repeat unit) in the real polymer dataset varies widely, ranging from 14 to 2202 Da. This variation is due to the presence of polymers with complex structures as well as those with very simple repeat units. For example, polyethylene, the simplest polymer, has a p-SMILES representation of just ‘*C*’. The t-SNE visualization further corroborates the analysis derived from these metrics, providing a graphical representation of how well each model captures the chemical space of polymers. The individual t-SNE results for each model can be found in the ESI† for better visual comparison.
Fig. 4 presents the performance of six different generative networks when applied to the hypothetical polyimide dataset based on GDB-13. For fv, the REINVENT model achieved the highest result, nearly equal to 1. In comparison, the AAE, VAE, and CharRNN models show a similar performance level, with their values clustered around 0.7. On the other hand, the ORGAN and GraphINVENT models have considerably lower scores, below 0.2.
For f10k, several models exhibited impressive results. REINVENT, AAE, VAE, and CharRNN all achieved a value of 1, indicating an excellent ability of these models to generate a diverse set of polymer structures, with no duplicates in a sample of 10000 p-SMILES strings. GraphINVENT, while not reaching a perfect score, still performed commendably, with its value being close to 0.9. However, ORGAN scored below 0.7, indicating less diversity in its generated polymer structures.
For the SNN metric, it was observed that, apart from ORGAN and GraphINVENT, the other four models showed similar performance. Regarding IntDiv, all models except for ORGAN exhibited closely matched performances. These observations suggest that REINVENT, AAE, VAE, CharRNN, and GraphINVENT are capable of producing a wide variety of polymer structures, demonstrating a good internal diversity among the generated hypothetical polymers.
Observations show that, similar to the SNN metric, the models AAE, VAE, REINVENT, and CharRNN achieved relatively low FCD scores. GraphINVENT recorded a somewhat higher FCD score, indicating less chemical similarity between its generated structures and the training dataset. ORGAN exhibited a significantly higher FCD score, implying a larger disparity between its generated structures and the real-world polymers.
Fig. 5 shows the performance of the same six generative models when applied to the hypothetical polyimide dataset derived from PubChem. The performance and comparative results of these six models are almost consistent with the outcomes from training on the hypothetical polyimides based on GDB-13, meaning that the REINVENT model again demonstrated the best performance. However, the performance of the ORGAN model was notably worse, to the point of being considered unacceptable for the task at hand. The specific scores for all these models and datasets are provided in ESI Tables S6–S8.†
Table 1 summarizes the performance of six models under various challenges. All the above results show that the REINVENT model shows the most favorable performance. The AAE, CharRNN, and VAE models follow closely, while GraphINVENT and ORGAN demonstrate a much worse performance. It should be noted that the performance of these six generative models varies across the three datasets, which is related to the characteristics of each dataset. Compared to the real polymer dataset, PolyInfo, the AAE and VAE show significant improvement on the hypothetical polyimide datasets. Despite the larger average atom count per repeat unit in these hypothetical datasets, which implies more complex molecular structures and longer SMILES strings, this improvement suggests that these models require a substantial amount of training data to achieve high-quality generative performance. With abundant data, they can learn more robust latent representations. Additionally, models based on variational techniques perform better on datasets derived from PubChem, which generally have simpler atomic types and lower diversity. These models excel in scenarios with limited atom types, as they likely rely more on capturing global molecular features and latent representations, which can be challenging to learn in datasets with high atomic diversity.
Models | Low data volume | High atomic count | Complex types of atoms |
---|---|---|---
AAE | ★ | ★★ | ★★ |
CharRNN | ★★★ | ★★★ | ★★ |
GraphINVENT | ★★★ | ★ | ★ |
ORGAN | ★ | ★ | ★ |
REINVENT | ★★ | ★★★ | ★★★ |
VAE | ★ | ★★ | ★★ |
GraphINVENT performs better on the real polymer dataset compared to the other two datasets. This is likely due to the real polymer dataset having a smaller average molecular weight and number of atoms, resulting in smaller and simpler graph data structures. GraphINVENT, which relies on GNNs, is inherently well-suited to handle molecular structures directly as graphs. This gives GraphINVENT an advantage on datasets with simpler graph structures, such as the real polymer dataset, where molecular weights and atomic counts are lower. However, GNN-based models may encounter challenges when dealing with larger or more complex structures, as the molecular graph of the repeat unit grows significantly in scale with the number of atoms it contains. This is also related to the network's parameter settings. For datasets with larger molecular weights, using a larger k-hop neighbor size may improve the network's performance.
As for the two RNN-based generative networks, REINVENT demonstrates outstanding performance on both hypothetical polyimide datasets, proving its ability to handle such tasks when sufficient data is available. CharRNN, however, shows weaker performance on the polyimide dataset based on GDB-13 compared to the other two datasets. RNN-based models process molecules as continuous SMILES strings. When the average size of the molecules in the dataset increases, the strings become longer, and when the diversity of atom types increases, the character vocabulary grows. Both of these factors increase the complexity of the task. The polyimide dataset based on GDB-13 has the largest number of atom types and atomic counts, making it the most challenging task for RNN-based models. This explains the performance drop of CharRNN on this dataset, while REINVENT's results demonstrate the effectiveness of its design, which first transforms discrete chemical symbols into continuous vectors before processing them through a series of LSTM or GRU layers.
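The contrast can be made concrete with a minimal PyTorch sketch of a character-level SMILES language model: discrete p-SMILES tokens are embedded into continuous vectors and passed through a recurrent layer that predicts the next token. The vocabulary, layer sizes, and the choice of a GRU below are illustrative assumptions, not the exact architectures of CharRNN or REINVENT used in this study.

```python
import torch
import torch.nn as nn

class SmilesRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=512, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # discrete tokens -> continuous vectors
        self.rnn = nn.GRU(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)      # logits over the next character

    def forward(self, tokens, hidden=None):
        x = self.embed(tokens)                             # (batch, seq_len, embed_dim)
        out, hidden = self.rnn(x, hidden)                  # (batch, seq_len, hidden_dim)
        return self.head(out), hidden

# Toy character vocabulary for p-SMILES; '^' and '$' act as start/end tokens, '*' as the endpoint marker.
vocab = {ch: i for i, ch in enumerate("^$*CNOSc()=#123[]")}
model = SmilesRNN(vocab_size=len(vocab))
tokens = torch.tensor([[vocab[ch] for ch in "^*CC(*)c1ccccc1$"]])
logits, _ = model(tokens)
print(logits.shape)   # (1, sequence length, vocabulary size)
```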
ORGAN demonstrates weaker performance across most datasets, possibly due to challenges in balancing the generative adversarial training process. GANs are known to be sensitive to training stability, especially with complex and diverse data. ORGAN may be overfitting or struggling to maintain a stable learning process, particularly on the datasets with more atomic diversity or larger molecular sizes.
We also observed that these findings align closely with the results of Zhang et al., particularly regarding ring system coverage. In their study, they utilized the GDB-13 dataset as a training set, which happens to be one of the sources we used to generate hypothetical polyimides for our research.54
In addition to p-SMILES notation for polymers, new methods for representing polymer structures have been developed as related research progresses. One example is BigSMILES notation, which provides a more robust approach for describing polymeric systems.71–77 BigSMILES is particularly advantageous for describing network polymer systems, which are challenging for p-SMILES to capture comprehensively. Therefore, we performed benchmark tests on five models (AAE, VAE, ORGAN, CharRNN, and REINVENT) using BigSMILES on a real polymer dataset. The results indicate that using BigSMILES with the ORGAN and AAE models leads to improved outcomes, suggesting that BigSMILES may be more suitable for models that incorporate a discriminator. However, performance declined with the REINVENT and CharRNN models, likely due to the increased complexity of the character set in BigSMILES. Detailed results can be found in ESI Table S9.† Since this work primarily focuses on linear homopolymers, BigSMILES does not fully demonstrate its advantages here.
Fig. 6 The core framework of reinforcement learning with a deep generative model, and the specific data flow utilized in this study.
Firstly, the generated p-SMILES strings are converted into 1024-bit Morgan Fingerprints (MFs). These MFs are then used as input to a Feed-Forward Neural Network (FNN), which is tasked with predicting the Tg values of these candidates. Detailed information about the FNN is available in the ESI.† After obtaining the Tg predictions, a sigmoid function is applied to these values. The output of this sigmoid function is treated as the reward, which is fed back to the agent. The feedback received in the form of rewards is then used by the agent to further train and optimize its performance. Specifically, the reward informs the agent's policy updates, encouraging it to generate structures that yield higher Tg values, thereby refining its sampling strategy over successive iterations (ESI Fig. S6†).
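A minimal sketch of this reward computation is shown below, assuming RDKit for the Morgan fingerprints and PyTorch for the surrogate. The layer sizes, the sigmoid centering and scale, and the randomly initialized network standing in for the trained Tg predictor (described in the ESI) are assumptions made for illustration only.

```python
import numpy as np
import torch
import torch.nn as nn
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Stand-in for the trained FNN surrogate; in practice the fitted Tg predictor would be loaded here.
tg_predictor = nn.Sequential(
    nn.Linear(1024, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

def reward(p_smiles, tg_center=500.0, tg_scale=50.0):
    """Map a p-SMILES string to a [0, 1] reward that grows with the predicted Tg; 0 if invalid."""
    mol = Chem.MolFromSmiles(p_smiles)
    if mol is None:
        return 0.0
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)   # 1024-bit Morgan fingerprint
    arr = np.zeros((1024,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    tg_pred = tg_predictor(torch.from_numpy(arr.astype(np.float32))).item()
    return torch.sigmoid(torch.tensor((tg_pred - tg_center) / tg_scale)).item()

print(reward("*Nc1ccc(cc1)NC(=O)c1ccc(cc1)C(=O)*"))   # aramid-like repeat unit as a toy input
```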
According to the previous results, particularly the comparison of generative models trained on the PolyInfo dataset, we selected REINVENT, CharRNN, and GraphINVENT as our models of choice for generative design of new polymers. Employing reinforcement learning, we used PolyInfo as the training dataset with the goal of training these models to generate hypothetical polymer structures that exhibit high Tg values.
Fig. 7 presents the performance of these three models undergoing reinforcement learning. The leftmost part of the figure shows the change in the predicted average Tg of the generated hypothetical polymer structures across training generations, along with the predicted Tg distribution of polymers generated at the 200th, 600th, and 1000th training steps as well as the training set (0th step). As training iterations increased, it was observed that the predicted Tg values of the hypothetical polymer structures generated by all three generative models showed an upward trend. Notably, CharRNN achieved the highest average predicted Tg value at the 1000th step, while REINVENT and GraphINVENT exhibited similar performance. Additionally, the distribution of the predicted Tg values for all generated hypothetical polymers shifted towards higher values. This outcome demonstrates the capability of reinforcement learning to effectively steer the generative process towards specific target properties, in this case, achieving higher Tg in the hypothetical polymer structures.
The middle panel of Fig. 7, a t-SNE plot, illustrates the area covered by the training set in grey. The red, green, and yellow points represent the chemical spaces of structures generated at the 200th, 600th, and 1000th steps, respectively. The chemical space of polymer structures generated over training epochs ranging from 0 to 1200 is detailed in ESI Fig. S7.† This color gradient visually represents the evolution of the generated polymers' chemical space throughout the reinforcement learning process.
For REINVENT and GraphINVENT, it was observed that the chemical space of the newly generated polymers remained within the bounds of the chemical space covered by the original training set. As training progressed, there was a noticeable shift from larger purple-red regions to smaller, more concentrated yellow areas. Similar to REINVENT and GraphINVENT, the CharRNN model also exhibited a gradual concentration of the chemical space of the generated polymer structures during the training process. However, a distinct behavior was observed in CharRNN's approach. Unlike the other two generative models, CharRNN began within the chemical space covered by the original training set and progressively expanded its search into chemical spaces beyond what was covered in the training set. As a result, the hypothetical polymers generated by CharRNN occupied a much larger area in the chemical space.
The right panel of Fig. 7 illustrates the Tanimoto similarity between the training set and the hypothetical polymer structures generated at the 200th, 600th, and 1000th training steps, revealing an important aspect of the training process. This observation suggests that, as the models are trained, the generated polymer structures maintain a certain level of structural resemblance to those found in the initial training set. The absence of a convergence towards zero in the Tanimoto similarity indicates that the models are not diverging significantly from the structural characteristics of real polymers.
This pattern suggests that as the number of training epochs increased, both REINVENT and GraphINVENT models started to focus on generating polymer structures within specific, more defined regions of the chemical space (exploitation). This convergence towards certain areas within the training set's chemical space could indicate that the models are focusing on regions that are more likely to yield polymers with the desired high Tg values. This demonstrates that reinforcement learning strategies are effectively guiding generative models in exploring the polymer chemical space.
Meanwhile, it was observed that the results from the REINVENT and GraphINVENT models remained within the chemical space defined by the training dataset, while CharRNN showed an expansion beyond the initial training set boundaries (exploration). It is important to note that the different chemical space distributions observed do not affect the similarity of the generated hypothetical polymer structures to the training set. This is because the hypothetical polymer structures generated by the CharRNN, REINVENT, and GraphINVENT models exhibit a Tanimoto similarity that is essentially consistent with each other. The REINVENT and GraphINVENT models are particularly adept at controlling the generated structures within the confines of the training set, making them suitable choices for researchers who desire such candidates. As for CharRNN, the expansion beyond the initial training set boundaries suggests that it was exploring more novel regions of the chemical space, potentially leading to the discovery of new polymer structures with higher Tg values. This exploration outside the known chemical space is a key factor in why CharRNN's generated hypothetical polymers had overall higher mean predicted Tg values during the training.
However, as previously discussed, the limited number of polymer structures in the real polymer dataset can lead to decreased effectiveness. Ma et al. utilized RNNs and a reinforcement learning algorithm to generate hypothetical polymer structures with high thermal conductivity. They used a significantly larger training dataset (PI1M), consisting of 1 million samples, which far exceeds the size of the real polymer dataset.18 A larger dataset provides more comprehensive coverage of the chemical space in their study, allowing models to learn a wider range of patterns and features. This can lead to the generation of more unique and diverse polymer structures, enhancing the potential for discovering novel materials with desirable properties. In their study, the visualization of generated polymers alongside real polymers from PolyInfo using t-SNE showed a pattern similar to that of CharRNN's results here. The newly generated molecules show an expansion beyond the initial boundaries of the training set, indicating exploration into new areas of the chemical space. Additionally, Moret et al. have demonstrated the generation of novel small molecules with bespoke properties and structural diversity using an RNN. The chemical space explored in their research exhibits a pattern similar to what we have observed in other studies, highlighting the RNN's capability to navigate and innovate within the chemical space.78
Furthermore, graph-based generative networks for polymers, such as PolyG2G, have also exhibited outstanding performance.48 Instead of using reinforcement learning, their network employed a latent space searching strategy to generate hypothetical polymers with desired properties. The same concept of latent space utilization is also evident in VAE-based approaches, which include a latent space as an integral part of their architecture.49,52 Similarly, Liu et al. utilized a graph-based invertible molecular generative model along with a latent space strategy for the design of high-temperature polymer dielectrics.50 Observations of these two graph-based generative models employing latent space strategies reveal that the main frameworks of the generated repeat units often bear resemblance to certain structures within the training set. This similarity might be a contributing factor to the close alignment of the chemical space of the generated hypothetical polymers with that of the training set.
The size of the training dataset also significantly impacts the efficiency and uniqueness of the hypothetical polymer structures generated by these models, both of which are critical factors for practical applications. Fig. 8 illustrates the comparison of these two metrics – the efficiency and non-redundancy rates – for the three networks at the 200th, 600th, and 1000th steps, as well as their overall trends throughout the training process.
Fig. 8 Effect of iteration steps on the validity and redundancy rates of three generative models, CharRNN, GraphINVENT, and REINVENT, during the reinforcement learning process.
The models at the 1000th training iteration were then used to generate 100000 hypothetical polymers. Fig. 9(a) displays the normalized probability density distribution of predicted Tg values of these hypothetical valid unique polymers. When the models are employed to generate a large number of hypothetical polymer structures, there is a slight shift in the mean prediction values. It is evident that the probability distributions of the three generative models are significantly different from the training set, favoring higher Tg values. Generative models based on RNN and GNN architectures have been effectively used to directly create hypothetical polymer structures with desired properties, achieving commendable results. Among them, REINVENT has the highest mean and the smallest variance in its probability density distribution, indicating that it can more stably generate many high Tg hypothetical polymer structures. CharRNN and GraphINVENT are less effective in comparison. However, it is important to note, as shown in Fig. 8, that the unique rate and validity rate of the REINVENT model are relatively low. In contrast, CharRNN is considered the best option.
Fig. 9(b) showcases the chemical space distribution of these hypothetical, valid, and unique polymers. The results align with those presented in Fig. 7, showing that among the three generative models, CharRNN generates hypothetical polymers (red points) that are the most distinct from real polymers (grey points) in terms of distance. In contrast, the polymers generated by the other two models are interspersed within the distribution of real polymers.
From these results, it is evident that CharRNN demonstrates a distinct advantage in both efficiency and uniqueness. This superiority is likely connected to its broader exploration of the polymer chemical space. As previously discussed in comparisons and discussions of various generative models, CharRNN has shown the best performance with real polymers collected from PolyInfo and a variety of small molecule tests. While REINVENT exhibits the best normalized probability density distribution, its unique rate should be considered. Hence, REINVENT becomes the optimal choice specifically when there is a requirement to generate a substantial volume of candidates, ranging from hundreds to thousands.
Subsequently, CharRNN, REINVENT, and GraphINVENT were combined with reinforcement learning algorithms using the PolyInfo dataset to generate hypothetical polymer structures with higher Tg values. All three models performed impressively, but with notable differences in their capabilities. CharRNN displayed a unique ability to extend beyond the chemical space of the training set, generating polymers with higher predicted Tg values. After training, REINVENT demonstrates the most outstanding probability distribution in its generated results. However, compared to CharRNN and GraphINVENT, it has lower unique and valid rates.
The study underscores the need for specific benchmarks and methodologies tailored to the unique challenges of polymer design. The integration of reinforcement learning proved effective in guiding the generative process toward the desired properties, highlighting the potential of these models in future materials design and discovery. This work also leverages the power of computational modeling and machine learning, paving the way for more targeted and efficient development of new polymeric materials, such as organic photovoltaics, polymer membranes, and dielectrics.
The CharRNN model integrated with reinforcement learning was utilized, with its code accessible at the GitHub repository (version b112b811616bee01fb3348e867b7406e4e6a62f4): https://github.com/aspuru-guzik-group/Tartarus/tree/b112b811616bee01fb3348e867b7406e4e6a62f4. The REINVENT model (version 99b8f28c2a76196017eabf23118195ae546f5714), incorporating reinforcement learning, was employed, with its code available at https://github.com/MolecularAI/Reinvent/tree/99b8f28c2a76196017eabf23118195ae546f5714. The GraphINVENT model, integrated with reinforcement learning (version d4629a3c411c793e1ed1682592d5bf67937564a1), was utilized in this study. Its code is accessible at https://github.com/olsson-group/RL-GraphINVENT/tree/d4629a3c411c793e1ed1682592d5bf67937564a1. This ongoing process of generation, evaluation, and feedback allows the generative models to progressively improve in their ability to design hypothetical polymer structures that closely match the targeted properties, thus enhancing the efficiency and effectiveness of the materials design process.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00395k
This journal is © The Royal Society of Chemistry 2025