Kaleigh A. Curtisac,
Antonia Statt
b and
Wesley F. Reinhart
*ac
aDepartment of Materials Science and Engineering, Pennsylvania State University, University Park, PA, USA. E-mail: reinhart@psu.edu
bDepartment of Materials Science and Engineering, Grainger College of Engineering, University of Illinois Urbana-Champaign, Champaign, IL, USA
cInstitute for Computational and Data Sciences, Pennsylvania State University, University Park, PA, USA
First published on 17th February 2025
Sequence-controlled copolymers can self-assemble into a wide assortment of complex architectures, with exciting applications in nanofabrication and personalized medicine. However, polymer synthesis is notoriously imprecise, and stochasticity in both chemical synthesis and self-assembly poses a significant challenge to tight control over these systems. While it is increasingly viable to design “protein-like” sequences, specifying each individual monomer in a chain, the effect of variability within those sequences has not been well studied. In this work, we performed nearly 15000 molecular dynamics simulations of sequence-controlled copolymer aggregates with varying level of sequence stochasticity. We utilized unsupervised learning to characterize the resulting morphologies and found that sequence variation leads to relatively smooth and predictable changes in morphology compared to ensembles of identical chains. Furthermore, structural response to sequence variation was accurately modeled using supervised learning, revealing several interesting trends in how specific families of sequences break down as monomer sequences become more variable. Our work presents a way forward in understanding and controlling the effect of sequence variation in sequence-controlled copolymer systems, which can hopefully be used to design advanced copolymer systems for technological applications in the future.
In this vein, there has been increasing interest in sequence-controlled or protein-like copolymers as their aggregation behavior can be carefully tuned. The design of single-chain aggregation has been studied for more than 20 years,12 with modern design methods assisting in both accelerating and broadening the search.13–16 Some recent efforts have also been directed toward melts or aggregates of sequence-controlled copolymers,17–19 though this remains somewhat less prevalent compared to studies of single chains. In our own prior work,20,21 we studied the self-assembly of a model sequence-controlled copolymer into a large variety of different aggregate structures and showed that changing even a single monomer in a 20-mer may significantly alter the self-assembled morphology.
While the range of possible morphologies is known, it is challenging to obtain highly sequence-controlled polymers in practice. Chain growth polymerization is statistical in nature and provides imprecise control over the final sequence.22 Some causes of this are side reactions, such as chain transfer and radical coupling, which can be unavoidable. Reversible-deactivation radical polymerization techniques have been shown to generate more accurate sequences, especially in recent years, but they, too, cannot achieve perfect accuracy.23 Some recent developments in synthesis have yielded new levels of sequence control,24–29 but exact control over every monomer in the system still seems unlikely.
Since experimentally synthesized polymers are imperfect (i.e., not identical in length and monomer sequence) due to this inherent randomness in chain synthesis and self-assembly, the polymers community has invested some effort in developing descriptions that account for such variability. One example is BigSMILES,30 an extension of the popular SMILES notation31 for chemical structure. BigSMILES supports “stochastic objects” such as side chains or repeat units that vary statistically due to polymerization chemistry. These stochastic objects are machine-friendly representations of the natural variations in polymer sequences (or, more generally, structures) and allow the encoding of ensembles of similar molecular structures that vary in random but well-behaved ways.
In this work, we extend our previous investigation20,21 to monodisperse (a term we reserve for describing chain length) but non-identical (stochastic) chains, such that each aggregate contains an ensemble of similar sequences derived from a common “template” sequence. We use the term template to refer to the intended or as-designed monomer sequence to clarify that the monomer sequence within each chain in the simulation may vary from that template copolymer. This variation results in an ensemble of monodisperse 20-mers (as in our previous work) whose monomer sequence (and composition) vary in prescribed ways.
The self-assembly of these chains with stochastically varying monomer sequences fundamentally differs from that of either a single chain or ensembles of identical sequence-controlled copolymers, as in our prior work. These stochastic variations pose even greater challenges to conventional methods for modeling macromolecular aggregation such as self-consistent field theory (SCFT),32–36 which cannot account for stochastic sequence variation at the monomer level, especially monomer-level sequence variation within one simulation. Our previous work21 demonstrated the predictive capability of data-driven models to rapidly screen from over 60000 possible monomer sequences to identify the desired morphology. This framework significantly reduces the computational demands previously required for sequence-controlled polymer design. However, our prior work and others17–19 used simulations of perfectly monodisperse and identical polymer chains within the system (all chains of identical length and monomer sequences), which limits their translation to presently available synthesis techniques.28
We simulate the self-assembly of ensembles of sequence-controlled copolymers with stochastic sequence variation using molecular dynamics (MD), a scenario that has received little attention in the published literature. To this end, we generated a large dataset of different template sequences under varying levels of sequence variability, yielding nearly 15000 simulation trajectories. These trajectories were analyzed with our previously established dimensionality reduction technique,20 providing new insight into how these variations in monomer sequence affect the self-assembled morphologies. We also applied supervised learning to model the effect of stochastic sequence variation as a function of template morphology and degree of stochasticity. Our model's prediction error when deployed on unseen test data was comparable to the intrinsic uncertainty resulting from the self-assembly process, indicating excellent performance. Finally, we used this model to quantitatively describe the sensitivity of different self-assembled morphologies to sequence variation.
Our previous work20,21 was limited to completely identical, monodisperse chains. Here, in contrast, stochasticity was introduced by allowing each monomer bead from each template sequence in the simulation to randomly (and independently) flip at a fixed probability p (Fig. 1a). Sequences were independently randomized, giving up to 500 unique sequences (although some independent randomization may coincidentally result in the same sequence). Note that these events resulted in a random selection between A and B type such that p = 1 generated completely random sequences (if p is defined as the probability of changing type then p = 1 would simply yield the sequence with all A replaced by B and vice versa). By defining p as a probability for a bead to flip if and only if the bead succeeds on a coin flip, we avoid having the inverse sequences at p = 1. The inherent stochasticity observed in the self-assembly process20,21 was investigated by creating five replicas of each set of input parameters (i.e., template sequence and p).
The relationship between our p parameter and real-world copolymer systems deserves some consideration. Experimental systems with well-defined reactivity ratios might exhibit some characteristic p, but this would result in variations in monomer sequences, whereas our coarse-grained model has variations in groups of monomers (multiple monomers making up each bead). Also, our coarse-grained system assumes no selectivity in the reaction, which makes the behavior a bit simpler. These effects could be represented within a similar coarse-grained model through additional bead types representing blends of monomer units in different ratios and consequently incorporating a matrix of transition probabilities that depend on the base sequence. We leave this additional complexity to future study, but felt the need to clarify these aspects for the sake of the reader.
![]() | (1) |
![]() | (2) |
![]() | ||
Fig. 2 A representative subset of the 14![]() |
![]() | ||
Fig. 3 Natural variability of Z order parameters throughout the latent space. (a) RMSD for each sequence at each p value, and (b) RMSD as a function of p. |
When introducing stochastic sequences (i.e., p > 0), we find that morphologies tend towards a center point, which we refer to as ZR ≈ (5,1). Such a point must exist because when p = 1, all sequences are equivalent (completely random). This behavior is illustrated for some representative sequences in Fig. 4; we choose sequences for which Z° are on the periphery of the manifold to best illustrate their breakdown from “archetypal” morphologies into the common random morphology. As discussed in our prior work, ZR must be located in the center of the manifold to maximize the distance from the recognizable structures around the periphery.
![]() | ||
Fig. 4 Selected p series from the periphery of the manifold: initially liquid (A) and (B), structured liquid (C)–(E), spherical micelles (F) and (G), wormlike micelles (H), membranes (I) and (J), and vesicles (K). The color scheme is the same as Fig. 2 and ref. 20, with each bead being colored in RGB space according to its position in the local environment manifold. |
On the other hand, the appearance of common pathways on which several different peripheral sequences collapse is not expected. This can be seen especially with the structures labeled K, A, and B, where and ZB,p=0.4 ≈ ZA,p=0.2 ≈ ZK,p=0.1. This indicates that introducing finite p to some sequences drives them towards the template state of others, though no obvious trend appears to the authors in evaluating the sequences by eye: K, AABAABBAAABBBAAABAAB; A, ABABAABBBAABAAAABAAB; B, ABAABAABBAABABABAAAB.
We will hereafter refer to the collection of Z embeddings for a single sequence at different p as a series. Note that the various series shown in Fig. 4 are independently equilibrated at each p, so there is no hysteresis, and the use of lines to connect the different snapshots in the figure is purely for clarity in associating different points with the same series. Furthermore, the figure shows only one of the five replicas simulated for each (sequence, p) pair, so the plot appears quite noisy; averaging over the replicas helps resolve this noise to reveal more consistent trends.
Another view of the same series is shown in Fig. 5. This time, the series is laid out on a grid, with rows representing a common sequence and columns representing a common p. This view clearly shows the gradual convergence of the different sequences to a common, randomized morphology as p → 1. Each panel in Fig. 5 corresponds to a point along the lines in Fig. 4, where tracing the lines from exterior to interior is equivalent to following the series from left to right along a single row.
![]() | ||
Fig. 5 Grid view of the same series selected in Fig. 4, including simulation snapshots at more p values. The color scheme is the same as Fig. 2 and 4, and ref. 20, with each bead being colored in RGB space according to its position in the local environment manifold. |
In evaluating Fig. 4 and 5, we found that spherical micelles (F, AAAAAAABABAABABBBABB) could tolerate more sequence variability than any other structure, remaining micelle-like for p < 0.5. This property extended to adjacent structures E (AABAAAAAAABBAABABBBB) and G (AABBABABBAAAAAAABBAB), which were also able to sustain their approximate structure for p < 0.4. On the other hand, membranes (J, AABBBAABABAAABAAAABB) broke down as soon as p > 0. We analyzed the magnitude of deviation from each sequences template Z° in Fig. 6. While most sequences could tolerate p = 0.1 variability without substantial morphology change, there were some notable outliers, including the liquid, vesicle, and membrane regions of the manifold. These outliers show the structural instability of these three aggregate morphologies when undergoing sequence variation and show that even small changes in sequence can lead to large structural changes.
We hypothesize that the relative structural stability of spherical micelles under greater sequence variation is due to the longer block sequences present in the spherical micelle sequence that remain comparatively unaffected by minor sequence variation. Sequences defined by patterns of long blocks still contain comparatively long blocks when variability is introduced, and thus, the overall pattern remains intact. For sequences defined by patterns of short blocks, small changes in sequence disrupt the pattern more, and the short blocks in the simulation look different both from the template sequence and from each other.
This can be verified by evaluating the sequences most and least prone to deviation in Z. For the p = 0.1 case, the template sequences that are most sensitive to sequence variation are AAABAABAABAAAABABBBB, AAABAABAABAAAABABBBB, and AAABAABAABAAAABABBBB, and the least sensitive are ABBAAABAAABAAAABBBBA, ABAABABABBABAABAAAAB, and ABBAABABAABAABAABAAB. A value of p = 0.1 corresponds to an average of one flipped bead per chain (two selected for flipping, probability of 1/2 to flip). This means that the difference between most and least sensitive chains involves disruptions to the patterns of only a single bead on average. Therefore, it is not surprising that it is difficult or impossible to identify features of these sequences that lead to this result by eye. Instead, like in our prior work,21 we must rely on data-driven models to identify such patterns reliably.
The resulting model allows us to predict the evolution of a series (i.e., fixed sequence, varying p) through the manifold, as shown in Fig. 7. There, we view the trajectories of archetypal (i.e., Z° on the manifold periphery) aggregate structures through the manifold as a function of p. As in Fig. 4, we see the model predicting some preferred paths, namely the S-shaped curve from the liquid region to the area of most randomness. It is interesting that the sequences take a non-linear path, as this indicates that there is some preferential architecture in this area; an explanation for this preference requires further investigation.
We analyzed the model error as a function of p and Z, shown in Fig. 8 and 9, respectively. The error has a slight peak around p = 0.3 but otherwise remains relatively flat for all p > 0, demonstrating that the model accurately predicts the influence of p. Note that the training error for all p is just above the RMSD of approximately 1.0 identified above, indicating that the NN model is learning the trends in the training data almost perfectly. The validation and test errors are slightly higher, but the standard deviations are so large that this effect is relatively small. We also analyzed the error as a function of Z to identify possible bias towards certain morphologies. There is no discernible pattern in Fig. 9, indicating no difference in the error rate across different sequences or structures. It is noteworthy that the model seems to perform equally well in previously identified outlier regions (e.g., from Fig. 6).
![]() | ||
Fig. 8 Model RMSE as a function of p for the training, testing, and validation datasets. Error bars indicate standard deviation. |
![]() | ||
Fig. 9 Spatial representation of model RMSE across the manifold for the training, testing, and validation datasets, colored by the error. |
![]() | ||
Fig. 10 Norm of the trained model gradients across the manifold for different values of p, colored by the norm (i.e., eqn (2)). |
We generated a dataset of 14795 MD simulation trajectories of 259 monomer sequences with 11 variations of p ranging from 0 to 1 in intervals of 0.1, representing nearly 2000 GPU hours of compute time on NVIDIA A100 GPUs. An unsupervised representation learning technique developed in our recent work20 was applied to learn relevant order parameters. Supervised learning was then used to predict the effect of independent random sequence variation at rate p within each chain via a shallow NN. This model was then analyzed to identify patterns across the entire morphology space and to quantitatively describe the simulation results.
From the MD simulations, we found that liquids, membranes, and vesicles were more sensitive to sequence randomness than other aggregate structures, as indicated by the large number of statistical outliers present in these regions when analyzing the morphology deviation as a function of p. This observation is consistent with the explanation that sequences defined by small blocks have relatively high sensitivity to single monomer edits compared to sequences defined by large blocks.
We also observed that the series tended to progress through the manifold between Z° and ZR via a relatively jagged and stochastic path, requiring averaging over many replicas in order to obtain reliable trends. We have previously shown that the Z order parameters are smooth within the dynamics of a single simulation,20 so when it occurs, this shows that small perturbations to the sequence can lead to relatively large changes in morphology.
Despite the stochastic nature of the self-assembly process, our supervised NN performed well, resulting in a validation RMSE of 1.04, slightly lower than the observed RMSD of 1.07 from simulation replicas; this indicates that performance is on par with the intrinsic variance from self-assembly. The model showed no bias toward either particular p (aside from p = 0 performing well) or Z, indicating that it is equally valid across the entire manifold. The model also revealed some interesting contours in the trajectories of sequence embeddings throughout the manifold. Preferred pathways were visible, indicating collapse onto locally common aggregate structures during randomization. The underlying reason for these preferred intermediate morphologies will require further study. Spatial analysis of model gradients showed the largest structural shifts around the liquid, membrane, and vesicle areas of the manifold, quantitatively describing the previously observed sensitivity to sequence variation in these areas.
We have shown that stochasticity in chemical sequences can profoundly affect self-assembly and that this affects different sequences to very different degrees depending on their characteristics. This is an important step towards understanding the potential and limitations of sequence-controlled polymers with imperfect chemical sequences, as typically synthesized in the laboratory. We have demonstrated that our previously developed framework can be applied to study stochastic sequences and, together with the supervised learning approach deployed here, may be useful for probing the robustness of a particular morphology against sequence variations. We believe this is an important design consideration as computational tools converge with experimental applications. In future work, we will consider more complex chain structures, including varying lengths and sequences simultaneously, further to align our models with attainable, real-world chemical systems.
This journal is © The Royal Society of Chemistry 2025 |