Rose K.
Cersonsky
*a,
Maria
Pakhnova
a,
Edgar A.
Engel
b and
Michele
Ceriotti
a
aLaboratory of Computational Science and Modeling (COSMO), École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland. E-mail: Rose.Cersonsky@wisc.edu
bTCM Group, Trinity College, Cambridge University, Cambridge, UK
First published on 16th January 2023
Due to the subtle balance of intermolecular interactions that govern structure–property relations, predicting the stability of crystal structures formed from molecular building blocks is a highly non-trivial scientific problem. A particularly active and fruitful approach involves classifying the different combinations of interacting chemical moieties, as understanding the relative energetics of different interactions enables the design of molecular crystals and fine-tuning of their stabilities. While this is usually performed based on the empirical observation of the most commonly encountered motifs in known crystal structures, we propose to apply a combination of supervised and unsupervised machine-learning techniques to automate the construction of an extensive library of molecular building blocks. We introduce a structural descriptor tailored to the prediction of the binding (lattice) energy and apply it to a curated dataset of organic crystals, exploiting its atom-centered nature to obtain a data-driven assessment of the contribution of different chemical groups to the lattice energy of the crystal. We then interpret this library using a low-dimensional representation of the structure–energy landscape and discuss selected examples of the insights into crystal engineering that can be extracted from this analysis, providing a complete database to guide the design of molecular materials.
Yet, molecular crystallization is a complex process that involves multiple cooperative and competing forces. Initial nucleation is typically motivated by strong interactions between functional groups.10,11 The structural patterns associated with these guiding interactions (deemed “supramolecular synthons”) and their hierarchies are often the focus of experimental and computational studies in crystal structure prediction.12,13 Nevertheless, once molecules have moved within closer range, many factors, including weaker interactions, the expulsion of solvent molecules, and geometric packing, will then determine the short- and potentially long-range order, leading to many potentially-stable polymorphs for a given stoichiometry. In the past decades, there has been a growing push to develop a “holistic” view of molecular crystallization,14,15 not only taking into account the nearest-neighbor contacts but also the interplay of these interactions with other components of the molecular assembly.
While it is simpler to rationalize single-site interactions, the interplay of many competing interactions necessitates diverse, high-throughput studies.14 Thus, molecular crystallization has emerged as a hotbed for computational inquiry. This focus has led to considerable theoretical and software developments for qualitative and quantitative analyses, including those tailored to crystal structure prediction (CSP)16–19 and the representation of electrostatic surfaces and molecular geometry.20,21 Even more recently, machine learning has been used to understand the individual configurational and energy landscapes of molecules;22–28 however, such techniques have yet to be applied in the general, holistic vein required to extract the qualitative insights that can be used to support crystal design efforts.
To study molecular crystallization in this broad lens, we have curated a dataset of roughly 3260 C + H + N + O + S-containing molecular crystals from those reported in Cordova et al.29 In Cordova et al.,29 these crystals were initially selected by querying the Cambridge Structural Database (CSD) to identify a diverse set of synthesizable molecular assemblies, including some that had been experimentally stabilized at extreme conditions. The experimental properties of the full dataset are summarized in ESI Appendix A3.†
The stability of molecular crystals is traditionally studied through the binding (lattice) energy, which is computationally determined by evaluating the ground-state energies for both the crystal and its molecular components in the dilute gas limit, here computed using DFT-PBE-D2 calculations of each crystal and its relaxed molecular components at ambient conditions.
From here, we build an atom-centered regression model for this lattice energy, demonstrating the improvements in accuracy and reduction in model complexity from using a physics-informed approach. This atom-centered approach, wherein we represent each molecular crystal using the average ML descriptor for each of its atomic constituents, facilitates estimating the contribution of each atom, or group of atoms, to the binding energy. Then, employing a combination of supervised and unsupervised machine learning models, we can determine and interpret each molecular moiety's intermolecular interactions. Using these approaches, we show how physically-motivated machine learning models can not only “rediscover” the known maxims of crystal engineering, but provide insight and guidance for crystal design. We have made our datasets, and analyses openly-available through the Materials Cloud,31 with interactive components aimed to guide future molecular design and narrower or targeted studies.
The descriptor for a given collection should be assumed as the average of the descriptors for the constituent atoms:
(1) |
For example, the descriptor for the atoms in a molecule in a dilute gas is . If we were to look at the same molecule in the crystalline solid we would get . A schematic of these concepts is shown in Fig. 1, using the co-crystal 5-aminotetrazole monohydrate (CSD ref. AMTETZ30) as an example.
Fig. 1 Visualization of descriptor notation, as described in Section 2.1, visualized for 5-aminotetrazole monohydrate (CSD ref. AMTETZ30). Each descriptor contains the information of an atom and its neighborhood (shown in yellow shading). |
eσ = xσwσ + εσ | (2) |
(3) |
With the average lattice energy per atom given by,
(4) |
Later, we will use our regression model to determine the atomic contributions to the lattice energy, which we will denote δa, where . We will also consider the contributions for different collection of atoms, and will denote the average lattice energy contribution as . When we regularize these contributions using a Gaussian filter (discussed in Section 3.2 and Appendix B2†), we will use a tilde to give .
(5) |
(6) |
Eqn (6) may then be rewritten as:
δc = x(s)cwc − x(g)cwm + ε | (7) |
When we predict the crystal and molecular atomic energies ec and em, we obtain RMSEs of 1.15 kJ mol−1 and 0.727 kJ mol−1, respectively, which are acceptably small compared to the intrinsic variance of the baselined‡ target energies of the test set, which have standard deviations of 4.402 kJ mol−1 and 4.251 kJ mol−1, respectively. However, the intrinsic variance of the lattice energies is smaller (1.965 kJ mol−1); therefore, the resulting RMSE of 0.916 kJ mol−1 from eqn (6) is unsatisfactory and suggests that the errors in the independent regressions generally overlap with the lattice energy contributions.
Yet, conceptually, neither of these two representations (x(s)c and x(g)c) contain the full set of relevant information – the molecular descriptor x(g)c is missing information on intermolecular interactions, and the crystal descriptor x(s)c is unaware of the conformational changes that the molecules undergo upon crystallization. The necessity of this missing information is confirmed when we regress on concatenated descriptors {x(s)c, x(g)c} and our RMSE drops to 0.671 kJ mol−1.
Furthermore, eqn (7) provides another way to similarly (and more explicitly) encode the nature of the problem into our choice of representation. Given a descriptor that appropriately distinguishes between periodic crystals and molecules, a regression model can predict their energies using the same regression weights, w = wc = wm. Substituting this into eqn (7),
δc = x(s)cw − x(g)cw + ε | (8) |
δc = x(s−g)cw + ε | (9) |
When the molecular geometry is known a priori, these results suggest that linear (summarized in Table 1) and non-linear regressions (ESI Appendix C2†) for the lattice energy should be built on descriptors conceptually akin to x(s−g)c, rather than x(s)c, as has been common practice in the literature.23,25,27,28 Thus, in the remainder of the text, we will employ ML fingerprints and models based on the remnant descriptor.
Regression equation | Eqn | RMSE | MAE |
---|---|---|---|
a Here we report the errors (in kJ mol−1) on a separate set of 551 crystals (or the coinciding 628 molecules). In each regression equation w is unique to that regression. | |||
e c = x(s)cwc | (2) | 1.15 | 0.863 |
e m = x(g)mwm | (2) | 0.727 | 0.563 |
(6) | 0.916 | 0.652 | |
δ c = x(s)cw | 0.778 | 0.552 | |
δ c = x(g)cw | 1.101 | 0.723 | |
δ c = x(s–g)cw | (9) | 0.571 | 0.404 |
δ c = {x(s)c, x(g)c}w | 0.671 | 0.461 |
δa = x(s−g)aw | (10) |
Despite the mathematical logic behind this step, the lack of physical underpinnings for this decomposition may result in energy being arbitrarily partitioned between neighboring atoms. This leads to disproportionately large contributions of opposite size being assigned, not dissimilar to how a regression may overfit by assigning large regression weights. To ease this effect, we can apply a Gaussian filter to each δa. For the ith atom, this results in
(11) |
(12) |
We plot the span of lattice energy contributions for motifs with greater than 200 instances in the dataset in Fig. 2. The functional groups are arranged in order of increasing average cohesive interactions. Nearly all functional groups, on average, are stabilizing, although we see a clear trend in the nature of the functional groups from left to right. On the left (the motifs leading to the strongest intermolecular interactions), there are groups typically associated with hydrogen bonding (e.g., carboxyls and waters). As we move to the right, the molecular motifs are, on average, weakly binding, with the largest range of interactions coming from the most broadly-defined groups, including the alkanes, alkenes, and benzene-like rings.
Fig. 2 Distribution of energetic contributions for different functional groups. For each functional group, we have taken the averaged remnant descriptor x(s−g)σ and computed the estimated contribution to the binding energy σ using the regressions detailed in eqn (10) and the filtering procedure in eqn (11). We have arranged the functional groups in order of average contribution, with a representative example is shown above or below the violin plot with the functional group highlighted. We have limited this figure to those functional groups with more than 200 instances in the dataset (see Fig. S4† for all groups). The lines on each plot denote each group's extreme and mean contributions. The plots are colored by the number of examples within the dataset, ranging from 4 (pentazole) to 5313 (methyl groups). Wider sections of the violin plot represent a higher probability that members of the population will take on the given value; the skinnier sections represent a lower probability. |
This trend is further demonstrated by plotting the structure–property map of all motifs using Principal Covariates Regression (PCovR), a hybrid supervised-unsupervised dimensionality reduction technique first introduced in De Jong and Kiers40 and adapted to chemical systems in Helfrecht et al.41 This technique produces a latent-space mapping that arranges different motif classes based on their structural similarity and correlation to a set of target properties. In Fig. 3, we show a map using the average remnant descriptor for each motif and their average energy contribution, using contour lines to show where 90% of such motifs fall on the PCovR map. One sees that, in this case, the first axis of this plot (PCov1) correlates strongly with the (learnable) cohesive interactions. The second axis (PCov2) allows us to resolve structural differences between motifs with similar energetic contributions. In this mapping, we can learn from the spread of each group. For example, the 868 water molecules (light blue in Fig. 3) span the greater portion of the left-hand side of the figure, highlighting the chemical diversity of intermolecular water interactions. Juxtapose this with the 2627 nitro and nitroso groups (pink in Fig. 3) that span a smaller region in PCovR space, implying a narrower range of intermolecular interactions. Here we have combined several groups for visual simplicity; however, we have included plots highlighting each functional group in Fig. S6–S9,† including the sample sizes and range of contributions.
The PCovR framework also provides a blueprint for analyzing the interactions of different structural motifs – given a single motif type, what characteristics of a molecular environment lead to a more stabilizing interaction? In the following sections, we will take a look at the stabilizing environments for a few classes of functional groups, starting with the well-known stabilizing interactions of water and carboxylic groups, then moving onto two groups with a wide range of intermolecular interactions, 6-membered aromatic carbon rings and nitro groups. With each functional group, we generate a new PCovR using only the averaged remnant descriptors and effective interactions for the instances of that group, such that the structural diversity embedded in the map reflects the diversity of interactions, rather than the diversity of the molecules. We have included similar maps for all other molecular motifs in an online data repository.31,48
Fig. 4 The interactions of water molecules. (Left) Principal Covariates Regression (PCovR) map, where the color of each point denotes the estimated cohesive interaction of that motif and a marker on the color bar denotes the average value for all waters. (Right) The PCovR map recolored by the number of hydrogen bonds (H-bonds), separating those donated to the oxygen atom (top) from those donated by the hydrogen atoms (bottom). The insets on the bottom visualize several extremal or interesting environments. Select crystalline configurations and energy assignments: CSD ref. (a) SOWTIH,42 (b) GIXDIA,43 (c) LEBJUX,44 (d) LACTOS12,45 (e) VOHBUR,46 and (f) COFHOW.47 In each panel, the bottom row shows the total lattice energy of the crystal (in kJ mol−1) and the corresponding molecules where the atoms have been recolored by their estimated lattice energy contribution (on the same scale as on the PCovR map). |
First, we look at a common parameter for measuring the stabilizing effect of water: hydrogen bonding (H-bonding). Here, we have calculated H-bonds based on when the O⋯H or H⋯X distance is less than 2.5 Å and the dihedral angle of O⋯H–X or OH⋯X is greater than 150°. From the right side of Fig. 4, we see that the number of H-bonds donated to the water molecule (O⋯H) does not correlate with the cohesive interaction of the water molecules. There is some qualitative correlation/anti-correlation between the nature of these donated H-bonds and the second principal covariate (Pearson Correlation Coefficient, or PCC, = 0.49, −0.59 for the number of O⋯H–N and O⋯H–O, respectively). There is a mild anti-correlation between the number of H-bonds the water itself donates (OH⋯X) and the first covariate, with a PCC of −0.33. The second principal covariate is strongly correlated and anti-correlated with the number of OH⋯N and OH⋯O interactions, achieving a PCC of 0.69 and −0.73, respectively. Waters with primarily OH⋯N-type hydrogen bonds are at the top of the map (e.g., Fig. 4(e), CSD ref. VOHBUR46), with OH⋯O-type at the bottom of the map (e.g., Fig. 4(c) and (d), CSD ref. LEBJUX44 and LACTOS1245).
This analysis emphasizes that the number of hydrogen bonds does not fully capture all of the nuances of water stabilization – the majority of water molecules participate in 2–3 such interactions, and the energy of these bonds can span a wide range. In O⋯H–X interactions, there is little energetic difference based on whether the acceptor is a nitrogen or oxygen atom – both types of hydrogen bonds span the full range of energies. The nature of the acceptor is encoded in the covariate orthogonal to the chemical features most correlated with interaction strength (i.e., the nature of the acceptor is primarily correlated with the second covariate).
We see that the strongest water interactions in 1,6-diaminohexane monohydrate (CSD ref. SOWTIH,42Fig. 4(a)) and 1,3-diaminopropane trihydrate (CSD ref. GIXDIA,43Fig. 4(b)), where the water molecules associate with other water molecules and the amine group of their co-crystalline molecule. Our weakest contribution, by far, occurs in 4,5,6,7-tetranitro-1,3-dihydro-2H-benzimidazol-2-one hemihydrate (CSD ref. COFHOW,47Fig. 4(f)), where the water molecules sit interstitial to the imidazole molecules, prohibited from forming hydrogen bonds and potentially interfering with the stabilization of the imidazole clusters.
Fig. 5 The interactions of carboxylic acid groups. (Left) Principal Covariates Regression (PCovR) map, where the color of each point denotes the estimated cohesive interaction of that motif, and a marker on the color bar denotes the average value for all carboxylic acid groups. The insets visualize several extremal or interesting motifs. (Right) Select crystalline configurations and energy assignments: CSD ref. (a and b) FIBHOP,49 (c) IJETOG02,50 (d) SEPNUX,51 (e) NAPDCX,52 and (f) CARCAZ.53 In each panel on the right, the bottom row shows the total lattice energy of the crystal (in kJ mol−1) and the corresponding molecules where the atoms have been recolored by their estimated lattice energy contribution. |
The strongest contributions are found in 1,2-di(2-pyridyl)ethylene (CSD ref. FIBHOP49) in a succinic acid molecule (Fig. 5(a)) that forms two sets of supramolecular synthons: one homosynthon with the other succinic acid (Fig. 5(b)), and one heterosynthon with the pyridine group (consistent with the literature on the strength of carboxylic–pyridine interactions56–58). Interestingly, this crystal also contains one of the most weakly interacting groups (Fig. 5(b)), in the second succinic acid molecule that only participates in the single homosynthon.
Carboxylic acids form the strongest cohesive interactions when participating in multiple synthons, particularly heterosynthons (typified by Fig. 5(a) and (c), and noted in earlier literature64). Moving to the right, we see the contribution decrease commensurate to the number of interactions. For example, in 3,5-pyrazoledicarboxylic acid (CSD ref. SEPNUX,51Fig. 5(d)), there are two carboxylic acid groups that have drastically different energy contributions – one that forms a doublet homosynthon and the other is without close contacts. In an extreme case (CSD ref. CARCAZ,53Fig. 5(f)), the carboxylic acid group is prevented from interacting due to the bulkiness of the overall molecule, leading to a neutral contribution.
An interesting success of this energy assignment is the ability to identify stabilizing motifs in otherwise unstable or metastable crystals. This is the case for CSD ref. NAPDCX52 (Fig. 5(e)), an unstable 1,4-naphthalene-dicarboxylic acid that has an overall positive lattice energy at ambient pressure and temperature.¶ Despite this instability, we can clearly identify a binding interaction between carboxylic acid groups.
Fig. 6 The interactions of benzene-like rings. (Left) Principal Covariates Regression (PCovR) map, where the color of each point denotes the estimated cohesive interaction of that motif and a marker on the color bar denotes the average value of benzene-like rings. The insets visualize several extremal or interesting motifs. (Right) Select crystalline configurations and energy assignments: CSD ref. (a) TATNBZ03,59 (b) BENZAC19,60 (c) PHENAN14,61 and (d) NOTVAT01.62 We also highlight the benzene-like motif from Fig. 5(f) in (e). In each panel on the right, the bottom row shows the total lattice energy of the crystal (in kJ mol−1) and the corresponding molecules where the atoms have been recolored by their estimated lattice energy contribution (on the same scale as on the left panel). |
The most strongly-binding benzene-like motifs occur in molecules where (1) the ring is functionalized by strongly interacting groups, (2) the interactions of these groups facilitate planar molecular geometry, and (3) stacking occurs between the benzene-like rings with these auxiliary groups. We see this in 2,4,6-trinitrobenzene-1,3,5-triamine (CSD ref. TATNBZ03,59Fig. 6(b)), where the aromatic carbon ring stacks above the primarily intramolecular nitro–amine interaction and in Fig. 6(a) (CSD ref. BENZAC1960), where they stack above the carboxylic acid homosynthon.
There are various reasons for weakly-binding benzene-like motifs, including weak stacking and steric hindrance. As is evident from Fig. 6(c) and (d), rings will resist crystallization when the interactions of the end groups lead to deformation of the ring geometry. Take for example phenanthrene (CSD ref. PHENAN14,61Fig. 6(c)), a high-pressure polymorph that is unstable at ambient conditions (therefore has an overall positive lattice energy for the DFT reference used). Interestingly, we can pinpoint the localization of this deformation by looking at the atoms with the strongest positive contribution. While the keen reader may infer that this is solely due to the remnant descriptor reflecting the difference in strained and relaxed molecular geometry, we will note that a large difference in these representations can also coincide with a wealth of stabilizing intermolecular interactions, demonstrating that this simple linear model can differentiate molecular deformation from the introduction of new interactions.
This is further supported by comparing the motifs of this polymorph with its ambient-pressure, stable counterpart (CSD ref. PHENAN0863) to see how the nature of the same molecule changes based upon the interactions in the crystal. Both polymorphs adopt a similar herringbone crystal structure; however, the decreased molecular distortion and increased interactions between the auxiliary hydrogens and neighboring aromatic rings in PHENAN0863 result in a significantly lower lattice energy of δc = −4.58. In Fig. 7, we project the motifs of PHENAN0863 and PHENAN1461 onto our PCovR map from Fig. 6, we see this reflected by a left-shift of the motifs on the map, where the center ring moves from strongly resisting crystallization (Fig. 7(c)) to weakly interacting (Fig. 7(f)) and the periphery rings move from weakly resisting crystallization (Fig. 7(a) and (b)) to weakly binding (Fig. 7(e) and (f)). It is worth noting that PHENAN0863 is an out-of-sample data point (ε = 0.2 kJ mol−1), demonstrating that the analysis in Fig. 6 is applicable beyond the initial reference set. We have included images of the PHENAN0863 crystal configuration in Fig. S5.†
Fig. 7 Comparing the motifs in polymorphs of phenanthrene. Here we project two distinct polymorphs of phenanthrene onto the PCovR map shown in Fig. 6. At ambient conditions, one polymorph is stable (PHENAN0863), while the other is unstable (PHENAN14,61 also shown in Fig. 6(c)). Looking at the same motifs in the unstable and stable phases, we see a shift leftwards as the motifs go from resisting crystallization to weakly binding. In the lower insets, we have recolored the atoms of the phenanthrene molecule based upon their contribution to the lattice energy in the different polymorphs. Note that the bicyclic carbon atoms, while no longer distorted, weakly resist crystallization, as they prevent the auxiliary hydrogens from more closely interacting with neighboring π bonds by distorting the molecule. |
Fig. 8 The interactions of nitro groups. (Left) Principal Covariates Regression (PCovR) map, where the color of each point denotes the estimated cohesive interaction of that motif. (Right) Select crystalline configurations and energy assignments: CSD ref. (a) TIJKEC,65 (b) KATZEN,66 (c) CUPYUJ,67 (d) KEDJUB,68 and (e) LAYSOV.69 We also highlight the nitro group of TATNBZ0359 from Fig. 6(b) in (f). In each panel on the right, the bottom row shows the total lattice energy of the crystal (in kJ mol−1) and the corresponding molecules where the atoms have been recolored by their estimated lattice energy contribution (on the same scale as on the left panel). |
The resonant or partial charge of the oxygen atoms leads to strong binding in hydrogen-rich environments, supported by the results in Fig. 8. This is best typified by trans-N,N-dimethyl-2-nitrovinylamine (CSD ref. MNETAM0170), a molecule where the nitro group is strongly interacting with the CH3 end groups with some potential π-hole stacking71 between the nitrogen moieties, as shown in Fig. 8(a). The strength of these binding interactions lessens with the strength of the electron donors, with smaller contributions in crystals where the primary O⋯H interaction is with amine donors (e.g., Fig. 8(c) and (d), CSD ref. CUPYUJ,67 KEDJUB68). In some of these cases, the binding is likely weakened by intramolecular interactions, similar to the contributions of the nitro groups in 2,4,6-trinitrobenzene-1,3,5-triamine (CSD ref. TATNBZ03,59Fig. 8(f), seen earlier in Fig. 6(b)). Finally, to the right of the map, we see the strongest repulsive interactions from nitro groups in proximity to other nitro or aromatic nitrogen groups, such as the nitro–oxidiazole interaction in 3-(3,5-dinitro-1H-pyrazol-4-yl)-4-nitro-1,2,5-oxadiazole (CSD ref. LAYSOV,69Fig. 8(f)).
We first compute the relaxed energies of the co-crystals and their molecular components, following the procedures outlined in ESI Appendix A† to obtain the reference geometries and binding energies of each crystal. For reference, our previous model built using eqn (9) achieves an RMSE of 0.45 kJ mol−1 and an MAE of 0.35 kJ mol−1 more than sufficient to distinguish between the different categories of co-forming molecules, yet unable to provide any guidance in isomeric contexts (we have included a labeled parity plot in Fig. S10†). Following the procedure outlined in ESI Appendix B,† we identify the functional groups within the ethenzamide and estimate the contribution of their interactions to the molecular binding.
As shown in Fig. 9, unsurprisingly, most of the binding interactions occur due to the acetamide group in the ethenzamide, with a 3.5 kJ mol−1 difference between the weakest contributing acetamide motif and the most strongly contributing ethyl group, which is beyond the error in the overall model. From Fig. 9, we can also see that, while the ethyl and benzene-like rings behave similarly to other similar motifs across the entire dataset, the acetamide and ether groups are generally more and less stabilizing, respectively, than their counterparts at large. With the ether groups, this is reasonable – the geometry of the ether prevents much intermolecular interaction. With the acetamide group, this demonstrates that there is a large range of engineering that can happen to affect crystalline stability, which might be beneficial when considering molecular solubility.
Fig. 9 Distribution of energetic contributions for the functional groups of ethenzamide. Following the procedure outlined in 3.2, we have computed the estimated contribution to the binding energy σ. Similar to Fig. 2, we have arranged the functional groups in order of average contribution, and the lines on each plot denote each group's extreme and mean contributions. Wider sections of the violin plot represent a higher probability that members of the population will take on the given value; the skinnier sections represent a lower probability. Here, darker sections refer to the distribution in the functional groups of the ethenzamide molecules, with lighter sections showing the distribution for the same functional group across the entire training set. |
From here, we use the PCovR of acetamide groups to identify other acetamide motifs that behave similarly or dissimilarly to those we see in the known ethenzamide co-crystals, as highlighted in Fig. 10. We first train our PCovR model on the acetamide groups in the training set, and project those from the ethenzamide dataset into the corresponding latent space. Because the interactions across the training set are much more diverse than within the ethenzamide set, we plot along the first and third covariate to show the best distinction between the two datasets.|| We define chemical similarity based upon the Euclidean distance in PCovR space – that is, we identify acetamide groups that appear at a similar place to those in ethenzamide co-crystals on the map in Fig. 10. Note that because we compute the distance using all covariates, some points that seem extremal in Fig. 10 are not, as they are closer or further from the ethenzamide acetamide groups in other dimensions.
Fig. 10 The interactions of acetamide groups. (Top) Principal Covariates Regression (PCovR) map, where the color of each point denotes the estimated cohesive interaction of that motif. Here we have plotted along the first and third covariates to best distinguish the acetamide groups of the known ethenzamide co-crystals from other groups. The markers correspond to: (x) known ethenzamide co-crystals, (o) crystals with similar acetamide interactions, and (*) crystals with dissimilar acetamide interactions. (middle) Crystals and molecules with similar acetamide interactions as those in the known ethenzamide co-crystals, including CSD ref. (a) HXBNZM,83 (b) MEGDOS,84 and (c) LORMOV.85 (bottom) Crystals dissimilar acetamide interactions to the known ethenzamide co-crystals: CSD ref. (d) AJEREM,86 (e) CAHOAM,87 (f) XAVZUP,88 (g) LIFNOE,89 (h) KEPDIU90 and (i) ATZCBX.91 Insets have been ordered from highest to lowest similarity to the interactions in the known ethenzamide co-crystals. |
We highlight the molecules that form the most similar acetamide networks to those in the ethenzamide dataset in Fig. 10 using an (o) marker and showing the molecule below. Those closest in PCovR space are molecules that form single acetamide homosynthons (e.g., Fig. 10(b) and (c), CSD ref. MEGDOS84 and LORMOV85) or heterosynthons with a carboxylic acid group (e.g., Fig. 10(a), CSD ref. HXBNZM83).
Perhaps more interesting are the acetamide groups that form different interactions, highlighted in the bottom panel in Fig. 10. These groups give insight into the other supramolecular synthons that form with ethenzamide across a range of stabilizing and destabilizing contributions. We see strong interactions in triazole-5-carboxaldehyde (Fig. 10(i), CSD ref. ATZCBX91), where the acetamide group forms a heterosynthon with the triazole group, and in O-carbamoylhydroxylamine (Fig. 10(e), CSD ref. CAHOAM87), where the small size of the molecule facilitates both multiple homosynthons between acetamide groups, as well as heterosynthons with the oxygen of the hydroxylamine groups. In 2-oxopyrrolidineacetamide dihydrate (Fig. 10(g), CSD ref. LIFNOE89), a network of hydrogen bonds is formed between acetamide groups and water molecules. In azidoacetamide (Fig. 10(f), CSD ref. XAVZUP88), we see an acetamide homosynthon formed at an offset so that the azide group can stack directly above the NH⋯O interaction. We see weaker interactions in tetrazole-5-carboxamide (Fig. 10(h), CSD ref. KEPDIU90), where the acetamide group is interacting with the azole group, which, when compared with triazole-5-carboxaldehyde (Fig. 10(i)), demonstrates a large range for acetamide-azole synthon binding. Finally, in 1-methoxyaziridine-2,2-dicarboxamide (Fig. 10(d), CSD ref. AJEREM86), despite multiple acetamide interactions, there is a weaker acetamide network, likely due to the geometry of the molecule itself.
We do not suggest that these molecules could be used directly as co-formers; the training set was obtained with diversity as the primary goal, with no regard for availability, toxicity, ease of synthesis, or stability. Instead, each of these related and unrelated crystals gives insight into the types of interactions that may beget new ethenzamide co-crystals. The molecules shown in Fig. 10 can be used as inspiration to identify co-former candidates from libraries of biocompatible compounds and to guide future crystallization studies.
In doing so, we have to strike a balance between several conflicting goals. By selecting structures from the CSD with maximal structural diversity, we ensure that we cover a broad range of chemical and packing motifs, while remaining focused on structures that are known to be experimentally realizable. By using a general-purpose, atom-centered structural representation that is capable of describing arbitrary structural correlations, we ensure that our data analysis is flexible and that it does not incorporate pre-conceived notions about molecular bonding. At the same time, we ensure that the model focuses on the features that are most relevant to determine crystal stability by building a remnant descriptor that mimics the definition of the lattice energy as a difference between the total energies of the crystal and its constituents.
The resulting models achieve a respectable mean absolute error of about 0.4 kJ mol−1 in predicting the atomic contributions to crystal stability using these descriptors that gives us a semi-quantitative estimate of the contribution of each atomic environment to the lattice energy and to compare between different co-crystals or between polymorphs that are stable at very different conditions. In order to translate these atomic contributions in a language that can be useful to crystal chemistry, we then assemble them to estimate the stabilizing power of traditional chemical groups (carboxylic acids, amines, …) and build data-driven maps that facilitate the comparison of different chemical environments by expressing simultaneously the structural variability and correlation with the lattice energy contribution. For each chemical moiety we provide an interactive map (on Materials Cloud48) that allows to juxtapose different types of crystal environments, to identify structural patterns that are either stabilizing or destabilizing, and to contrast them with conventional motifs (e.g. hydrogen-bonding), demonstrated here for a few selected cases. As we demonstrate for phenanthrene, it is also possible to use these maps to compare polymorphs of the same molecule, and to analyze molecular motifs for a structure that is not part of our original reference set. With these tools, we aim to guide those designing molecular co-crystals in identifying suitable co-formers, as demonstrated for the analgesic ethenzamide.
We hope that this library of molecular motifs will prove useful to applications to specific crystal-design problems. More broadly, we believe that the general ML protocol that we follow, combining regression of the ultimate target property with unsupervised analysis of molecular motifs, can inspire similar applications to the study of other classes of materials, ranging from metal and covalent organic frameworks to self-assembled monolayers and biological systems.
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2sc06198h |
‡ To improve the regressions of crystal and molecular energies, we subtract a baseline determined by linear regression of the atomic composition on the total energies. |
§ The lattice energy of the crystal is not the sum of these motif contributions, as (1) both are averaged quantities, and (2) a single crystal may have overlapping motifs. |
¶ Of the approximately 3200 crystals studied in this article, only 23 have a positive lattice energy. This is covered in more detail in ESI Appendix A3. |
|| There is an interactive map of the first four covariates within our online repository at Cersonsky et al.48 |
This journal is © The Royal Society of Chemistry 2023 |