Zachary T. P. Fried *a, Kin Long Kelvin Lee b, Alex N. Byrne c and Brett A. McGuire *de
aDepartment of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: zfried@mit.edu
bAccelerated Computing Systems and Graphics Group, Intel Corporation, 2111 NE 25th Ave., Hillsboro, OR 97124, USA
cDepartment of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
dDepartment of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: brettmc@mit.edu
eNational Radio Astronomy Observatory, Charlottesville, VA 22903, USA
First published on 9th May 2023
Machine learning techniques have been previously used to model and predict column densities in the TMC-1 dark molecular cloud. In interstellar sources further along the path of star formation, such as those where a protostar itself has been formed, the chemistry is known to be drastically different from that of largely quiescent dark clouds. To that end, we have tested the ability of various machine learning models to fit the column densities of the molecules detected in source B of the Class 0 protostellar system IRAS 16293-2422. By including a simple encoding of isotopic composition in our molecular feature vectors, we also examine for the first time how well these models can replicate the isotopic ratios. Finally, we report the predicted column densities of the chemically relevant molecules that may be excellent targets for radioastronomical detection in IRAS 16293-2422B.
In an attempt to predict molecular abundances without the need for complete networks, Lee et al.11 introduced a novel methodology involving machine learning. A major benefit of their approach, in contrast to traditional astrochemical modeling, is that it requires no prior knowledge of the conditions of an interstellar source or of any reaction pathways involving the previously detected molecules. Instead, abundances are expressed purely in terms of a chemical vector space. Simple regression algorithms were shown to significantly outperform traditional astrochemical models in reproducing the abundances of molecules already observed, and they provided a straightforward way to extrapolate to as-yet undetected molecules.
An interstellar source for which this machine learning technique could be effectively applied is the Class 0 protostar IRAS 16293-2422B (hereafter referred to as IRAS 16293B). IRAS 16293B is one component of the protostellar system IRAS 16293, which is located in the L1689 region of the ρ Ophiuchus cloud complex. Interferometric observations initially revealed two protostellar sources in IRAS 16293 (source A and source B), separated by around 5.1′′.12–14 Further high-resolution studies then confirmed that source A is in fact composed of two compact sources (source A1 and A2), making IRAS 16293 a triple protostellar system.15 Extensive observations have been made of this source with the Atacama Large Millimeter/submillimeter Array (ALMA) as part of the Protostellar Interferometric Line Survey (PILS) program.16 The submillimeter spectrum toward IRAS 16293B is especially rich with more than 10000 features detected.16 The line widths of the spectral peaks are also extremely narrow for a star forming region (∼1 km s−1 FWHM), which significantly reduces line confusion and makes this an excellent source for molecular detections.
The predictive power of the machine learning method introduced by Lee et al.11 may be especially useful for IRAS 16293B since a large portion of the molecular lines in the interstellar line survey remain unassigned. In fact, as of 2018, Taquet et al.17 noted that approximately 70% of the 5σ transitions identified in the ALMA Band 6 dataset were unidentified. If successful, this method might be able to provide an unbiased list of astrochemical targets not yet detected but which might be abundant enough to be contributing to the unidentified molecular lines. If subsequently detected, these molecules and their abundances could then be used to further constrain both the machine learning model and traditional network-based astrochemical models of low mass protostars. These models provide invaluable insight into the chemical processes and conditions relevant to the formation of stars similar to our Sun.
One aspect of interstellar chemistry that was not treated in the work of Lee et al.11 was the incorporation of isotopically substituted species. While certainly such rare isotopologues are present and detectable in TMC-1,18 detections of these species are more common toward IRAS 16293B and therefore provide substantially more insight into the chemical and physical history of this source. Therefore, it is desirable to update the machine learning model to include isotopologues. Molecules in IRAS 16293B consistently display isotopic ratios that are enhanced compared to the mean solar value and other interstellar sources, especially deuterium (D) and 13C substituted species. In fact, the deuterated isotopologues of ethanol, ketene, acetaldehyde, formic acid, formamide, and isocyanic acid were all first detected toward this source.19,20 Various doubly and triply deuterated species have been detected as well (e.g. Ilyushin et al.,21 Persson et al.22). Additionally, the 12C/13C ratios of dimethyl ether, methyl formate, ethanol, and glycolaldehyde toward IRAS 16293B are all much lower than the 12C/13C ratio of the local ISM.16,19 By convention, the deuterium ratios are reported as D/H while the 13C ratios are reported in the inverse manner. Therefore, a high D/H ratio and low 12C/13C ratio both denote isotopic enhancement.
A large portion of the remaining unassigned spectral peaks are predicted to arise from isotopically substituted species. In fact, Jørgensen et al.16 note that only 25% of the transitions correspond to the most common organic molecules detected in hot cores, including formaldehyde, methanol, methyl cyanide, isocyanic acid, ethanol, acetaldehyde, methyl formate, dimethyl ether, and ketene. Following this, they predict that the majority of the remaining transitions are likely related to various isotopically substituted molecules as well as more complex organic species. Thus, it is also vital to accurately model the column densities of isotopically substituted molecules so that the high abundance isotopologues in this source can be predicted, measured as needed in the laboratory, and their signals in the PILS survey identified and assigned.
Additionally, machine learning predictions of isotopic ratios are also useful since these ratios can act as tracers of the evolutionary history of an interstellar source along with the conditions, timescales, and pathways of molecular formation. For example, deuterium fractionation relies on gas-phase isotope exchange reactions that are strongly dependent on the temperature. Consequently, the deuterium fraction is a tracer of the conditions of the interstellar environment during molecular formation, with a high D/H ratio (i.e. high deuterium fraction) indicating cold formation temperatures.23,24 Therefore, accurate prediction of isotopic ratios would allow us to gain insight into the details of molecular formation and source history without requiring a dedicated search for these isotopically-substituted species that are often present in fairly low abundance.
In this work we apply the machine learning technique introduced by Lee et al.11 to IRAS 16293B. The machine learning pipeline used for this project along with the isotopic encoding is described in Section 3. Section 4.1 then presents the ability of the supervised machine learning regressors to model the molecular column densities in this source. Using these trained regression models, we obtain an unbiased list of predicted high-abundance targets for astronomical observation. Analysis of these molecular targets is presented in Section 4.2. Next, in Section 4.3 we test the ability of the regressors to model the isotopic ratios in this source. Finally, a list of high predicted column density isotopologues is provided in Section 4.4.
In order to include isotopologues in the training set and investigate isotopic ratios, it was necessary to encode isotopic composition in the feature vectors. In its current form, Mol2vec is not able to fully capture isotopic information. For example, it creates unique vectors for deuterium-substituted molecules but not for molecules that are substituted with 13C. This is because the molecular substructures are first encoded using Morgan fingerprints, which do not by default capture differences in 13C-substituted isotopologues (e.g. the default RDKit-constructed32 Morgan fingerprints33 of H2CO and H213CO are identical). To differentiate between the isotopologues in our dataset, we ensure that the Mol2vec-generated vectors are identical for all isotopologues of the same species and then add 19 extra dimensions that encode isotopic information as well as the chemical environment of the isotopic substitution (Fig. 1). The isotopic encoding is designed as follows:
• Dimensions 1–9: number of D, 34S, 33S, 36S, 13C, 17O, 18O, 15N, and 37Cl atoms in the molecule.
• Dimensions 10–12: whether the 13C atoms are sp, sp2 or sp3 hybridized.
• Dimension 13: whether the substituted 13C atom is bonded to oxygen.
• Dimensions 14–15: whether a deuterium atom is bonded to carbon or oxygen.
• Dimension 16: number of non-hydrogen atoms in the molecule.
• Dimensions 17–18: whether there is an oxygen or carbon atom two bonds away from the substituted deuterium.
• Dimension 19: number of deuterium atoms bonded to carbonyl carbons.
The RDKit module was used to obtain hybridization and bonding information.32 Because of the limited number of unique isotopologues in our dataset, the selection of these hand-picked features was largely dependent on which chemical substructures are present in enough molecules to constitute a reasonably sized training set. More specifically, each of the isotopic features denoted by dimensions 10–19 is present in at least three molecules in the dataset. Each of the selected features also has a notable impact on the average isotopic ratio.
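As an illustration of dimensions 1–9 only, the isotope counts can be read directly from an isotope-labelled SMILES string. The sketch below is an assumption-laden stand-in, not the pipeline used in this work: it relies on a regular expression rather than RDKit, the function name `isotope_counts` is hypothetical, and it ignores aromatic (lowercase) atoms. The hybridization and bonding features (dimensions 10–19) would still require a cheminformatics toolkit.

```python
import re

# Order of the nine isotope-count dimensions (dimensions 1-9 in the list above).
ISOTOPES = ["2H", "34S", "33S", "36S", "13C", "17O", "18O", "15N", "37Cl"]

# Matches the start of a bracket atom that carries an explicit mass number,
# e.g. "[13C" in "[13CH3]" or "[2H" in "[2H]".
_ISOTOPE_ATOM = re.compile(r"\[(\d+)([A-Z][a-z]?)")


def isotope_counts(smiles: str) -> list[int]:
    """Count isotopically labelled atoms in a SMILES string (dimensions 1-9 only)."""
    counts = dict.fromkeys(ISOTOPES, 0)
    for mass, symbol in _ISOTOPE_ATOM.findall(smiles):
        key = mass + symbol
        if key in counts:  # ignore mass labels outside the nine tracked isotopes
            counts[key] += 1
    return [counts[key] for key in ISOTOPES]


# Doubly deuterated formaldehyde and 13C-substituted methanol.
d2_formaldehyde = isotope_counts("[2H]C([2H])=O")
c13_methanol = isotope_counts("[13CH3]O")
```

Here the first entry of `d2_formaldehyde` is the deuterium count and the fifth entry of `c13_methanol` is the 13C count.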
Additionally, Mol2vec was unable to differentiate between several conformers of the detected molecules. Examples include ethylene glycol (for which both the aGg′ and gGg′ conformers were detected) as well as monodeuterated CH2DCH2OH and CH2DOCH3. For consistency, we used the column density of the most stable or abundant conformer in each case.
Using the resulting feature vectors as inputs and the log10 column densities as outputs, the data were split 80/20 into training and testing sets. To mitigate data leakage, all isotopologues of the same molecule were assigned together to either the training or the testing set. The datapoints were then bootstrapped with Gaussian noise in order to increase the effective dataset size to 800 and control overfitting.
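A minimal sketch of a leakage-aware split followed by noisy bootstrapping is given below. Several details are assumptions made for illustration and are not stated in the text: the split fraction is applied to parent molecules rather than to individual datapoints, the Gaussian noise is added to the log10 column densities (not the features), and the noise width `sigma` of 0.1 dex is arbitrary. The function names and the toy records are hypothetical.

```python
import random


def group_split(records, test_fraction=0.2, seed=0):
    """Split (parent, features, log10_N) records so no parent molecule spans both sets."""
    parents = sorted({parent for parent, _, _ in records})
    rng = random.Random(seed)
    rng.shuffle(parents)
    n_test = max(1, round(test_fraction * len(parents)))
    test_parents = set(parents[:n_test])
    train = [r for r in records if r[0] not in test_parents]
    test = [r for r in records if r[0] in test_parents]
    return train, test


def bootstrap_with_noise(records, n_samples=800, sigma=0.1, seed=0):
    """Resample with replacement and perturb each target with Gaussian noise (assumed, in dex)."""
    rng = random.Random(seed)
    resampled = []
    for _ in range(n_samples):
        parent, features, log_n = rng.choice(records)
        resampled.append((parent, features, log_n + rng.gauss(0.0, sigma)))
    return resampled


# Toy records with made-up feature vectors and column densities (illustrative only).
records = [
    ("methanol", [0.0], 19.0), ("methanol", [1.0], 17.5),
    ("ethanol", [0.0], 17.4), ("ethanol", [1.0], 16.0),
    ("ketene", [0.0], 16.7), ("formamide", [0.0], 16.0),
    ("formamide", [1.0], 14.5),
]
train, test = group_split(records)
boot = bootstrap_with_noise(train)
```

Splitting on the parent name guarantees that a deuterated or 13C-substituted variant never appears in the testing set while its main isotopologue sits in the training set.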
These resulting training and testing sets were then fed into two separate supervised machine learning regressors: Gaussian process regression (GPR)34 and Bayesian ridge regression (BR). These models learn relationships between the vector components to map the molecular features to the column density data. Each of the models was implemented with the SCIKIT-LEARN Python module.35 We determined the optimal hyperparameters for each model by first splitting the data into training and testing sets and then running a 5-fold grid search on the training data.
GPR is a nonparametric model that defines a probability distribution over all functions that can map the molecular descriptors to the column densities. It is therefore able to handle nonlinear relationships in the data. A kernel provides the model with prior knowledge regarding the shape and smoothness of the functions. Similarly to Lee et al.,11 the kernel we used was a linear combination of the rational quadratic, dot product, and white noise kernels. Along with the kernel function, additional hyperparameters include a noise value added to the kernel matrix diagonal that denotes the inherent Gaussian noise of the training observations.
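To make the kernel structure concrete, the sketch below evaluates a sum of rational quadratic, dot product, and white noise terms in plain Python. It is illustrative only: the weights, length scale, `alpha`, `sigma_0`, and noise level are placeholder values rather than the fitted hyperparameters, and the helper names are hypothetical. In practice scikit-learn provides `RationalQuadratic`, `DotProduct`, and `WhiteKernel` classes that can be combined with `+` and `*`.

```python
def rational_quadratic(x, y, length_scale=1.0, alpha=1.0):
    """Rational quadratic kernel: (1 + d^2 / (2 * alpha * l^2)) ** (-alpha)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return (1.0 + d2 / (2.0 * alpha * length_scale**2)) ** (-alpha)


def dot_product(x, y, sigma_0=1.0):
    """Dot product (linear) kernel: sigma_0^2 + x . y."""
    return sigma_0**2 + sum(a * b for a, b in zip(x, y))


def combined_kernel(x, y, same_point, w_rq=1.0, w_dot=1.0, noise_level=0.1):
    """Weighted sum of the three terms; white noise contributes only on the diagonal."""
    white = noise_level if same_point else 0.0
    return w_rq * rational_quadratic(x, y) + w_dot * dot_product(x, y) + white


def kernel_matrix(points):
    """Covariance matrix for a list of feature vectors."""
    return [
        [combined_kernel(p, q, same_point=(i == j)) for j, q in enumerate(points)]
        for i, p in enumerate(points)
    ]


K = kernel_matrix([[0.0, 1.0], [1.0, 0.0]])
```

The rational quadratic term captures smooth nonlinear structure, the dot product term supplies a linear trend, and the white noise term accounts for scatter in the observations.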
BR is a linear regressor that takes a probabilistic approach to optimize the ridge regression model coefficients. It does this by using a gamma distribution prior for the regularization coefficients. These parameters are then optimized through maximization of the log marginal likelihood. For this regressor, the hyperparameters define the shape and inverse scale of the Gamma distribution priors over the various model parameters.
Similarly to Lee et al.,11 a linear model is included in order to provide a baseline performance using an extremely simple model with a limited number of parameters. Bayesian ridge was specifically chosen due to its ability to report prediction uncertainties, which allows us to gauge the confidence level of the predicted values. A GPR model then displays the ability of a more complex model to improve upon this baseline regressor. GPR was also chosen due to its probabilistic and nonlinear nature. The reported uncertainties are fairly informative and interpretable since they can be linked directly to the designed covariance matrix.
Following the training of the regression models, the column densities of the molecules that are most chemically similar to those detected in IRAS 16293B were predicted using the trained models. K-means clustering with k = 10 was used to cluster the entire dataset of 3634046 feature vectors. All of the molecules detected in IRAS 16293B were assigned to the same cluster. Thus, we only considered the molecules assigned to this cluster when searching for detectable new species.
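The cluster-based selection can be sketched as a nearest-centroid lookup. The snippet assumes that centroids have already been fitted elsewhere (for example by a k-means implementation); the two-dimensional vectors, centroid positions, and function names are hypothetical and chosen only to keep the example small.

```python
def nearest_centroid(vector, centroids):
    """Index of the closest centroid by squared Euclidean distance."""
    dists = [
        sum((v - c) ** 2 for v, c in zip(vector, centroid)) for centroid in centroids
    ]
    return dists.index(min(dists))


def candidates_in_detected_cluster(candidates, detected, centroids):
    """Keep candidate molecules that fall in a cluster containing detected molecules."""
    detected_clusters = {nearest_centroid(vec, centroids) for vec in detected.values()}
    return [
        name
        for name, vec in candidates.items()
        if nearest_centroid(vec, centroids) in detected_clusters
    ]


# Toy 2-D feature vectors and pre-fitted centroids (illustrative values only).
centroids = [[0.0, 0.0], [10.0, 10.0]]
detected = {"methanol": [0.5, 0.2], "ethanol": [1.0, 0.8]}
candidates = {"propanol": [1.5, 1.0], "large_exotic_species": [9.0, 11.0]}
selected = candidates_in_detected_cluster(candidates, detected, centroids)
```

Only `propanol` survives in this toy case because it shares a cluster with the detected species.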
When analyzing the ability of the regressors to model the molecules in the training set and subsequently predict the column densities of the species in the testing set, we were limited to the molecules that have been previously detected toward IRAS 16293B. We were therefore only able to gauge the performance of these models on molecules that are relevant to this fairly small and homogeneous dataset. Consequently, we proceeded to remove the molecules that had no or few chemically similar examples in the dataset since the models did not have sufficient training examples to learn the required relationships for these species and we had no ability to gauge the accuracy of the model predictions. For this additional filtering, we removed molecules that contained atoms other than hydrogen, carbon, sulfur, nitrogen, and oxygen because all of the previously detected molecules were predominantly composed of these atoms. We also removed the remaining free radical species because nitric oxide (NO) is the only free radical for which a column density has been derived toward IRAS 16293B. This minuscule training set of free radicals resulted in the models not being able to sufficiently handle these molecules. For example, the free radical counterparts of various molecules had much higher predicted column densities than the observed values of the parent species. This is very unlikely due to the instability of free radicals and the general underabundance of radicals in protostellar sources. In fact, while 36.5% of interstellar molecules were first detected in star-forming regions, only 13.5% of radical species were first detected in these protostellar sources.36 This underabundance may be due to the larger gas-phase chemical inventory and warmer kinetic temperatures leading to a greater number of destruction partners for these highly reactive species.
Following these filtering steps, the column densities of the remaining 84863 molecules were then predicted using the previously trained regression models.
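The element and radical filters described above can be sketched as follows. This is a simplified illustration: it parses plain molecular formulas with a regular expression, and the `is_radical` flag is assumed to be supplied by the caller (for instance from a cheminformatics toolkit), since radical detection is not attempted here. The function names are hypothetical.

```python
import re

ALLOWED_ELEMENTS = {"H", "C", "S", "N", "O"}


def elements_in_formula(formula: str) -> set[str]:
    """Element symbols in a simple formula such as 'CH3Cl' (no isotope labels or charges)."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def keep_candidate(formula: str, is_radical: bool) -> bool:
    """True if the molecule contains only H, C, S, N, O and is not a free radical."""
    return not is_radical and elements_in_formula(formula) <= ALLOWED_ELEMENTS


# Hypothetical candidates: (formula, is_radical).
candidates = [("CH3OCH2OH", False), ("CH3Cl", False), ("CH3O", True)]
kept = [formula for formula, radical in candidates if keep_candidate(formula, radical)]
```

In this toy list only `CH3OCH2OH` passes both filters.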
Additionally, although the BR model showed an ability to precisely model the column densities with lower uncertainties than the GPR regressor, it was mainly limited by its linear mapping. With our current isotope encoding, a linear model will be unable to fully capture the relevant isotopic fractionation. For example, the difference between a singly and doubly deuterated molecule is in-part denoted with a 2 instead of a 1 in a single vector dimension. That said, the difference in the column densities of singly and doubly deuterated species is typically not simply 2/1 and can differ significantly between molecules. Thus, for the remainder of the analysis, a GPR model was used since a nonlinear mapping was required.
The large error bars on the GPR predictions are in part due to the small size of the dataset. Additional molecular detections (especially of main isotopologues) toward this source will allow for further constrained predictions. Moreover, despite the overall strong performance of the GPR model, this regressor overpredicts ethylene glycol (OHCH2CH2OH) and dimethyl ether (CH3OCH3) by over one order of magnitude. This prediction inaccuracy is likely because these molecules have few nearby neighbors in the training set. In fact, these molecules are the 10th and 12th furthest species from any neighbor in the dataset, respectively. Ethylene glycol is especially unusual in that it is the only molecule in the dataset containing two hydroxyl groups. Additionally, ethylene glycol's nearest neighbors are methoxymethanol and ethanol, each of which is more abundant.
Another notable prediction error is the slight overprediction of chloromethane and underprediction of methanol since it highlights the shortcomings of our molecular featurization. Mol2vec creates molecular feature vectors by combining vector representations of chemical substructures. Therefore, small molecules with some shared substructures have extremely similar feature vectors. In this case, Mol2vec generates similar feature vectors for all molecules that contain a methyl group bonded to a single heteroatom. Despite obvious chemical differences, the resulting vector representations of chloromethane and methanol are therefore very similar. Since methanol is one of the most abundant molecules in the dataset and chloromethane is one of the least abundant, the resulting prediction errors are fairly unsurprising.
In order to test the efficacy of the chosen kernel for the GPR model, we predicted the column density of cimetidine (C10H16N6S). This molecule is far more complex than any other species in our dataset and would certainly have an extremely low abundance in the interstellar medium. Therefore, an effective kernel would also produce a low column density prediction for this species. Ultimately, the trained GPR model predicted this molecule to have a column density of 6.94 × 10^6 cm−2, which is nearly eight orders of magnitude lower than any main isotopologue in the dataset. This provides confidence that the model is not over-fitting to the dataset and simply learning to predict each column density to be in the range of the detected species.
The chemical composition of the predicted molecules is displayed in Table 1. Oxygenated hydrocarbons were typically predicted to be in high abundance while those containing nitrogen and sulfur were predicted to have lower column densities. Of the 20 highest predicted column density molecules, 15 contain at least one oxygen atom, while 2 contain a nitrogen atom, and 1 contains a sulfur atom. The preference for oxygen-substituted molecules is not surprising since the most abundant detected species in IRAS 16293B are carbon monoxide (CO), methanol (CH3OH), formaldehyde (H2CO), methyl formate (HCOOCH3), dimethyl ether (CH3OCH3), carbonyl sulfide (OCS), and ethanol (CH3CH2OH) – each of which contain an oxygen atom.
Property | 20 highest predicted abundance molecules | 20 lowest predicted abundance molecules |
---|---|---|
Mean # of oxygen atoms | 1.65 | 0.75 |
Mean # of nitrogen atoms | 0.10 | 1.05 |
Mean # of sulfur atoms | 0.10 | 0.75 |
Mean degree of unsaturation | 0.75 | 1.65 |
Mean # of heavy atoms | 3.85 | 4.25 |
Mean molecular weight (amu) | 59.66 | 74.35 |
Highly saturated molecules were also predicted to be very abundant in IRAS 16293B. This is to be expected in a protostellar source since hydrogenation is very efficient on grain surfaces. Therefore, many of the species that are sublimated from grains as the protostar heats the surroundings are highly saturated (e.g. Linnartz et al.,37 Fedoseev et al.,38 Woon,39 Garrod et al.40).
The predictions also display a preference for lighter molecules that contain fewer heavy atoms. This also matches the detected chemical inventory in IRAS 16293B, in which the seven highest abundance molecules each contain four or fewer heavy atoms.
The following subsections highlight the astrochemical relevance of some of the molecules with the highest predicted column densities in Fig. 3. It is important to note that while a high column density is beneficial for interstellar molecular detectability, various additional factors must also be considered including the magnitude of the dipole moment, the spectral pattern, and intrinsic line strengths.
Fig. 4 displays the predicted column densities of the D and 13C substituted isotopologues along with the corresponding isotopic ratios. These predictions stem from 5-fold cross validation on the isotopically substituted data. In this process, the isotopologues are split into five subsets of training and validation data. In each iteration, 20% of the isotopically substituted molecules are left out of the training set. The model is then trained on all molecules in the dataset besides the 20% of rare isotopologues that were assigned to the validation set.
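The fold construction described above can be sketched as follows, with main isotopologues always kept in the training set and the rare isotopologues rotated through five validation subsets. The identifiers and the function name are hypothetical, and the random assignment of isotopologues to folds is an assumption.

```python
import random


def isotopologue_folds(main_ids, rare_ids, n_folds=5, seed=0):
    """Yield (train_ids, validation_ids); main isotopologues are always in training."""
    rare = list(rare_ids)
    random.Random(seed).shuffle(rare)
    for k in range(n_folds):
        validation = rare[k::n_folds]  # every n_folds-th rare isotopologue
        held_out = set(validation)
        train = list(main_ids) + [r for r in rare if r not in held_out]
        yield train, validation


main_ids = ["CH3OH", "CH3CHO"]
rare_ids = [f"rare_{i}" for i in range(10)]
folds = list(isotopologue_folds(main_ids, rare_ids))
```

Each rare isotopologue is validated exactly once, and it is never part of the training data in the fold where it is predicted.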
Because the deuterium and 13C ratios are reported in inverse fashions, the mean absolute errors of the two ratio plots differ dramatically. The points within the shaded regions denote the molecules for which the prediction error is less than the mean absolute prediction error. For the D/H ratios, the mean absolute error is 0.032. For the 12C/13C ratios, the mean absolute error is 20.9.
The column density predictions for isotopically substituted molecules are typically extremely accurate. However, when considering isotopic ratios, the range of realistic values is quite limited; therefore, a small prediction error is very notable. For example, deuterated acetaldehyde in IRAS 16293B is observed to have a D/H ratio of 7.98%. A column density under-prediction by 0.3 orders of magnitude would result in a predicted ratio of approximately 4.00%. This ratio suggests very different temperatures and timescales of formation.
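The acetaldehyde example above can be checked with a short calculation. The sketch assumes that the main isotopologue is predicted exactly, so that an error of 0.3 dex in the deuterated column density maps directly onto the ratio; the function name is hypothetical.

```python
def shifted_ratio(observed_ratio_percent: float, error_dex: float) -> float:
    """Ratio obtained when the rare isotopologue's log10 column density is off by error_dex."""
    return observed_ratio_percent * 10.0**error_dex


predicted = shifted_ratio(7.98, -0.3)
print(f"{predicted:.2f}%")  # prints 4.00%
```

Because 10^−0.3 is approximately 0.5, a seemingly modest error in log space halves the inferred deuterium fraction.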
There are a few molecules for which the ratio prediction is especially inaccurate. These species are labelled on Fig. 4. The prediction errors of HDCS and CH2DOCHO are especially notable since they highlight the shortcomings of using hand-picked descriptors. When encoding the chemical environment of the deuterium substitution, vector dimensions are included that denote whether there is a carbon or oxygen atom two bonds away from the deuterium atom. However, there was no consideration of whether the atom in this position is sulfur since HDCS is the only molecule in the dataset in which this is the case. Therefore, this important chemical environment information is not provided to the model, thus leading to a large prediction error. Additionally, with simple hand-picked features, the model is not always able to capture the nuances of isotopic fractionation. For example, the isotopic encoding of CH2DOCHO is very similar to that of CH2DOH. Because CH2DOH has a D/H ratio of only around 7%, the model inaccurately predicts CH2DOCHO to have approximately the same ratio.
Preferably we could include a more nuanced encoding of isotopic composition that better captures the local chemical environment instead of simple hand-picked features. However, with only 27 deuterated molecules and 15 13C substituted species, the dataset of unique isotopologues is too small to learn the required relationships with a complex featurization. As mentioned previously, Mol2vec is sensitive to some, but not all, isotopic substitutions. It can, however, create unique vectors for deuterated species. Therefore, we tested the ability to learn deuterium ratios from the original Mol2vec-produced vectors that more fully consider chemical context. These results are shown in Fig. 5. The D/H ratio of formic acid is omitted from the graph since a ratio of around 6 is predicted which skews the ability to view the remaining points. As can be seen, these predictions are far less accurate than when hand-picked features were used. This is because the vector representations of many of the deuterated species are quite dissimilar to the main isotopologue in this case. In fact, the vector representation of CH2DCH2OH is closer to that of propanal than that of CH3CH2OH. In order to include more detailed isotopic information in the feature vectors and thus accurately model isotopic fractionation, it is clear that we require additional isotopologue detections.
Finally, in order to test the predictive ability of the model on isotopologues for which no column density has been derived, we proceeded to predict the 12C/13C ratio of CO and CN with the trained GPR model from Section 4.1. For these two molecules (along with H2CO), a linear 12C/13C trend has been defined as a function of galactocentric distance (DGC). The formulae of these galactocentric gradients are displayed in eqn (1)–(3)
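$$^{12}\mathrm{CO}/^{13}\mathrm{CO} = (5.41 \pm 1.07)\,\mathrm{kpc}^{-1} \times D_{\mathrm{GC}} + (19.03 \pm 7.90) \tag{1}$$
$$^{12}\mathrm{CN}/^{13}\mathrm{CN} = (6.01 \pm 1.19)\,\mathrm{kpc}^{-1} \times D_{\mathrm{GC}} + (12.28 \pm 9.33) \tag{2}$$
$$\mathrm{H_2}^{12}\mathrm{CO}/\mathrm{H_2}^{13}\mathrm{CO} = (7.60 \pm 1.79)\,\mathrm{kpc}^{-1} \times D_{\mathrm{GC}} + (18.05 \pm 10.88) \tag{3}$$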
Using a galactocentric distance of 8.043 kpc for IRAS 16293, the range of expected ratios along with the ratios predicted by the GPR model are shown in Table 2. For reference, the observed 12C/13C ratio of H2CO is listed as well.22 The Galactocentric distance of IRAS 16293 was computed using the ASTROPY Python module.62 The values used in this calculation were the distances from the Earth to both the Galactic Center and IRAS 16293 (8.178 kpc and 141 pc (ref. 63 and 64)) as well as their respective sky coordinates.
Molecule | Galactocentric gradient 12C/13C ratio | Predicted 12C/13C ratio | Observed 12C/13C |
---|---|---|---|
CO | 46.04–79.05 | 50.85 | — |
CN | 41.72–79.52 | 27.34 | — |
H2CO | 53.90–104.46 | — | 52.92 |
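The gradient ranges in Table 2 can be reproduced to within rounding from the three galactocentric gradients. The sketch below assumes that the bounds are obtained by shifting the slope and intercept together by their quoted uncertainties; this propagation scheme is inferred from the tabulated values rather than stated explicitly, and the function name is hypothetical.

```python
# (slope, slope error, intercept, intercept error) for each 12C/13C gradient.
GRADIENTS = {
    "CO": (5.41, 1.07, 19.03, 7.90),
    "CN": (6.01, 1.19, 12.28, 9.33),
    "H2CO": (7.60, 1.79, 18.05, 10.88),
}


def ratio_range(molecule: str, d_gc_kpc: float) -> tuple[float, float]:
    """Lower and upper 12C/13C bounds, shifting slope and intercept together by their errors."""
    slope, d_slope, intercept, d_intercept = GRADIENTS[molecule]
    low = (slope - d_slope) * d_gc_kpc + (intercept - d_intercept)
    high = (slope + d_slope) * d_gc_kpc + (intercept + d_intercept)
    return low, high


# Galactocentric distance adopted for IRAS 16293 in the text.
for name in GRADIENTS:
    low, high = ratio_range(name, 8.043)
    print(f"{name}: {low:.2f}-{high:.2f}")
```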
The observed 12C/13C ratio of formaldehyde toward IRAS 16293B is very near the lower bound of the galactocentric trend error bars at 8.043 kpc. Interestingly, the GPR model predicts the CO and CN ratios to also be fairly near the lower bounds of the respective ratio gradients. This matches the observed trends of various other molecules in IRAS 16293B, which typically show high levels of 13C substitution. A high abundance of 13CO could stem from favorable ion-neutral isotope exchange reactions in the cold interstellar gas before CO freeze-out.65 Complex organic species that were then formed on grain surfaces from CO following freeze-out would inherit this small 12C/13C ratio. The 13C enhancement in organic molecules would be even more notable at later timescales since laboratory experiments have shown that 12CO desorbs slightly more efficiently than 13CO.66 This enables 12CO to sublimate from the grain at a lower temperature. Therefore, as the protostar begins to heat the surroundings to around the CO sublimation temperature, the more efficient 12CO sublimation would result in grain surfaces that are further enhanced with 13CO.
Overall, while the machine learning regressor is not precise enough to adequately model the exact isotopic ratios, the 13CO and 13CN predictions show that it still is able to learn the general overabundance of 13C in the organic species of IRAS 16293B.
Of these highest 10 predicted column density species, both 13C-substituted ethanol isotopologues and OD substituted ethanol were marginally detected toward IRAS 16293B. While we did not include these molecules in our training set, Jørgensen et al.19 derived tentative column densities for these species. These tentatively derived column densities along with the values predicted by the GPR model are listed in Table 3. As can be seen, all predictions closely match the tentative column densities. Additionally, while transitions corresponding to 13CH3OH and DCN have been identified toward IRAS 16293, no derived column density is listed.67
Molecule | Tentative column density (cm−2) | Predicted column density (cm−2) |
---|---|---|
CH3CH2OD | 1.1 × 10^16 | 2.15 × 10^16 |
13CH3CH2OH | 9.1 × 10^15 | 1.33 × 10^16 |
CH313CH2OH | 9.1 × 10^15 | 1.18 × 10^16 |
Therefore, the remaining molecules of interest are the three deuterated isotopologues of methoxymethanol along with the 13C isotopologue of methyl formate. Beyond these 10 isotopologues with the largest predicted column densities, another isotopologue with a high predicted abundance is deuterated methoxyethane (CH3CH2OCH2D). As mentioned previously, while large column densities are beneficial for detection, various other factors impact the detectability of a molecule. For all of these molecules, we therefore simulated their spectra toward IRAS 16293B using the predicted column densities in order to assess their true detectability. The microwave and sub-mm spectra of 13CH3OCHO have been experimentally studied and assigned, thus making interstellar detection currently possible. In fact, this rare isotopologue was detected in the Orion molecular cloud.68 However, this particular isotopologue is not present in the CDMS molecular spectroscopy database.69 None of the aforementioned deuterated isotopologues has been studied experimentally.
In order to simulate the spectra of these previously unstudied isotopologues, the rotational constants must first be calculated. A low-cost way to obtain molecular rotational constants for isotopically substituted species is to combine experimental data with ab initio calculations. In this process, the experimental rotational and distortion constants of the main isotopologue are first collected. The A, B, and C rotational constants of the parent species are then calculated at a given level of theory and basis set. For our work, we ran the calculations with the PSI4 (ref. 70) Python package and used the M06-2X functional with the 6-311++G(d,p) basis set. Assuming that the geometry remains constant upon isotopic substitution, the same computational methods are then used to calculate rotational constants of the rare isotopologues. Finally, it is assumed that the scaling factor between the experimental and calculated rotational constants is the same for the main isotopologues and the isotopically substituted molecules. For example, the B rotational constant of the isotopically substituted species can be calculated using eqn (4).
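$$B_{\mathrm{rare}}^{\mathrm{scaled}} = B_{\mathrm{rare}}^{\mathrm{calc}} \times \frac{B_{\mathrm{main}}^{\mathrm{exp}}}{B_{\mathrm{main}}^{\mathrm{calc}}} \tag{4}$$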
The experimental distortion constants and dipole moments of the main isotopologues were used as-is for the isotopically substituted molecules.
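The scaling procedure described above amounts to multiplying the computed constant of the rare isotopologue by the experimental-to-computed ratio of the main isotopologue. The sketch below uses made-up rotational constants in MHz purely for illustration; the function and variable names are hypothetical.

```python
def scaled_constant(calc_rare: float, calc_main: float, exp_main: float) -> float:
    """Scale a computed rotational constant by the main isotopologue's exp/calc ratio."""
    return calc_rare * (exp_main / calc_main)


# Illustrative values only (MHz): computed and experimental B of a main isotopologue,
# and the computed B of its isotopically substituted counterpart.
B_MAIN_CALC = 10000.0
B_MAIN_EXP = 9900.0
B_RARE_CALC = 9000.0

b_rare_scaled = scaled_constant(B_RARE_CALC, B_MAIN_CALC, B_MAIN_EXP)
```

The same function would be applied separately to the A, B, and C constants, each with its own scaling ratio.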
Following the rotational constant calculations, a rotational line catalog was generated using Pickett's SPCAT.71 Only the A, B, and C rotational constants and distortion constants (when available) were used. Internal rotation was not considered during the catalog simulations. The molsim (ref. 72) Python package was then used to simulate the spectra of the isotopologues toward IRAS 16293B with the predicted column densities. Molsim assumes that the molecular emission can be described by a single excitation temperature, and accounts for the effects of optical depth. For the simulations, the excitation temperature and vlsr of the main isotopologue were used. A source size of 0.5′′, beam diameter of 0.5′′, and line width of 1.0 km s−1 were used for each simulation.
Since ALMA Band 7 (329.147–362.896 GHz) is fully covered with the PILS observations, this frequency range is the predominant focus of our spectral simulations. Given the noise level of the PILS observations,16 any transition with a peak intensity stronger than ∼21 mJy per beam should be detectable at a 3σ significance. Analysis of the spectral simulations can be seen in Table 4. The simulated spectra of the 13C-substituted methyl formate isotopologue and the deuterated methoxyethane isotopologue toward IRAS 16293B are presented in Fig. 7.
Molecule | Number of transitions above 3σ | Number of transitions above 1σ | Intensity of strongest transition (mJy per beam) |
---|---|---|---|
13CH3OCHO | 81 | 86 | 571.0 |
CH3CH2OCH2D | 12 | 18 | 64.8 |
CH2DOCH2OH | 0 | 0 | 0.39 |
CH3OCHDOH | 0 | 0 | 0.18 |
CH3OCH2OD | 0 | 0 | 3.92 × 10−8 |
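The counts in the table follow from comparing each simulated peak intensity with the noise level. A minimal sketch of that bookkeeping is given below, assuming a 1σ noise of about 7 mJy per beam (inferred here from the ∼21 mJy per beam 3σ threshold quoted above). The function name and the list of peak intensities are hypothetical and do not correspond to any molecule in the table.

```python
def count_detectable(peaks_mjy, sigma_mjy=7.0):
    """Count simulated peaks above the 3-sigma and 1-sigma levels."""
    return {
        "n_3sigma": sum(1 for p in peaks_mjy if p > 3.0 * sigma_mjy),
        "n_1sigma": sum(1 for p in peaks_mjy if p > sigma_mjy),
        "strongest": max(peaks_mjy, default=0.0),
    }


# Hypothetical simulated peak intensities in mJy per beam.
peaks = [50.0, 30.0, 22.5, 15.0, 8.1, 3.2]
summary = count_detectable(peaks)
print(summary)
```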
Despite having higher predicted column densities, the rare isotopologues of methoxymethanol are predicted to have much weaker spectral peaks than the other isotopologues considered, owing to their small dipole moment. In contrast, with several transitions predicted to be stronger than 3σ, CH3CH2OCH2D is an excellent candidate for laboratory study for astrochemical purposes. Additionally, since the spectrum of 13CH3OCHO has already been collected and assigned, we recommend that this molecule be searched for in the PILS data.
This machine learning method has now been shown to effectively model the molecular column densities in two separate interstellar sources, and the resulting trained regression models can be used to predict which molecular species are likely to be abundant in these regions of interstellar space. However, the same techniques can be readily applied to the identification of terrestrial chemical mixtures as well. For example, if a researcher is able to reliably identify a fairly small number of chemical components present in an environmental sample along with their abundances, these supervised regressors could be trained and used to predict other components and contaminants of that mixture.
Formula | SMILES | Observed column density (log10, cm−2) | Reference |
---|---|---|---|
CO | [C−]#[O+] | 20.0000 | Drozdovskaya et al.25 |
CH3OH | CO | 19.0000 | Jørgensen et al.16,19 |
H2CO | CO | 18.2800 | Persson et al.22 |
CH3CH2OH | CCO | 17.3600 | Jørgensen et al.19 |
CH3OCH3 | COC | 17.3800 | Jørgensen et al.19 |
HCOOCH3 | COCO | 17.4100 | Jørgensen et al.19 |
CH2OHCHO | OCCO | 16.5100 | Jørgensen et al.16 |
CH3COOH | CC(O)O | 15.4500 | Jørgensen et al.16 |
CH3CHO | CCO | 17.0800 | Jørgensen et al.19 |
c-C2H4O | C1CO1 | 15.7300 | Lykke et al.73 |
HCOOH | OCO | 16.7500 | Jørgensen et al.19 |
aGg′-(CH2OH)2 | OCCO | 16.7200 | Jørgensen et al.16 |
CH3OCH2OH | COCO | 17.1500 | Manigand et al.52 |
C2H5CHO | CCCO | 15.3400 | Lykke et al.73 |
(CH3)2CO | CC(C)O | 16.2300 | Lykke et al.73 |
NH2CHO | NCO | 15.9800 | Coutens et al.20 |
HCN | C#N | 16.7000 | Drozdovskaya et al.25 |
CH3CN | CC#N | 16.6000 | Calcutt et al.74 |
CH3NC | [C−]#[N+]C | 14.3000 | Calcutt et al.75 |
HNCO | NCO | 16.5700 | Ligterink et al.76 |
HC3N | C#CC#N | 14.2600 | Calcutt et al.74 |
H2S | S | 17.2300 | Drozdovskaya et al.59 |
OCS | OCS | 17.4000 | Drozdovskaya et al.59 |
CH3SH | CS | 15.6800 | Drozdovskaya et al.59 |
CS | [C−]#[S+] | 15.5900 | Drozdovskaya et al.59 |
H2CS | CS | 15.1100 | Drozdovskaya et al.59 |
SO | OS | 14.6400 | Drozdovskaya et al.59 |
CH3Cl | CCl | 14.6600 | Fayolle et al.77 |
C2H3CHO | CCCO | 14.5300 | Manigand et al.78 |
C3H6 | CCC | 16.6200 | Manigand et al.78 |
CH3CCH | C#CC | 16.0400 | Calcutt et al.79 |
t-C2H5OCH3 | CCOC | 16.2600 | Manigand et al.52 |
C3H4O2 | OCCCO | 15.0000 | Coutens et al.80 |
CH3NCO | CNCO | 15.6000 | Ligterink et al.76 |
C2H5CN | CCC#N | 15.5600 | Calcutt et al.74 |
C2H3CN | CCC#N | 14.8700 | Calcutt et al.74 |
CH2CO | CCO | 16.6800 | Jørgensen et al.19 |
HONO | ONO | 14.9500 | Coutens et al.81 |
NO | [N]O | 16.3000 | Ligterink et al.82 |
CH3C(O)NH2 | CC(N)O | 14.9500 | Ligterink et al.82 |
SO2 | OSO | 15.1100 | Drozdovskaya et al.59 |
t-HCOOH | OCO | 16.7500 | Jørgensen et al.19 |
CH2NH | CN | 14.9031 | Ligterink et al.82 |
H213CO | [13CH2]O | 16.5563 | Persson et al.22 |
H2C17O | C[17O] | 14.8573 | Persson et al.22 |
H2C18O | C[18O] | 15.3617 | Persson et al.22 |
HDCO | [2H]CO | 17.1139 | Persson et al.22 |
D2CO | [2H]C([2H])O | 16.2041 | Persson et al.22 |
D213CO | [2H][13C]([2H])O | 14.3424 | Persson et al.22 |
HC15N | C#[15N] | 14.3979 | Drozdovskaya et al.25 |
13CH3CN | [13CH3]C#N | 14.7782 | Calcutt et al.74 |
CH313CN | C[13C]#N | 14.6990 | Calcutt et al.74 |
CH3C15N | CC#[15N] | 14.2041 | Calcutt et al.74 |
CH2DCN | [2H]CC#N | 15.1461 | Calcutt et al.74 |
CHD2CN | [2H]C([2H])C#N | 14.3010 | Calcutt et al.74 |
34SO2 | O[34SO] | 14.6021 | Drozdovskaya et al.59 |
O13CS | O[13CS] | 15.6990 | Drozdovskaya et al.59 |
OC34S | OC[34S] | 16.0000 | Drozdovskaya et al.59 |
OC33S | OC[33S] | 15.4771 | Drozdovskaya et al.59 |
18OCS | [18O]CS | 14.6990 | Drozdovskaya et al.59 |
C34S | [C−]#[34S+] | 14.3010 | Drozdovskaya et al.59 |
C33S | [C−]#[33S+] | 13.9031 | Drozdovskaya et al.59 |
C36S | [C−]#[36S+] | 13.1461 | Drozdovskaya et al.59 |
HDCS | [2H]CS | 14.1761 | Drozdovskaya et al.59 |
HDS | [2H]S | 16.2041 | Drozdovskaya et al.59 |
HD34S | [2H][34SH] | 15.0000 | Drozdovskaya et al.59 |
CD3OH | [2H]C([2H])([2H])O | 16.4914 | Ilyushin et al.21 |
CH2DOH | [2H]CO | 17.8513 | Jørgensen et al.19 |
CH3OD | [2H]OC | 17.2553 | Jørgensen et al.19 |
a-CH3CHDOH | [2H]C(C)O | 16.3617 | Jørgensen et al.19 |
CH3OCDO | [2H]C(O)OC | 16.1761 | Jørgensen et al.19 |
CH2DOCHO | [2H]COCO | 16.6812 | Jørgensen et al.19 |
CHDCO | [2H]CCO | 15.3010 | Jørgensen et al.19 |
13CH3OCH3 | CO[13CH3] | 16.1461 | Jørgensen et al.19 |
CH3CDO | [2H]C(C)O | 15.9823 | Jørgensen et al.19 |
H13COOH | O[13CH]O | 14.9191 | Jørgensen et al.19 |
CHD2OCHO | [2H]C([2H])OCO | 16.0414 | Manigand et al.83 |
CH337Cl | C[37Cl] | 14.3424 | Fayolle et al.77 |
NH2CDO | [2H]C(N)O | 14.3222 | Coutens et al.20 |
NH213CHO | N[13CH]O | 14.1761 | Coutens et al.20 |
DNCO | [2H]NCO | 14.4771 | Coutens et al.20 |
HN13CO | N[13C]O | 14.6021 | Coutens et al.20 |
CHDOHCHO | [2H]C(O)CO | 15.5211 | Jørgensen et al.16 |
CH2ODCHO | [2H]OCCO | 15.1761 | Jørgensen et al.16 |
CH2OHCDO | [2H]C(O)CO | 15.2148 | Jørgensen et al.16 |
CH318OH | C[18OH] | 16.2718 | Jørgensen et al.16 |
13CH2CO | [13CH2]CO | 14.8513 | Jørgensen et al.19 |
CH213CO | C[13C]O | 14.8513 | Jørgensen et al.19 |
13CH3CHO | [13CH3]CO | 15.2553 | Jørgensen et al.19 |
CH313CHO | C[13CH]O | 15.2553 | Jørgensen et al.19 |
t-DCOOH | [2H]C(O)O | 15.0414 | Jørgensen et al.19 |
t-HCOOD | [2H]OCO | 15.0414 | Jørgensen et al.19 |
a–a-CH2DCH2OH | [2H]CCO | 16.4313 | Jørgensen et al.19 |
Asym-CH2DOCH3 | [2H]COC | 16.6128 | Jørgensen et al.19 |
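Reading the tabulated values as base-10 logarithms of the column density in cm−2 (an assumption based on their magnitude), the isotopic ratios discussed in this work follow directly from differences between rows. The short sketch below illustrates this for a few entries copied from the table; the dictionary and function names are hypothetical.

```python
# log10 column densities copied from the table above (assumed units: cm^-2).
LOG_N = {
    "CH3OH": 19.0000,
    "CH2DOH": 17.8513,
    "CH3OD": 17.2553,
    "H2CO": 18.2800,
    "HDCO": 17.1139,
    "H213CO": 16.5563,
}


def column_density_ratio(isotopologue, main, table=LOG_N):
    """Ratio N(isotopologue) / N(main) from log10 column densities."""
    return 10.0 ** (table[isotopologue] - table[main])


for iso, main in [("CH2DOH", "CH3OH"), ("CH3OD", "CH3OH"), ("HDCO", "H2CO")]:
    print(f"N({iso}) / N({main}) = {column_density_ratio(iso, main):.4f}")
```

These are raw column density ratios. Converting them to an elemental D/H ratio would additionally require the statistical correction for the number of equivalent hydrogen positions, which is not applied here.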
This journal is © The Royal Society of Chemistry 2023