Hart Plommer ab, Isaiah O. Betinol a, Tom Dupree b, Markus Roggen *b and Jolene P. Reid *a
a Department of Chemistry, University of British Columbia, Vancouver, British Columbia V6T 1Z1, Canada. E-mail: jreid@chem.ubc.ca
b Delic Labs, 3800 Wesbrook Mall, Vancouver, British Columbia V6S 2L9, Canada. E-mail: markus@deliclabs.com
First published on 28th November 2023
The extraction of compounds from natural sources is essential to organic chemistry, from identifying bioactive molecules for potential therapeutics to obtaining complex, chiral molecule building blocks. One industry that is currently leading in innovation of new botanical extraction methods and products is the cannabis industry, although it is still hampered by a lack of efficiency. Similar to chemical syntheses, anticipating the extraction conditions (flow rate, time, pressure, etc.) that will lead to the highest purity or recovery of a target molecule, like cannabinoids, is difficult. Machine learning algorithms have been demonstrated to streamline reaction optimization processes by constraining the parameter space to be physically tested to predicted regions of high performance; however, it is not altogether clear if these techniques extend to the optimization of extractions where the process conditions are even more expensive to evaluate, limiting the data available for assessment. Combining information from several sources could provide access to the requisite data necessary for implementing a data-driven approach to optimization, but little data has been made publicly available. To address this challenge and to evaluate the capabilities of machine learning for optimizing extraction processes, we built a dataset on the carbon dioxide supercritical fluid extraction (CO2 SFE) of cannabis by harmonizing data from various companies. Using this combinatorial dataset and new techniques for maximizing the information obtained from a single large scale experiment, we built robust machine learning models to accurately predict extraction yields. The resulting machine learning models also allow for the prediction of out-of-sample biomass variations, process conditions, and scales.
The dichotomy between the two approaches for obtaining important organic molecules is also expressed in the type and amount of data that is used to train powerful ML models. Most efforts in chemical synthesis rely on data sets extracted from the literature,5,12–14 produced by high throughput experimentation,4,15,16 or a combination of the two.17 However, these data sources do not exist or are in limited supply for training robust ML models for extraction processes. One potential explanation is that these separations are generally performed on a large scale (input > 1 kg), making the process conditions more expensive to evaluate. Therefore, the effective implementation of these tools will require innovative ways of maximizing the information gained from running a single large-scale experiment.
Considering these challenges and the importance of obtaining complex organic molecules from natural sources, we have evaluated the ability of ML algorithms to predict the impact of extraction conditions on compound recovery. The setting for this study is carbon dioxide supercritical fluid extraction (CO2 SFE), one of the major techniques to recover cannabinoids from cannabis and hemp with high levels of enrichment. Supercritical CO2 extraction stands out in cannabis processing for its efficiency and integrity in separating plant components, ensuring high-quality, safe extracts. This environmentally and economically favorable method aligns with pharmaceutical and food industry standards, preserving the essential properties of cannabis for optimal utilization.18,19
The composition of cannabis is rich with organic molecules including terpenes, phenolics, and cannabinoids, most of which are present in very small quantities (<1 wt%).20,21 Two major exceptions to this are cannabidiolic acid (CBDA) and tetrahydrocannabinolic acid (THCA), typically existing in >10 wt%. These acidic cannabinoids are converted into their neutral counterparts, cannabidiol (CBD) and tetrahydrocannabinol (THC), upon exposure to heat. Owing to their ubiquity, CBD and THC have been the focus of bioactive research in recent years. THC is the primary psychoactive and intoxicating component, while CBD is non-intoxicating but has received interest due to its antiseizure, antianxiety, and analgesic activities.22 In 2018, the FDA approved the first CBD-containing medicine, Epidiolex, for the treatment of epilepsy.23 These factors combined with favourable changes in the legal status of cannabis in many US states and Canada have led to a substantial demand for cannabis goods including cannabinoid concentrates and infused products.
Despite the potential of cannabinoids in pharmaceutical and consumer products, the labour-intensive and time-consuming extraction process impedes applications of these molecules. Consequently, there is an urgent need to develop more effective extraction methods to access these high-value compounds. Herein, we report that a random forest ML model trained on extraction parameters can be applied to predict the yield of cannabinoids obtained from CO2 SFE. To achieve this, we built a unique extraction database gathering results across several industry platforms (3 different instruments and 14 different cultivars) and describing the impact of various process conditions on the yield of a diverse set of cannabinoids. Our workflow includes new techniques for amplifying the information gathered from a single large-scale experiment, while also providing a prediction platform for extraction outcomes of untested conditions, cultivars, and scales (Fig. 1, bottom).
In compiling the database, another important factor to consider was how to measure the efficacy of the extraction process for each of the compounds. Reporting the recovery is the traditional method applied, but for some cannabinoid materials the mass does not simply reflect the original plant composition. As such, for each compound class we calculated the yield, which we define as the amount of material obtained from the extraction process as a percentage of the total cannabinoid mass within the input biomass (see ESI for full discussion†). This was viewed as a simple but crucial means to transform the data to a comprehensible scale where the percentages of each cannabinoid obtained would be recorded below 100%. The information is compiled to create a unique database, which we call CannaLit, consisting of 632 data entries covering eight different cannabinoid materials (Fig. 2A).
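The yield definition above can be written as a short calculation. This is a minimal sketch of one literal reading of that definition; the function name, argument names, and the example masses are hypothetical and are not taken from CannaLit or the ESI.

```python
def cannabinoid_yield(extracted_mass_g: float, input_cannabinoid_mass_g: float) -> float:
    """Yield (%) = mass obtained from the extraction / total cannabinoid mass
    in the input biomass * 100 (one reading of the definition in the text)."""
    if input_cannabinoid_mass_g <= 0:
        raise ValueError("input cannabinoid mass must be positive")
    return 100.0 * extracted_mass_g / input_cannabinoid_mass_g


# Hypothetical run: 1200 g of biomass containing 12 wt% total cannabinoids,
# from which 95 g of one cannabinoid is recovered.
biomass_g = 1200.0
total_cannabinoid_wt_fraction = 0.12
input_cannabinoid_mass_g = biomass_g * total_cannabinoid_wt_fraction
cbd_yield = cannabinoid_yield(95.0, input_cannabinoid_mass_g)
print(f"yield = {cbd_yield:.1f}%")
```

Normalizing by the cannabinoid content of the input, rather than by the raw biomass mass, is what keeps each entry on a 0–100% scale regardless of batch size.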
The yield range reported for some cannabinoids is sparse and typically biased towards lower amounts, which is expected to be key for accurate predictions. In many cases this value simply reflects the compound's natural abundance, while in other samples, the low number is a result of the complex chemical reactions occurring throughout the extraction process. For example, CBDA and THCA are thermally unstable and eventually converted to CBD and THC at elevated temperatures and times.24–27 Accordingly, it is expected that certain materials are strongly biased towards low yields as they are eventually converted to other compounds. Despite the similarities in the reported range of CBD and THC, the yield distribution dramatically differs. Further inspection shows most CBD recoveries are recorded over 50% while most THC recoveries fall below 50%. This asymmetry demonstrates how extraction yields can vary across cultivars even with similar conditions, in this case owing to the fact that CBD is more soluble in CO2 than THC.28,29 Explicitly accounting for different cultivar yields in a single extraction avoids the narrow yield distribution when analysing individual cannabinoids, minimizes biased experiment selection, and could offer improved predictive performances.
The dataset describes the impact of 10 distinctive parameters on the recovery of cannabinoids. Most of the extraction conditions are naturally featurized (e.g., flow rate, temperature, density, input masses etc.) with one-hot encoding required for interpreting categorical data (e.g., cannabinoid and extractor type). Coverage of the continuous process conditions is sparse and is strongly biased toward operating conditions with 330–370 bar and 60–70 °C. This data driven analysis of extraction conditions shows that industrial operators seem to focus on specific process regimes to search for high yield performance (Fig. 2B). To understand the benefit of building and employing a predictive model it is worthwhile to enumerate the accessible process space. Recognising that there are limits in adjusting certain process parameters (e.g., very high flow rates and pressures are not viable options) we calculated the number of all reasonable possible combinations of continuous variables of extraction time, temperature, pressure, and flow rate.
This amounts to 4.4 × 10^7 possible process conditions to be tried for a single biomass input. These results hint that developing an accurate model for extraction would allow for the prediction of cannabinoid yields for an enormous set of unperformed process conditions.
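The counting behind such an estimate is a product of the number of levels per variable. The sketch below assumes a regular grid for each of the four continuous variables; the bounds and step sizes are hypothetical placeholders, not the limits used in this work, so the resulting count is not expected to reproduce the figure quoted above.

```python
import math

# Hypothetical (low, high, step) grids for the four continuous variables.
# These bounds are illustrative assumptions, not the authors' limits.
grids = {
    "time_min": (30, 600, 10),
    "temperature_C": (35, 80, 1),
    "pressure_bar": (100, 500, 5),
    "flow_rate_kg_per_h": (5, 100, 5),
}


def n_levels(low: int, high: int, step: int) -> int:
    """Number of grid points from low to high inclusive."""
    return (high - low) // step + 1


levels = {name: n_levels(*bounds) for name, bounds in grids.items()}
# Every combination of levels is one candidate set of process conditions.
n_conditions = math.prod(levels.values())
print(levels, n_conditions)
```

Because the count is multiplicative, even modest grid resolutions give millions of candidate conditions, which is far beyond what can be tested at kilogram scale.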
A comparison of the ML modeling processes between the two approaches for accessing complex organic molecules shows inherent differences in terms of data diversity and output range distributions. This assessment is important as it allows further understanding of how performances displayed by ML on predicting the outcome of synthetic reactions could translate to extraction processes. To standardize the comparison, we sought data sets of similar size that were generated in a comparable way of combining different data sources. Considering these constraints, we compared our data set to a portion of another dataset compiled from the synthetic literature, NiCOlit,13 restricted to reactions including a boron coupling partner (677 reactions, a comparable number to our data set of 632 data entries). Fig. 3 shows the yield distribution between the two datasets (i.e., restricted NiCOlit and CannaLit) to be similar, with both showing biases towards lower yields specifically in the range of 0–10%, and towards higher values around 80–100%. In contrast, CannaLit has significantly lower density coverage in the 20–80% range, which could introduce biases that have not yet been studied. Overall, despite similarities in the reported yield distribution, the underlying structure of the reaction data considerably differs. Because extraction processes are primarily unpublished there is no motivation to record only the best results; however, there is a clear tendency to rely on established conditions as additional changes to the set-up are prohibitive. In contrast, literature data focused on chemical syntheses often include a larger sampling of reaction conditions especially early in the optimization campaign when little is known about the impact of the reaction components on the experimental outcome. Therefore, optimization tables are a useful but limited source of negative data due to the reporting biases in the synthetic literature.
Regressor | Train R² | Train MAE (%) | Test R² | Test MAE (%) | 10-fold CV | LOO |
---|---|---|---|---|---|---|
Random forest | 0.99 | 1.30 | 0.96 | 3.52 | 0.97 | 0.98 |
XGBoost | 1.00 | 1.09 | 0.96 | 3.30 | 0.98 | 0.98 |
SVR | 0.52 | 16.15 | 0.47 | 18.02 | 0.46 | 0.49 |
kNN | 1.00 | 0.02 | 0.98 | 2.78 | 0.97 | 0.97 |
Linear regression | 0.40 | 21.03 | 0.42 | 21.84 | 0.34 | 0.37 |
ElasticNet | 0.37 | 23.41 | 0.38 | 24.62 | 0.34 | 0.35 |
Ridge | 0.40 | 21.03 | 0.42 | 21.84 | 0.35 | 0.37 |
LASSO | 0.39 | 21.75 | 0.42 | 22.55 | 0.35 | 0.37 |
NiCOlit13,a | 0.81 | 10.73 | 0.27 | 21.61 | 0.45 | 0.48 |

a RF model created from only entries with available DFT features.

Fig. 4 RF regression model correlating the yield of various cannabinoids to a set of extractor operating conditions and categorical descriptors.
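For readers less familiar with the two statistics reported in the table, the following sketch computes the coefficient of determination (R²) and the mean absolute error (MAE) from their standard definitions. The observed and predicted yields are toy values chosen for illustration and are not entries from CannaLit.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy yields in percent, for illustration only.
observed = [5.0, 20.0, 55.0, 90.0]
predicted = [8.0, 18.0, 50.0, 92.0]

mae = mean_absolute_error(observed, predicted)
r2 = r_squared(observed, predicted)
print(f"MAE = {mae:.2f}%, R2 = {r2:.3f}")
```

Note that R² computed this way becomes negative whenever the residual sum of squares exceeds the total sum of squares, that is, when a model predicts worse than simply returning the mean of the observed values.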
Despite the clear experimental and yield distribution bias, the predictive performance on CannaLit is significantly better than that reported for NiCOlit (test R² = 0.27). In fact, when progressing to this learning task, we obtained some of the highest R² observed thus far in any yield correlation. One explanation could be that extraction processes are naturally parameterized, making the connection between set-up conditions and output relatively straightforward. This is in significant contrast to statistically modelling chemical syntheses where the structure of each reaction component must be described by carefully chosen numerical descriptors. While several reports demonstrate the need for appropriate featurization for high model accuracy,14,32 most reports link the moderate model performance to dataset distribution and size.6,13,33,34 These new findings could encourage the implementation of additional molecular representations to improve statistical model performances even with smaller, biased literature data sets.
To verify the model further, we next sought to determine how each variable contributes to the overall result. Gini importance values suggest that the input cannabinoid mass has the greatest effect on the extraction yield, followed by the identity of the cannabinoid encoded as a categorical descriptor, which is unsurprising given that this is the only descriptor that differentiates between identical sets of process conditions. Notably, this is congruent with previous reports that demonstrate that recovery of unique cannabinoids is impacted by different process conditions.11 Although identifying the key contributors to model performance confirms appropriate model logic, another important test is to determine the chemical validity of our chosen parameters and ensure the model is not simply learning the structure of the data.35 This is especially noteworthy given that our dataset is comprised of few individual extractions that are repeated. For each extraction, we replaced the features with an array of randomly generated numbers, essentially creating a unique barcode for each extraction but notably with no chemical information. Applying the model architecture to correlate these meaningless reaction barcodes shows the statistical scores to be much worse (R² = 0.22, Q² = −0.65, LOO = −0.65, and 10-fold CV = −0.65), suggesting that the regression model relies on chemically meaningful descriptors (see the ESI for more details†). Including the identity of the cannabinoid does improve the performance (R² = 0.84, Q² = 0.33, LOO = 0.29, 10-fold CV = 0.25); however, the statistics are still much worse than those with the chemically relevant features, especially when comparing performance on cross-validation or external test sets.
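The barcode control described above can be sketched as follows. This is an illustrative construction only: the function name, the extraction identifiers, the barcode length of 10, and the use of uniform random numbers are all assumptions, and the authors' actual procedure is described in the ESI.

```python
import random


def make_barcodes(extraction_ids, n_features, seed=0):
    """Replace the real features with one random vector per unique extraction.

    Rows that come from the same extraction share the same barcode, so the
    grouping structure of the data is kept while all chemical information
    is removed.
    """
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    barcodes = {}
    for ext_id in extraction_ids:
        if ext_id not in barcodes:
            barcodes[ext_id] = [rng.random() for _ in range(n_features)]
    return [barcodes[ext_id] for ext_id in extraction_ids]


# Hypothetical row labels: several cannabinoid entries can share one extraction.
rows = ["run_A", "run_A", "run_B", "run_C", "run_B"]
X_random = make_barcodes(rows, n_features=10)
print(len(X_random), len(X_random[0]))
```

A model of the same architecture would then be fitted to `X_random` in place of the real feature matrix; if it scored as well as the original, that would indicate the model was learning the structure of the data rather than the chemistry.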
Next, we performed tests to determine the extent to which our data harmonization strategy improves over standard baseline models. As noted above, different cannabinoids can have different extraction yields based only on their chemical properties and not the extraction conditions. We compared our model to a model trained on structural features of each cannabinoid (Morgan fingerprints36 with 2048 bits and radius = 2). Applying a random forest model to this dataset shows little predictive performance (R² = 0.36, MAE = 21.0, test R² = 0.38, test MAE = 21.7) and suggests that extraction conditions are key features for predicting recoveries.
On the basis of the key parameters identified in the ML model training and assessment, we surmised that the model could also be applied to predict the recoveries of scaled up procedures, an important goal in process development.
If the optimal extraction conditions are uncertain, scaling up a process can result in significant risk considering the substantial investments in resources; simply put, a bad outcome would be more costly than running multiple smaller-scale processes to evaluate the process conditions. Consequently, it is unsurprising that a histogram shows that many of the processes are run on a smaller scale, with a small fraction operated with total input cannabinoid materials >400 g (Fig. 6A). Accordingly, the dataset was split into small- and large-scale processes using 400 g as the threshold for filtering the data into the two bins. The large-scale processes were used as the held-out data while the model was retrained on the remaining data and then deployed to predict recovery of this test set. Fig. 6B shows that the model accurately predicts the recovery for each of the 29 large-scale data points with a small MAE of 6.8%. Ultimately, this demonstrates that our model also allows users to make informed decisions about scaling up a procedure, allowing costly resources to be conserved in the process.
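The scale-based hold-out can be expressed as a simple filter on the input cannabinoid mass. The sketch below uses the 400 g threshold from the text; the record layout, field names, and values are hypothetical.

```python
THRESHOLD_G = 400.0  # threshold from the text: inputs above this are "large scale"


def split_by_scale(records, threshold_g=THRESHOLD_G):
    """Return (small_scale, large_scale) lists split on input cannabinoid mass."""
    small = [r for r in records if r["input_cannabinoid_mass_g"] <= threshold_g]
    large = [r for r in records if r["input_cannabinoid_mass_g"] > threshold_g]
    return small, large


# Hypothetical entries; field names and values are illustrative only.
records = [
    {"run": "r1", "input_cannabinoid_mass_g": 85.0, "yield_pct": 72.0},
    {"run": "r2", "input_cannabinoid_mass_g": 400.0, "yield_pct": 64.0},
    {"run": "r3", "input_cannabinoid_mass_g": 615.0, "yield_pct": 58.0},
    {"run": "r4", "input_cannabinoid_mass_g": 1240.0, "yield_pct": 81.0},
]

train_records, held_out_records = split_by_scale(records)
print([r["run"] for r in train_records], [r["run"] for r in held_out_records])
```

Splitting on scale rather than at random means the held-out set lies outside the range of input masses seen in training, so the resulting error measures extrapolation to larger scale rather than interpolation.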
As a final prediction task, we aimed to quantify the improvement that our combinatorial dataset provides. To this end, the most common way of exploring a defined feature space is through design of experiments (DoE) modelling. The data acquired from this process is then used as the basis for model generation and eventual prediction of how changing the set-up conditions will affect the extraction yields. While DoE datasets are highly enabling for specific substrates, they can provide subpar inferences in new areas of feature space (e.g., new cultivars). Despite this, the high cost of running extractions often means that DoE-designed data from one cultivar must be used to predict another. In this regard, a portion of the CannaLit database was generated through DoE to explore how extractor conditions affect cannabinoid extraction yields for cultivar 14 (Fig. 7, blue). We tested how this data would perform in predicting the extraction yield for cultivar 13 (Fig. 7, red), which requires similar conditions to those obtained for cultivar 14 (Fig. 7). While a random forest model trained on only the DoE-generated data provides a lower error than a random forest model trained on CannaLit (MAE = 3.5% vs. 11.8%), it is clear that the CannaLit-trained model better captures the trends within the out-of-sample set. This is exemplified when comparing the range of predicted values, where the CannaLit-trained model mirrors the observed range (5.4–22% predicted, 0–20% observed) while the DoE-trained model only predicts values close to 0% yield (0–1.2% predicted, 0–20% observed).
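A lower MAE alongside a collapsed prediction range is possible when most observed values sit near one end of the scale. The toy example below illustrates that arithmetic only; the numbers are invented for illustration, are not taken from CannaLit, and the assumption that most out-of-sample yields are near zero is ours rather than a statement from the data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def value_range(values):
    """(minimum, maximum) of a list of predictions or observations."""
    return (min(values), max(values))


# Toy yields in percent: mostly zero, with two larger values (illustrative only).
observed = [0, 0, 0, 0, 0, 0, 0, 0, 10, 20]
near_zero_model = [0.0] * len(observed)          # always predicts about 0%
trend_model = [6, 6, 6, 6, 6, 6, 6, 6, 14, 21]   # offset, but follows the trend

mae_near_zero = mean_absolute_error(observed, near_zero_model)
mae_trend = mean_absolute_error(observed, trend_model)

print("near-zero model:", mae_near_zero, value_range(near_zero_model))
print("trend model:    ", mae_trend, value_range(trend_model))
print("observed range: ", value_range(observed))
```

In this toy case the constant predictor has the lower MAE yet carries no information about which conditions give higher yields, which is why the range of predicted values is worth inspecting alongside the error.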
Last, we performed each out-of-sample prediction task with a model trained on a THC-constrained version of CannaLit to demonstrate the benefits of our data harmonization strategy. This model exhibits similar training set statistics to the one trained on full CannaLit; however, for each out-of-sample prediction task the model built on full CannaLit outperforms the THC-only model (see ESI for full discussion†). This suggests that harmonizing data does not hinder and may increase model generalizability. Comparisons were not performed with other cannabinoids as the requisite data for out-of-sample comparisons is not available.
Evaluation of various machine learning algorithms proved several tree-based models to be accurate for our purposes and we obtained some of the strongest yield correlations observed to date. More broadly, this successful outcome reinforces the utility of a data-driven approach to optimization and the need for carefully constructed databases to achieve adequate prediction power.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00176h
This journal is © The Royal Society of Chemistry 2024