A simple and reliable QSPR model for prediction of chromatography retention indices of volatile organic compounds in peppers†
Abstract
Various types of pepper are used worldwide as food additives for their unique pungency, aroma, taste, and color. The spice is valued for the pungency contributed by the alkaloid piperine and for the aroma of its volatile essential oils, which are composed of volatile organic compounds (VOCs) in varying concentrations and ratios. In chromatography, compounds are identified by comparing the observed peaks with those of a reference standard. However, reference standards are sometimes unavailable, and the chemical information of some VOCs is not documented in reference libraries. To overcome these limitations, theoretical methods are applied to estimate the retention indices (RIs) of new VOCs. The aim of the present work is to develop a reliable QSPR model for the RIs of 273 VOCs identified in different types of pepper. Experimental retention indices were measured by comprehensive two-dimensional gas chromatography coupled to quadrupole mass spectrometry (GC × GC/qMS) on a coupled BPX5 and BP20 column system. The built-in Monte Carlo algorithm of the CORAL software is used to generate QSPR models from a hybrid optimal descriptor that combines SMILES and the hydrogen-filled graph (HFG). Ten splits are generated from the whole dataset of 273 VOCs, and each split is divided into four sets: active training, passive training, calibration, and validation. The balance of correlation method is applied with four target functions: TF0 (WIIC = WCII = 0), TF1 (WIIC = 0.5 and WCII = 0), TF2 (WIIC = 0 and WCII = 0.3), and TF3 (WIIC = 0.5 and WCII = 0.3). The statistical parameters obtained with each target function are compared. The simultaneous application of the index of ideality of correlation (IIC) and the correlation intensity index (CII) improves the predictive potential of the model. The best model is selected on the basis of the R2 value of the validation set.
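As a rough illustration, the four target functions can be read as a single expression controlled by the two weights WIIC and WCII. The sketch below assumes the additive form usually described for CORAL, in which a balance-of-correlations base term is extended by weighted IIC and CII contributions from the calibration set. The function name, the base-term formula with its 0.1 penalty, and the argument names are assumptions made for illustration; only the weight values come from the text above.

```python
# Weights (WIIC, WCII) of the four target functions, as stated in the abstract.
TARGET_FUNCTION_WEIGHTS = {
    "TF0": {"w_iic": 0.0, "w_cii": 0.0},
    "TF1": {"w_iic": 0.5, "w_cii": 0.0},
    "TF2": {"w_iic": 0.0, "w_cii": 0.3},
    "TF3": {"w_iic": 0.5, "w_cii": 0.3},
}


def target_function(name, r_active, r_passive, iic_calibration,
                    cii_calibration, balance_penalty=0.1):
    """Hypothetical target-function value for one Monte Carlo step.

    ASSUMPTION: the base term is the balance of correlations,
    r_active + r_passive - balance_penalty * |r_active - r_passive|,
    and IIC and CII of the calibration set are added with their weights.
    This form is not given in the abstract.
    """
    weights = TARGET_FUNCTION_WEIGHTS[name]
    base = r_active + r_passive - balance_penalty * abs(r_active - r_passive)
    return (base
            + weights["w_iic"] * iic_calibration
            + weights["w_cii"] * cii_calibration)
```

With both weights set to zero (TF0) the expression reduces to the base term, so TF1, TF2, and TF3 differ from TF0 only in which calibration-set index contributes to the optimization.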
The best model, obtained for split 6 with TF3 (WIIC = 0.5 and WCII = 0.3), gives the following statistics for the validation set: R2 = 0.9308, CCC = 0.9588, IIC = 0.7704, CII = 0.9549, Q2 = 0.9281, and RMSE = 0.544. The structural attributes that promote an increase or a decrease in RI are also extracted from the best model (split 6). Moreover, the proposed model was applied to an external validation set.
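Three of the reported validation statistics (R2, CCC, and RMSE) can be computed directly from paired observed and predicted RI values. The following minimal sketch assumes that R2 is the squared Pearson correlation coefficient and that CCC is Lin's concordance correlation coefficient; the abstract does not give the exact definitions, and the function name is hypothetical. IIC, CII, and Q2 are omitted because their definitions are not stated here.

```python
import math


def validation_statistics(observed, predicted):
    """Return R2, CCC and RMSE for paired observed/predicted values.

    ASSUMPTIONS: R2 is the squared Pearson correlation and CCC is
    Lin's concordance correlation coefficient, both computed with
    population (1/n) variances and covariance.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    var_obs = sum((o - mean_obs) ** 2 for o in observed) / n
    var_pred = sum((p - mean_pred) ** 2 for p in predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted)) / n

    r2 = cov ** 2 / (var_obs * var_pred)
    ccc = 2 * cov / (var_obs + var_pred + (mean_obs - mean_pred) ** 2)
    rmse = math.sqrt(sum((o - p) ** 2
                         for o, p in zip(observed, predicted)) / n)
    return {"R2": r2, "CCC": ccc, "RMSE": rmse}
```

A constant offset between observed and predicted values leaves R2 at 1 but lowers CCC and raises RMSE, which is one reason to report the three statistics together.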