Xiao Liu*, Shi-Yuan Wang and Jia Wang
College of Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing, 400044, China. E-mail: liuxiao@cqu.edu.cn; Fax: +86-023-65103544; Tel: +86-023-65106960-707
First published on 23rd November 2011
At present, methods for determining essential genes depend on biochemical experiments. There is therefore a demand for the development of analysis methods and software for identifying essential genes, based on the common features of these genes. In this study, we employed the Hurst exponent as a characteristic parameter and analyzed its distribution among nine bacterial species. We found that most of the significance levels of the Hurst exponents of essential genes were higher than those of the corresponding full-gene-set. Conversely, most of the significance levels of the Hurst exponents of nonessential genes remained unchanged or only increased slightly. Therefore, we propose that this feature represents a restraint for pre- or post-design checking of bacterial essential genes in computer-aided design.
Insight, innovation, integration
Our study aimed to determine the common features of bacterial essential genes, to aid in the design of analysis methods and software for their identification. Because self-similarity exists in DNA sequences, we employed the Hurst exponent to characterize this property for genes in nine bacterial genomes, including essential and non-essential genes, and analyzed the distribution of their Hurst exponents, based on a normal distribution model, to study whether and how the self-similarity features change in reduced DNA sequences. We found that most of the significance levels of the Hurst exponents of the essential genes were higher than those of the corresponding full-gene-set. Conversely, most of the significance levels of the Hurst exponents of the nonessential genes remained unchanged or only increased slightly. Therefore, we propose that this feature represents a restraint for pre- or post-design checking of bacterial essential genes in computer-aided design.
Two different approaches have been employed to construct a minimal genome: bottom-up and top-down. The bottom-up approach constructs an artificial genome by chemical synthesis, whereas the top-down approach uses an existing organism with the aim of simplifying its genome into a reduced form.3 Software and algorithms of computer-based design and simulation have been proposed to aid the synthetic approach. However, methods for determining essential genes still depend on biochemical experiments,4,5 highlighting the need for identification of characteristic features of these genes to aid investigations into genome reduction.
Long-range correlation (or self-similarity) is a phenomenon that may arise in the analysis of very large spatial or time series data sets. This phenomenon has also been observed in DNA sequences, in both coding and non-coding regions.6–8 The Hurst exponent is a characteristic parameter of self-similarity that has been used in many studies. Yu and Anh9 proposed a time series model based on the global structure of the complete genome, and used this model to study the classification and evolutionary relationship of bacteria. Liu et al.10 studied the distribution curves of the Hurst exponents of exon, intron and promoter sequences based on the Z-curve. Boekhorst et al.11 studied the efficiency of rescaled range analysis and detrended fluctuation analysis in distinguishing between coding DNA, regulatory DNA and non-coding non-regulatory DNA of Drosophila melanogaster. Craciun et al.12 examined the long-range correlation between hydrophilicity and flexibility, a common feature of proteins, along some 80 calcium-binding protein sequences. However, it remains unclear whether and how these self-similarity features change in reduced DNA sequences.
In this study, we employed the Hurst exponent to characterize the self-similarity of genes in nine bacterial genomes, including essential and non-essential genes, and analyzed the distribution of their Hurst exponents based on a normal distribution model.
Analysis objects | NCBI RefSeq accession number | Gene number (Full-gene-set) | Gene number (Essential) | Gene number (Actual)a |
---|---|---|---|---|
Escherichia coli K-12 MG1655 | NC_000913 | 4145 | 712 | 700 |
Acinetobacter ADP1 | NC_005966 | 3307 | 499 | 493 |
Mycoplasma genitalium G37 | NC_000908 | 475 | 381 | 364 |
Staphylococcus aureus NCTC 8325 | NC_007795 | 2891 | 351 | 351 |
Bacillus subtilis 168 | NC_000964 | 4176 | 271 | 244 |
Francisella novicida U112 | NC_008601 | 1719 | 392 | 386 |
Mycobacterium tuberculosis H37Rv | NC_000962 | 3988 | 614 | 604 |
Salmonella typhimurium LT2 | NC_003197 | 4423 | 230 | 228 |
Staphylococcus aureus N315 | NC_002745 | 2583 | 302 | 297 |
a Some of the essential gene names listed in the DEG were not found in the corresponding genome file from NCBI.
(i) Nucleotide sequences were transformed into digital sequences by expressing each nucleotide as a digital number. The four nucleotides, A, G, C and T, were assigned the digital numbers 0, 1, 2 and 3, respectively.16 For example, the DNA fragment ATTCAC is transformed to 033202.
(ii) The subsequences corresponding to each gene were selected and their Hurst exponents were calculated based on the MP method in the R software.
(iii) The distribution properties of the Hurst exponents of the full-gene-set, the essential genes and the nonessential genes of each organism were analyzed using SPSS software (IBM, Armonk, NY) as follows.
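Steps (i) and (ii) can be illustrated with a minimal Python sketch. The encoding follows the mapping given in step (i). The Hurst estimate, however, uses classical rescaled-range (R/S) analysis as a stand-in, because the MP method used in the study is not specified in enough detail here to reproduce; the function names `encode_dna`, `rescaled_range` and `hurst_rs`, and the minimum window size of 8, are illustrative assumptions rather than the authors' implementation.

```python
import math
import random

# Mapping stated in step (i): A, G, C, T -> 0, 1, 2, 3.
NUCLEOTIDE_CODE = {"A": 0, "G": 1, "C": 2, "T": 3}


def encode_dna(seq):
    """Turn a nucleotide string into a list of integers (step i)."""
    return [NUCLEOTIDE_CODE[base] for base in seq.upper()]


def rescaled_range(window):
    """R/S of one window: range of cumulative deviations over the std."""
    n = len(window)
    mean = sum(window) / n
    cumulative = lowest = highest = squares = 0.0
    for x in window:
        cumulative += x - mean
        lowest = min(lowest, cumulative)
        highest = max(highest, cumulative)
        squares += (x - mean) ** 2
    std = math.sqrt(squares / n)
    return (highest - lowest) / std if std > 0 else None


def hurst_rs(series, min_window=8):
    """Estimate H as the slope of log(mean R/S) against log(window size).

    Assumption: R/S analysis replaces the study's (unspecified) MP method.
    """
    n = len(series)
    xs, ys = [], []
    size = min_window
    while size <= n // 2:
        values = []
        # Non-overlapping windows of the current size.
        for start in range(0, n - size + 1, size):
            rs = rescaled_range(series[start:start + size])
            if rs is not None:
                values.append(rs)
        if values:
            xs.append(math.log(size))
            ys.append(math.log(sum(values) / len(values)))
        size *= 2
    if len(xs) < 2:
        raise ValueError("sequence too short for R/S analysis")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    variance = sum((x - mx) ** 2 for x in xs)
    return covariance / variance


print(encode_dna("ATTCAC"))  # [0, 3, 3, 2, 0, 2], i.e. 033202

# A synthetic gene-length sequence, used only to show the call.
random.seed(0)
synthetic_gene = "".join(random.choice("AGCT") for _ in range(2000))
print(round(hurst_rs(encode_dna(synthetic_gene)), 3))
```

In practice one value of H would be computed per gene, giving one sample of Hurst exponents for each gene set of each organism.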
First, the distributions of the Hurst exponents of the full-gene-set, the essential genes and the nonessential genes of each organism were analyzed based on the quantile–quantile (Q–Q) plot. This is a graphical method for comparing two probability distributions, usually the sample distribution function and a theoretical distribution function. If the data follow the assumed theoretical distribution, the points on the Q–Q plot will fall approximately on a straight line; otherwise the points will depart from a straight line.
In our study, the straight line in a Q–Q plot represents the expected normal statistics. If the investigated data follow a normal frequency distribution, a plot of the data against the expected statistics should produce a straight line. All the plots were produced by the SPSS Analyze > Descriptive Statistics > Q–Q procedure.
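The construction of a normal Q–Q plot can be sketched in a few lines of Python. This is an illustration of the idea, not a reproduction of the SPSS procedure: the Blom plotting position used for the expected quantiles is an assumption, and `qq_points` and `straightness` are hypothetical helper names. The sample data are synthetic and stand in for a set of Hurst exponents.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Return (observed, expected-normal) pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    # Normal distribution fitted with the sample mean and standard deviation.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    pairs = []
    for rank, observed in enumerate(ordered, start=1):
        # Blom plotting position (an assumption, not taken from the paper).
        p = (rank - 0.375) / (n + 0.25)
        pairs.append((observed, fitted.inv_cdf(p)))
    return pairs


def straightness(pairs):
    """Pearson correlation of the Q-Q points; values near 1 mean near-linear."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                       sum((y - my) ** 2 for y in ys))
    return covariance / spread


random.seed(1)
normal_like = [random.gauss(0.6, 0.05) for _ in range(500)]  # synthetic
skewed = [random.expovariate(1.0) for _ in range(500)]       # synthetic
print(straightness(qq_points(normal_like)))
print(straightness(qq_points(skewed)))
```

Plotting the pairs with the observed values on the horizontal axis matches the axis layout described for Fig. 1; the correlation is only a numerical summary of how close the points lie to a straight line.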
Next, the Kolmogorov–Smirnov (K–S) test was used to calculate, for each organism, the significance levels of the Hurst exponents of the full-gene-set, the essential genes and the nonessential genes against a normal distribution. This procedure provides a significance level, which is then used to evaluate quantitatively whether a dataset differs significantly from an assumed theoretical distribution.
In our study, the assumed distribution is the normal distribution. The normal distribution hypothesis is rejected when the significance level is less than 0.05, and is not rejected when the significance level is greater than or equal to 0.05. All K–S tests were produced by the SPSS Analyze > Nonparametric Tests > 1-Sample K–S procedure.
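The decision rule can be made concrete with a small one-sample K–S sketch in Python. Two points are assumptions rather than details given in the text: the normal distribution is fitted with the sample mean and standard deviation, and the significance level is the asymptotic two-sided value from the Kolmogorov series. The helper names `kolmogorov_sf` and `ks_normal` are hypothetical, and the data are synthetic.

```python
import math
import random
from statistics import NormalDist, mean, stdev

ALPHA = 0.05  # threshold stated in the text


def kolmogorov_sf(z):
    """Asymptotic two-sided significance: 2 * sum (-1)^(k-1) exp(-2 k^2 z^2)."""
    if z < 0.27:
        # The series converges slowly here and the value is 1 to many digits.
        return 1.0
    total = 0.0
    for k in range(1, 101):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
    return max(0.0, min(1.0, 2.0 * total))


def ks_normal(values):
    """Return (D, significance) for a K-S test against a fitted normal."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: parameters are estimated from the sample itself.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    d = 0.0
    for i, x in enumerate(ordered):
        cdf = fitted.cdf(x)
        # Largest gap between the fitted CDF and the empirical step function.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d, kolmogorov_sf(math.sqrt(n) * d)


random.seed(2)
samples = {
    "normal-like": [random.gauss(0.6, 0.05) for _ in range(300)],  # synthetic
    "skewed": [random.expovariate(1.0) for _ in range(300)],       # synthetic
}
for label, data in samples.items():
    d, significance = ks_normal(data)
    verdict = "not rejected" if significance >= ALPHA else "rejected"
    print(label, round(d, 3), verdict)
```

Estimating the parameters from the same sample makes the asymptotic significance level conservative, so values obtained this way should be read as indicative rather than exact.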
Fig. 1 Q–Q plots of the Hurst exponents of the full-gene-set and the essential genes of the nine objects. A, C, E, G, I, K, M, O and Q show Q–Q plots of the Hurst exponents of the full-gene-set of the nine objects, respectively. B, D, F, H, J, L, N, P and R show Q–Q plots of the Hurst exponents of the essential genes of the nine objects, respectively. The quantiles of the observed organism's distribution and normal distribution are plotted on the horizontal and vertical axis in each plot, respectively.
Analysis objects | NCBI RefSeq accession number | Full-gene-set | Essential genes | Nonessential genes |
---|---|---|---|---|
Escherichia coli K-12 MG1655 | NC_000913 | 0.000 | 0.575 | 0.000 |
Acinetobacter ADP1 | NC_005966 | 0.052 | 0.492 | 0.083 |
Mycoplasma genitalium G37 | NC_000908 | 0.996 | 0.957 | 0.576 |
Staphylococcus aureus NCTC 8325 | NC_007795 | 0.000 | 0.099 | 0.002 |
Bacillus subtilis 168 | NC_000964 | 0.004 | 0.425 | 0.005 |
Francisella novicida U112 | NC_008601 | 0.048 | 0.320 | 0.074 |
Mycobacterium tuberculosis H37Rv | NC_000962 | 0.000 | 0.709 | 0.000 |
Salmonella typhimurium LT2 | NC_003197 | 0.003 | 0.509 | 0.005 |
Staphylococcus aureus N315 | NC_002745 | 0.005 | 0.287 | 0.017 |
The significance levels of the Hurst exponents of the essential genes were greater than those of the corresponding full-gene-set in eight of the nine organisms studied. For example, in E. coli strain K-12 MG1655 the significance level rose from <0.001 for the full-gene-set to 0.575 for the essential genes. Similarly, in Bacillus subtilis strain 168 it rose from 0.004 to 0.425. In these organisms the normal distribution hypothesis is therefore not rejected for the Hurst exponents of the essential genes. By contrast, in Mycoplasma genitalium strain G37 the significance level of the essential genes (0.957) was lower than that of the full-gene-set (0.996), which we propose is because of the smaller genome of this organism (see Table 1).
We also observed that most of the significance levels of the Hurst exponents of the nonessential genes remained unchanged or increased only slightly compared with those of the corresponding full-gene-set. For example, the significance level of E. coli strain K-12 MG1655 remained unchanged (<0.001). Similarly, the significance level of B. subtilis strain 168 changed only slightly, from 0.004 to 0.005, so the normal distribution hypothesis is still rejected for the Hurst exponents of its nonessential genes, in contrast to the result for its essential genes. The significance level of Acinetobacter strain ADP1 changed slightly, from 0.052 to 0.083. Although the normal distribution hypothesis is not rejected in either case in this organism, both values are far lower than that of its essential genes (0.492). The results obtained with M. genitalium strain G37 did not follow the same trend: the significance level of its nonessential genes (0.576) was lower than that of its full-gene-set (0.996). Again, we propose that this is caused by the smaller size of the genome in this organism.
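The comparisons described above can be tallied directly from the significance levels in the table. The following Python sketch uses only the tabulated values and the 0.05 threshold stated in the text; the function name `summarize` and the result keys are illustrative. Because the values are reported to three decimals, "unchanged" here means equal at that precision.

```python
ALPHA = 0.05  # threshold stated in the text

# Significance levels from the table: (full-gene-set, essential, nonessential).
SIGNIFICANCE = {
    "Escherichia coli K-12 MG1655": (0.000, 0.575, 0.000),
    "Acinetobacter ADP1": (0.052, 0.492, 0.083),
    "Mycoplasma genitalium G37": (0.996, 0.957, 0.576),
    "Staphylococcus aureus NCTC 8325": (0.000, 0.099, 0.002),
    "Bacillus subtilis 168": (0.004, 0.425, 0.005),
    "Francisella novicida U112": (0.048, 0.320, 0.074),
    "Mycobacterium tuberculosis H37Rv": (0.000, 0.709, 0.000),
    "Salmonella typhimurium LT2": (0.003, 0.509, 0.005),
    "Staphylococcus aureus N315": (0.005, 0.287, 0.017),
}


def summarize(table, alpha=ALPHA):
    """Count trends and normality decisions across organisms."""
    counts = {
        "essential_higher": 0,        # essential > full-gene-set
        "nonessential_higher": 0,     # nonessential > full-gene-set
        "nonessential_unchanged": 0,  # equal at the reported precision
        "normal_full": 0,             # hypothesis not rejected, full-gene-set
        "normal_essential": 0,        # hypothesis not rejected, essential
        "normal_nonessential": 0,     # hypothesis not rejected, nonessential
    }
    for full, essential, nonessential in table.values():
        counts["essential_higher"] += essential > full
        counts["nonessential_higher"] += nonessential > full
        counts["nonessential_unchanged"] += nonessential == full
        counts["normal_full"] += full >= alpha
        counts["normal_essential"] += essential >= alpha
        counts["normal_nonessential"] += nonessential >= alpha
    return counts


for key, value in summarize(SIGNIFICANCE).items():
    print(key, value)
```

The tally shows the essential genes exceeding the full-gene-set in eight of nine organisms, with the normal distribution hypothesis not rejected for the essential genes of all nine, compared with two organisms for the full-gene-set and three for the nonessential genes.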
We found that in most of the analyzed organisms the significance levels of the Hurst exponents of the essential genes were markedly higher than those of the corresponding full-gene-set, indicating closer agreement with a normal distribution. Correspondingly, in most analyzed organisms, the significance levels of the Hurst exponents of the nonessential genes, and hence their agreement with a normal distribution, remained unchanged or increased only slightly. One organism that did not follow this trend was M. genitalium strain G37. We propose that the different features observed in this organism derive from its small genome. Since the majority of free-living organisms contain much larger genomes, we suggest that this feature could be used as a restraint on bacterial essential genes in pre- or post-design checking, especially in computer-aided design.
Our study represents the first stage of an investigation into the common features of essential genes and many issues remain to be resolved. For instance, the necessity of many genes remains to be determined,18,19 and these findings may affect the results of statistical analyses. In this study, we used only nine prokaryotes from the DEG for our analysis. Further studies using a larger number of organisms, with more detailed essential gene information, are now needed to verify and improve the proposed method. In this regard, improvements in the essential gene database will aid this research further.
This journal is © The Royal Society of Chemistry 2012