Jian Li (a), Ding-Ping Bai (a) and Xi-Feng Zhang* (b)
(a) Fujian Key Laboratory of Traditional Chinese Veterinary Medicine and Animal Health, Fujian Agriculture and Forestry University, Fuzhou 350002, China
(b) College of Biological and Pharmaceutical Engineering, Wuhan Polytechnic University, Wuhan 430023, China. E-mail: zhangxf9465@163.com
First published on 10th July 2019
Because no reference genome is available for Panax japonicus var. major, we assembled a reference transcriptome from P. japonicus C. A. Mey transcriptome sequencing data and obtained 203283 unigenes. Using the Trinity, Bowtie2 and SAMtools software, we identified 218465 single nucleotide polymorphisms (SNPs) by mapping the Illumina sequences to the reference transcriptome. The SNPs comprised 126262 transitions and 92203 transversions. A large number of SNP loci were associated with triterpenoid saponin synthesis: 54 SNPs were associated with cytochrome P450, one with glycosyl transferase and 94 with the biosynthesis of the triterpenoid saponin backbone.
The availability of high-throughput sequencing methods has led to the discovery of thousands to millions of SNPs in diverse organisms, particularly humans, model experimental organisms and agriculturally important plants and animals. Since SNPs provide a powerful tool for the discovery of high-risk groups, identification of disease genes, design and testing of drugs and basic biological research, they have become important in the application of the Human Genome Project.3
Panax japonicus var. major, a perennial herb in the Araliaceae family, is mainly distributed in Shaanxi, Gansu, Anhui, Zhejiang, Jiangxi, Fujian, Hunan, Hubei, Guangxi, Tibet and other places. It is a traditional medicine widely used in China. It promotes blood circulation, has anti-inflammatory and anti-oxidant activities, and is also used for hemostasis. Moreover, it is used to treat a variety of diseases, as documented in the Pharmacopoeia of the People's Republic of China.4 Panax japonicus var. major, P. ginseng C. A. Mey, P. quinquefolius and P. pseudoginseng are closely related, and all share similar chemical compositions, mainly saponins, polysaccharides, volatile oils, amino acids, trace elements and many other types of active components.5
Our main purpose was to develop and examine SNP markers for P. japonicus var. major in order to enhance and accelerate its breeding via genomic selection. To date, no complete reference sequence is available for P. japonicus var. major, although one is needed to identify SNP loci in genes of interest. We therefore used the transcriptome data of the closely related traditional Chinese medicinal plant P. japonicus as the reference sequence for mapping P. japonicus var. major.6
 | All (≥200 bp) | ≥500 bp | ≥1000 bp | N50 | N90 | Total length | Max length | Min length | Average length |
---|---|---|---|---|---|---|---|---|---|
Transcript | 531296 | 296826 | 173132 | 1472 | 380 | 488565172 | 16000 | 201 | 919.57 |
Unigene | 188914 | 62315 | 29425 | 941 | 258 | 117148035 | 16000 | 201 | 620.11 |
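
The N50 and N90 columns above are length-weighted summary statistics of an assembly. As a minimal sketch of how such values are commonly computed (the function name `nx` and the toy lengths are illustrative assumptions, not taken from the study or from Trinity's own scripts):

```python
def nx(lengths, fraction):
    """Return the Nx statistic: the length L such that sequences of
    length >= L together cover at least `fraction` of all assembled bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    target = sum(lengths) * fraction
    running = 0
    # Walk from the longest sequence downwards, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length


# Toy sequence lengths in bp (hypothetical, not data from the study).
toy_lengths = [201, 300, 450, 800, 1200, 2500]
n50 = nx(toy_lengths, 0.5)
n90 = nx(toy_lengths, 0.9)
print(n50, n90)
```

Because the statistic is weighted by length, a few long sequences can raise N50 well above the average length, which is consistent with the gap between the N50 and average-length columns in the table.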
Database | Match number | Match rate (%) |
---|---|---|
Total unigenes | 62240 | 100.00 |
Nr | 32003 | 51.42 |
Swiss-Prot | 18455 | 29.65 |
KOG | 10381 | 32.44 |
GO | 21278 | 34.19 |
KEGG | 8369 | 13.45 |
Unknown | 30237 | 48.58 |
To eliminate the influence of differences in gene length and sequencing depth, RPKM (reads per kilobase of transcript per million mapped reads) was used to calculate the level of gene expression. The raw number of mapped reads is not a direct measure of expression: under random sampling, longer gene sequences are more likely to be sampled than shorter ones, so sequencing would falsely indicate higher expression of longer genes. The RPKM value corrects for this bias in the P. japonicus transcriptome expression estimates. Thus, SNP loci in unigenes with RPKM ≥ 3 were considered for validation.8 Based on the Qual value in the VCF file from SNP calling and the RPKM value, 10 variant loci were selected for verification.

Because reference genome information was lacking, constructing the transcriptome model was very difficult. In particular, the accuracy and length of the assembled sequences depend greatly on the sequencing depth. Medicinal plant transcriptome studies have been launched in China; however, the sequencing depth and the number of genes detected still need to be improved. In this study, we sequenced rhizomes of P. japonicus var. major, given the limited transcriptome data for the genus Panax and the absence of a public reference sequence for SNP calling. Panax japonicus var. major is a variety of the traditional Chinese medicinal plant P. japonicus; therefore, we used P. japonicus unigenes as the reference sequences for P. japonicus var. major, which allows differences between the transcripts to be detected.9,10 The Bowtie2 software was used to align the P. japonicus var. major sequences to the P. japonicus unigenes, and the result indicated an average matching rate of around 70%.

As a congener, P. japonicus var. major belongs to an ancient group within Panax, and its evolutionary relationship with other species in this group, such as P. ginseng, American ginseng and P. notoginseng, has been established. Although the biosynthesis of saponins is not fully understood, the genes with key functions are quite similar at the sequence level. Squalene epoxidase (comp164607_c0_seq3), dammarenediol-II-synthase (comp159106_c2_seq21) and beta-amyrin synthase (comp158446_c0_seq8) genes are important genes of P. japonicus var. major. Their DNA sequences showed 99, 99 and 98% identity, respectively, with those of P. ginseng; 99, 99 and 94% with those of American ginseng; and 95, 95 and 95% with those of P. notoginseng. However, cytochrome P450 and glycosyl transferase, which have modifying functions, exist in plants as gene families, and there is no strict sequence conservation between species. For example, the 18 known glycosyl transferase genes showed only 63% sequence identity among ginseng, American ginseng and P. notoginseng. Such differences in the regulation of saponin metabolism among plants are common. Thus, a matching rate of around 70% was considered acceptable.
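
The length and depth correction that RPKM applies can be illustrated with a short sketch. It assumes the standard definition (reads per kilobase of transcript per million mapped reads); the function name, the unigene labels and the read counts are hypothetical and are not values from this study.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical library size and genes, for illustration only.
total_reads = 10_000_000
short_gene = rpkm(read_count=100, gene_length_bp=1000, total_mapped_reads=total_reads)
long_gene = rpkm(read_count=200, gene_length_bp=2000, total_mapped_reads=total_reads)

# The longer gene collects twice as many reads but has the same RPKM.
print(short_gene, long_gene)

# Keeping only unigenes above an expression cutoff (the cutoff here is an
# illustrative parameter, not the one used by the authors).
expression = {"unigene_a": short_gene, "unigene_b": 0.5}
cutoff = 5.0
expressed = {name: v for name, v in expression.items() if v >= cutoff}
```

The two toy genes differ twofold in raw read count yet receive the same RPKM, which is the bias toward longer genes that the normalization is meant to remove.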
Using the SAMtools software, a total of 371358 variant loci were detected in the large-scale high-throughput transcriptome sequencing data of P. japonicus var. major. To ensure the accuracy of the SNP loci, candidate SNPs were required to have a coverage across the two transcripts greater than the sum of 20 contigs and at least 5 bp of conserved sequence on both sides of the locus. Screening under these conditions yielded a total of 218465 SNP loci, including 126262 transitions and 92203 transversions.11 Among the transitions, T/C and A/G accounted for 63165 and 63097 loci, respectively. Among the transversions, A/T, G/T, A/C and G/C accounted for 29574, 22910, 22750 and 16969 loci, respectively (Fig. 2). The functional annotations of the unigenes carrying the 10 selected SNP loci are summarized in Table 3. In the Nr database, six of the 10 SNP loci were functionally annotated as cytochrome P450, which is a terminal oxygenase and participates in the synthesis of endogenous sterol hormones.12 Cytochrome P450 can affect metabolism and pharmacodynamics, and SNPs can cause an alteration (loss or gain) of its functions.13 SNPs in cytochrome P450 genes are therefore likely to be a key factor influencing the therapeutic function of P. japonicus var. major.
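
The transition and transversion totals above can be checked from the per-substitution counts. The sketch below uses the counts reported in this paragraph; the helper `classify_snp` and the derived transition/transversion ratio are illustrative additions and are not reported by the authors.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snp(base1, base2):
    """Classify a substitution: purine<->purine or pyrimidine<->pyrimidine
    is a transition; purine<->pyrimidine is a transversion."""
    pair = {base1.upper(), base2.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid SNP: {base1}/{base2}")
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Per-substitution counts as reported in the text.
counts = {
    "T/C": 63165, "A/G": 63097,
    "A/T": 29574, "G/T": 22910, "A/C": 22750, "G/C": 16969,
}

totals = {"transition": 0, "transversion": 0}
for pair, n in counts.items():
    first, second = pair.split("/")
    totals[classify_snp(first, second)] += n

ts_tv = totals["transition"] / totals["transversion"]
print(totals, round(ts_tv, 2))
```

The per-type counts sum to the reported 126262 transitions and 92203 transversions, giving a transition/transversion ratio of roughly 1.37.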
Query name | SNP form | Annotation | RPKM |
---|---|---|---|
Comp179123_c0_seq1_zzs | T/G | Cytochrome P450 (Panax notoginseng) | 27.4 |
Comp179123_c0_seq1_zzs | C/A | Cytochrome P450 (Panax notoginseng) | 27.4 |
Comp171079_c0_seq17_zzs | T/G | Cytochrome P450 (Panax ginseng) | 38.2 |
Comp171079_c0_seq17_zzs | A/G | Cytochrome P450 (Panax ginseng) | 38.2 |
Comp170147_c0_seq6_zzs | C/A | Cytochrome P450 (Panax notoginseng) | 4.65 |
Comp167639_c0_seq17_zzs | A/G | HMG-CoA (Eleutherococcus senticosus) | 12.17 |
Comp167465_c1_seq6_zzs | T/A | Squalene synthase (Panax notoginseng) | 19.44 |
Comp178663_c2_seq2_zzs | T/A | Acetyl-CoA,C-acetyltransferase protein (Camellia oleifera) | 77.47 |
Comp178663_c2_seq2_zzs | G/A | Acetyl-CoA,C-acetyltransferase protein (Camellia oleifera) | 77.47 |
Comp160382_c0_seq3_zzs | A/C | Cytochrome P450 (Panax notoginseng) | 6.49 |
One of the SNP loci was annotated as squalene synthase in the Nr database. In Panax, the syntheses of triterpenoid saponins, sterols, cholesterol and other terpenes all proceed through catalysis by squalene synthase.
The gene chip method is also currently popular; a gene chip uses probes of known sequence to obtain mRNA sequence information from a sample by hybridization. To date, the mRNA of P. japonicus var. major has not been reported, and new mRNA cannot be detected without the corresponding gene chip probe sequences. In addition, P. japonicus var. major has no reference sequence; thus, we compared its gene sequences with those of P. japonicus. This not only identified a large number of SNP loci, but also provided evidence of the genetic and evolutionary relationship between the two plants.
At present, many SNP calling tools are available, including GATK and SAMtools. GATK gives the best support for describing SNP results in VCF files, whereas SAMtools offers more powerful functionality. For SNP calling, SAMtools and GATK can be used together to improve accuracy.
Whether for variant calling or ChIP-seq, the first step of data analysis is to align the reads to the genome, and all downstream work depends on the reads being mapped successfully. However, some reads cannot be placed on the reference genome and are termed ‘unmapped reads’; this arises from individual differences, differences between reference genomes and the quality of the reference genome itself. Normally, these data are discarded, but they contain a large amount of sequence information and are worth analyzing. After an individual's data have been compared with the reference genome, the unmapped reads can be assembled locally with assembly software, and the resulting contigs can be compared against databases for gene structure prediction and gene functional annotation. The analysis process is shown in Fig. 3. Mace et al. identified new genes in sorghum through the assembly of unmapped reads, and their results showed that guinea-margaritiferum is a sorghum group that contains many new genes, has unique genetic diversity and is of great research value.14 Our data contained many unmapped reads, and further analysis will be conducted to extract more valuable information from them.
Many variant calling programs are available, and SAMtools and GATK (Genome Analysis Toolkit) are currently the most popular for discovering SNPs and INDELs (insertions and deletions).21,22 GATK was developed in the context of the 1000 Genomes Project for genome analysis. The SAMtools software has two parts, namely, SAMtools and BCFtools; the former has a subcommand ‘mpileup’ that summarizes the aligned bases at each site of the reference sequence and produces a BCF file, which BCFtools then uses for SNP/INDEL calling.7 With mpileup, the parameter ‘-g/-v’ generates a BCF file (the binary form of a VCF (variant call format) file). If this parameter is not used, mpileup generates a text file that summarizes the alignment information at each base site of the reference sequence. BCFtools was used to filter the variant results and obtain a reliable call set. A custom program can also be written to filter on the quality (QUAL) value in the sixth column of the VCF file.
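
A custom filter on the sixth (QUAL) column of a VCF file, as mentioned above, might look like the following minimal sketch. The function name, the cutoff of 20 and the toy records are hypothetical; this is not the authors' program, and it only assumes the tab-separated VCF layout in which QUAL is the sixth field.

```python
def filter_vcf_by_qual(lines, min_qual):
    """Yield header lines unchanged, plus variant records whose QUAL
    (sixth tab-separated column) is at least `min_qual`."""
    for line in lines:
        if line.startswith("#"):
            yield line  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[5] == ".":
            continue  # malformed record or missing QUAL value
        if float(fields[5]) >= min_qual:
            yield line


# Toy VCF content (hypothetical records, not data from the study).
toy_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "unigene_1\t120\t.\tT\tC\t45.0\t.\tDP=30\n",
    "unigene_1\t310\t.\tA\tG\t12.5\t.\tDP=8\n",
    "unigene_2\t57\t.\tG\tT\t.\t.\tDP=5\n",
]

kept = list(filter_vcf_by_qual(toy_vcf, min_qual=20.0))
print("".join(kept), end="")
```

Because the function accepts any iterable of lines, it can be applied directly to an open file handle, so a large VCF file does not need to be loaded into memory.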
This journal is © The Royal Society of Chemistry 2019