Stefan Graw,a Kevin Chappell,a Charity L. Washam,ab Allen Gies,a Jordan Bird,a Michael S. Robeson II*c and Stephanie D. Byrum*ab
aDepartment of Biochemistry and Molecular Biology, University of Arkansas for Medical Sciences, 4301 West Markham Street (slot 516), Little Rock, AR 72205-7199, USA. E-mail: sbyrum@uams.edu; Fax: +1 501 526 7008; Tel: +1 501 686 5783
bArkansas Children's Research Institute, 13 Children's Way, Little Rock, AR 72202, USA
cDepartment of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA. E-mail: mrobeson@uams.edu; Fax: +1 501 526 5964; Tel: +1 501 526 4242
First published on 21st December 2020
With the advancement of next-generation sequencing and mass spectrometry, there is a growing need for the ability to merge biological features in order to study a system as a whole. Features such as the transcriptome, methylome, proteome, histone post-translational modifications and the microbiome all influence the host response to various diseases and cancers. Each of these platforms has technological limitations due to sample preparation steps, the amount of material needed for sequencing, and sequencing depth requirements. Each feature provides a snapshot of one level of regulation in a system. The obvious next step is to integrate this information and learn how genes, proteins, and/or epigenetic factors influence the phenotype of a disease in the context of the system. In recent years, there has been a push for the development of data integration methods. Each method specifically integrates a subset of omics data using approaches such as conceptual integration, statistical integration, model-based integration, networks, and pathway data integration. In this review, we discuss considerations of the study design for each data feature, the limitations in gene and protein abundance and their rates of expression, the current data integration methods, and microbiome influences on gene and protein expression. The considerations discussed in this review should be taken into account when developing new algorithms for integrating multi-omics data.
Proper integration of multi-omics approaches has allowed deeper insights into disease etiology, such as unveiling the myriad ways in which the microbiome may mitigate or enhance disease risk. One such case concerns the incomplete breakdown of bisphenol A (BPA), a mass-produced chemical that is widely used in food packaging, plastics, and resins. BPA has become a growing public health concern because it is an endocrine disruptor (as reviewed in Yu, 20193). Thus, research into the fast and complete degradation of BPA and other compounds via microbial means is of great interest. Yu and colleagues (2019)3 were able to effectively combine multi-omics data to analyze a microbial community's ability to break down BPA products. Though prior research had already established the microbes' ability to break down BPA, the interactions that allowed this reaction were as yet unknown. Through a clever multi-omics design, the authors used three major types of integrated analyses to identify differences in encoded and expressed microbial functions involved in the BPA-degrading microbial community.3
In another example, Poore et al. (2020) leveraged multi-omics and machine learning tools to detect microbial biomarkers from blood and tissues, serving as a great example of microbiome-informed oncology.4 Here the research team was able to discriminate between healthy, cancer-free individuals and patients with cancer, as well as among multiple cancer types, using plasma-derived, cell-free microbial nucleic acids. Finally, we refer the reader to other reviews about the importance of integrating microbes into multi-omics studies.5–10
There is a growing appreciation for multi-omics studies in the context of therapeutic treatments. However, the methodologies are challenging for a variety of reasons. Each biological regulatory feature has technical hurdles to overcome related to sample preparation, sequencing platforms and depth, limits in instrumentation, and dynamic range.7,11 New data integration algorithms are being developed at a rapid pace. In this review, we discuss the background of cellular processes, current data integration methodologies, considerations for multi-omics study design, and future directions.
Different organisms have varying numbers of genes and proteins. For instance, there are approximately 4300, 6000, and 25 000 genes in the E. coli, S. cerevisiae, and H. sapiens genomes, respectively.12 This leads to approximately 2400 to 7800, 15 000, and 300 000 mRNA molecules per cell for E. coli,13 S. cerevisiae,14 and H. sapiens,15 respectively. Mitochondrial transcripts can account for approximately 20% of polyadenylated RNA. Other highly abundant transcripts include those that encode ribosomal proteins and proteins involved in energy metabolism.16 It is important to note that sequencing platforms capture only a fraction of all transcripts in a sample, and the potentially large number of transcript isoforms generated by alternative splicing presents another challenge when integrating gene- and protein-level expression.17 Transcript isoforms may also change across biological conditions.18 An overview of the complexity of DNA, DNA methylation, histone post-translational modifications, mRNA, and proteins in humans is depicted in Fig. 1.
The estimated number of proteins in a cell is around 2.36 × 10⁶ in E. coli and about 2.3 × 10⁹ in H. sapiens HeLa cells.19 Within the vast number of total proteins in a cell, the most abundant proteins can make up 5–10% of protein content and consist of ribosomal proteins, acyl carrier protein (ACP) (which functions in fatty acid biosynthesis), chaperones and folding catalysts, proteins of glycolysis (the backbone of energy and carbon metabolism), and structural proteins such as actin. Transcription factors are low-abundance proteins, ranging from 1–10³ copies per cell in bacteria and 10³–10⁶ in mammalian cells. The most abundant proteins usually have many thousands of copies in bacteria and many millions in mammalian cells. The number of genes regulated by a transcription factor depends on its concentration.19 The protein content depends on the growth conditions and gene induction. Finally, this can become more complicated given the ratio of microbial-to-host cell count, which can depend on host cell type, among other factors.20
Sequencing technologies for the various omics platforms capture only a snapshot of what is happening in a population of cells at one point in time due to limitations in instrument detection, dynamic range, and the lifetime expression of the molecules. For instance, the lifetimes of mRNA transcripts and proteins are vastly different. The median lifetime of an mRNA is 5 min in E. coli, 20 min in budding yeast, and 600 min in H. sapiens.19 The lifetime of proteins, however, is approximately 1–2 days. The rates of transcription and translation also vary among organisms (E. coli: 10–100 nucleotides (nt) per second (s) and 10–20 amino acids (aa) per s; H. sapiens: 6–70 nt per s and 2 aa per s, for transcription and translation respectively) (Fig. 1). In E. coli, a single mRNA transcript can give rise to 10–100 proteins before being degraded. Given this information, there will be an increased chance of detecting molecules with a longer life span, complicating our choice of omics platforms and the resulting interpretations of cellular processes.
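To make these rates concrete, consider the following back-of-the-envelope sketch. The 1 kb gene length is an illustrative assumption, and the rates are midpoints of the ranges quoted above, not values from any specific study:

```python
# Back-of-the-envelope timing for expressing a hypothetical 1 kb coding
# sequence (~333 codons), using midpoints of the per-organism rates above.

rates = {
    # organism: (transcription nt/s, translation aa/s)
    "E. coli":    (50, 15),   # midpoints of 10-100 nt/s and 10-20 aa/s
    "H. sapiens": (38, 2),    # midpoint of 6-70 nt/s and ~2 aa/s
}

gene_nt = 1000              # assumed coding sequence length
protein_aa = gene_nt // 3   # one amino acid per codon

for organism, (txn_rate, tln_rate) in rates.items():
    t_transcribe = gene_nt / txn_rate     # seconds to produce one mRNA
    t_translate = protein_aa / tln_rate   # seconds to produce one protein
    print(f"{organism}: ~{t_transcribe:.0f} s to transcribe, "
          f"~{t_translate:.0f} s to translate one copy")
```

Even for this toy gene, a human cell spends minutes producing a single protein copy from a transcript with a roughly 600 min median lifetime, while the corresponding E. coli mRNA may be degraded within ~5 min of its synthesis.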
It is important to recognize the biological complexity of organisms, the dynamic range of molecules, sequencing limitations, and the expression lifetimes of those molecules when designing a data integration study, developing a new algorithm, or interpreting the results.
Microbiota and their metabolites can affect the host epigenetic landscape by directly modifying histones, altering DNA methylation profiles, and influencing the nature of noncoding RNAs (Fig. 1). For example, microbiota can modify histones by altering the activity of histone modification enzymes and the levels of those enzymes' substrates.27–29
Microbiota can also affect the therapeutic nature of drugs. Many prodrugs, i.e. drugs that must be metabolically converted in order to become pharmacologically useful, may remain inactive (i.e. when the microbiota that mediate the conversion of the prodrug to its active form are not present), or the drug/prodrug may not become bioavailable to the host as a result of degradation by the host's microbiota.30 Moreover, patients taking NSAIDs (non-steroidal anti-inflammatory drugs) may promote the preponderance of antibiotic-resistant bacteria, as 24% of tested over-the-counter NSAIDs inhibited the growth of at least one microbe in vitro.31 These metabolic effects raise concerns about potential side effects of therapeutic drugs, or of other diet and treatment regimens, intended for use in humans and agricultural systems. For example, antibiotics can eliminate histone deacetylase (HDAC) inhibitor-producing microbes. These microbes, when present, can augment regulatory T (Treg) cells, which aid in anti-inflammatory processes.32
This means that the diversity of microbial metabolic pathways, and their impact on drug pharmacokinetics and pharmacodynamics,33,34 may partly explain the variation in drug responses between individuals and populations. Therefore, therapeutic treatments that involve the microbiome may have to be regionally tailored.30,35,36
Histones can undergo both variant replacement and post-translational modification (PTM); together these form the "histone code". These local arrangements can affect chromatin structure in ways that activate or repress transcriptional activity.37,38 Thus microbes, through diet, have the ability to modify the methylation and PTM profiles of the host, and can also affect the generation of short-chain fatty acids (SCFAs) through the fermentation of dietary carbohydrates. SCFAs, such as butyrate and acetate, can inhibit deacetylase activity, meaning that chromatin structure becomes increasingly relaxed due to acetylation, driving increased transcriptional activity.26 In fact, it has been shown that microbes can affect host tissue acetylated and methylated chromatin states in a site-specific and combinatorial fashion, and even impact host developmental and metabolic phenotypes.37–39
Modelling the development of the microbiome and the commensurate ontogenetic changes of the host is increasingly being considered when trying to interrogate host health and therapeutics.40 Many microbial ecological principles, such as community assembly, are being brought to bear to investigate these processes.41,42 These changes can be exemplified through host immune maturation, considering that the host immune system must be able to recognize not only "self" antigens but also those of symbiotic microbes. How microbes influence the expression of the major histocompatibility complex (MHC), or how host heterozygosity in turn affects the diversity of the microbiota through MHC, is largely unknown and an active area of study.26,43 Finally, the roles of microbes in cancer and immune treatments are increasingly becoming targets for the development of therapeutic strategies.44,45
Proteomics, in combination with other omics strategies, has been used to interrogate disease processes. However, if we do not take into account the effects of microbiota (i.e. the entirety of the holobiont), then we may miss meaningful insights needed to develop potential therapeutic treatments, particularly those related to metabolic disorders (e.g. obesity) or the systemic effects of metabolites (e.g. bile acids) on organ systems.46 There is far more variation in our "second genome" that can be leveraged for human benefit than in our own.47
Resource | Data type | Link | Ref. |
---|---|---|---|
SILVA is a resource of databases of aligned ribosomal RNA (rRNA) gene sequences from the bacteria, archaea and eukaryota domains. | Gene sequences of 16S for prokaryotes and 18S for eukarya | https://www.arb-silva.de/ | 121 |
Ribosomal Database Project: aligned and annotated rRNA gene sequence data | 16S rRNA sequences | http://rdp.cme.msu.edu/ | 122 |
Greengenes is a dedicated full-length 16S rRNA gene database that provides users with a curated taxonomy based on de novo tree inference. | Taxonomy based on the 16S rRNA gene | https://greengenes.secondgenome.com/ | 123 |
Genome Taxonomy Database is an initiative to establish a standardized microbial taxonomy based on genome phylogeny. The genomes used to construct the phylogeny are obtained from RefSeq and GenBank. | A comprehensive and phylogenomic-based taxonomy for bacterial and archaeal taxa | https://gtdb.ecogenomic.org/ | 52 and 53 |
Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data | Protein sequence and annotation database | https://www.uniprot.org/ | 124 |
NIH National Center for Biotechnology Information (NCBI) GenBank is an annotated collection of all publicly available DNA sequences. Complete bimonthly release updates are available. Data are exchanged daily with the DNA DataBank of Japan and the European Nucleotide Archive. | Genomic sequence and annotation | https://www.ncbi.nlm.nih.gov/genbank/ | 125 |
NIH/NCBI Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins | Genomic, transcriptomics, and proteomic sequence and annotation | https://www.ncbi.nlm.nih.gov/refseq/ | 126 |
University of California Santa Cruz (UCSC) Genome Browser for exploring genome sequences and annotation. GenBank updates for mRNA, RefSeq, and EST data occur on a semi-quarterly basis. | Genome sequence and annotation database | http://genome.ucsc.edu/ | 127 |
NIH National Human Genome Research Institute Encyclopedia of DNA Elements (ENCODE) Consortium project uses Reference Genomes from NCBI or UCSC | DNA methylation, and immunoprecipitation (IP) of proteins that interact with DNA and RNA, modified histones, transcription factors, chromatin regulators, and RNA-binding proteins. Genome sequence and annotation database. | https://www.encodeproject.org/ | 128 |
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Updates are released every 2–3 months. | Genome sequence and annotation, gene models, transcriptional data, genetic variation and comparative analysis | http://ensembl.org/ | 129 |
The Cancer Genome Atlas (TCGA) is a landmark cancer genomics program that molecularly characterized over 20 000 primary cancer and matched normal samples spanning 33 cancer types. This is a joint effort between the National Cancer Institute and the National Human Genome Research Institute. | Individual patient tumor samples: DNA, RNA, protein, epigenetic changes | https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga | 130 |
Cancer Cell Line Encyclopedia (CCLE) is a collaboration between the Broad Institute and the Novartis Institutes for Biomedical Research and its Genomics Institute of the Novartis Research Foundation to conduct a detailed genetic and pharmacologic characterization of a large panel of human cancer models. CCLE contains genomics data and visualization for over 1400 cell lines. | Copy number, mRNA expression (Affy), RPPA, RRBS, and mRNA expression (RNA-seq) | https://portals.broadinstitute.org/ccle | 131 |
Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is a community resource project. TARGET is organized into a collaborative network of disease-specific project teams with the goal of identifying molecular changes that drive childhood cancers. | Clinical information, gene expression, miRNA expression, copy number, sequencing data for cancers | https://ocg.cancer.gov/programs/target | Initiative phs000218 |
Omics Discovery Index (OmicsDI) an open-source platform that enables access, discovery and dissemination of omics data sets. | Genomics, transcriptomics, proteomics, metabolomics | https://www.omicsdi.org/ | 132 |
Multi-Omics Profiling Expression Database (MOPED) is a repository for multi-omics data of human and model organisms. | Transcriptomics and proteomics data and visualization | https://omictools.com/moped-tool | 133 |
ProteomeXchange (PX) Consortium consists of PRIDE, PeptideAtlas, PASSEL, MassIVE and jPOST. Devoted to mass spectrometry (MS)-based proteomics data. | Proteomics data sets | http://www.proteomexchange.org/ | 134 and 135 |
Apart from selecting a marker gene and appropriate database, researchers also have a choice of sequencing methods and platforms. Due to limitations of short-read platforms such as Illumina and Ion Torrent, researchers must select among variable regions of the ∼1500 bp 16S rRNA gene. Each variable region provides a different level of sensitivity and specificity depending on microbial community composition. This is why preliminary amplicon surveys often compare a collection of primer sets and variable regions; the combination of primer set and amplicon region that best differentiates among the common taxa in the study is then chosen. Alternative long-read platforms have recently been adapted to deliver high-throughput full-length 16S rRNA sequences for researchers who need taxonomic resolution beyond the genus to family level typically provided by short-read technologies.55
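As a sketch of how such an in silico primer screen might begin, the snippet below expands IUPAC degeneracy codes into regular expressions and tests whether a reference sequence contains both primer sites. The 515F/806R (V4) pair is used as a common example; the toy reference fragment is fabricated for illustration, and a real screen would scan a curated reference database:

```python
import re

# Expand IUPAC degeneracy codes into regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]",
         "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
         "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}

def primer_to_regex(primer: str) -> re.Pattern:
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer))

# 515F and the reverse complement of 806R (the V4-targeting pair),
# as they would appear on the sense strand of a reference sequence.
fwd = primer_to_regex("GTGYCAGCMGCCGCGGTAA")
rev = primer_to_regex("ATTAGAWACCCBNGTAGTCC")

def both_primers_hit(reference: str) -> bool:
    """Toy in-silico PCR: do both primer sites occur in the reference?"""
    return bool(fwd.search(reference)) and bool(rev.search(reference))

# Fabricated reference fragment containing both primer sites.
toy = "AAAA" + "GTGTCAGCAGCCGCGGTAA" + "ACGT" * 10 + "ATTAGATACCCTGGTAGTCC"
print(both_primers_hit(toy))  # True
```

Counting, per primer set, how many reference taxa are hit and how well the intervening region separates them is the essence of the comparison described above.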
Current metagenomic analysis techniques allow researchers to obtain partial and complete draft genomes from environmental/host-derived samples given sufficient sequence coverage. This coverage factor is highly dependent on species evenness and richness. Researchers using these techniques can investigate potential functional differences across a collection of metagenome-assembled (draft) genomes. However, they often have to use concatenated universal proteins to place these genomes in a phylogenetic context, because of the difficulty of assembling and correctly binning highly conserved genes like the ribosomal subunit genes. Combined universal marker genes are used to construct the phylogeny from genomes assembled from environmental and host-derived sequences, along with a minority of familiar microbial genomes from culture collections. The rapid growth of microbial genome sequencing has necessitated the construction of easy-to-use wrappers and pipelines to aid biologists in learning how to approach the analysis of their metagenome data, either in whole or in part. Some great examples of such tools are QIIME 2,56 metaWRAP,57 Sunbeam,58 SqueezeMeta,59 metAMOS,60 mg-RAST,61 IMG/M,62 Anvi'o,63 MicrobiomeAnalyst,64 and the variety of tools within the bioBakery65 collection (e.g. MetaPhlAn2,66 PhyloPhlAn,67 HUMAnN,68 LEfSe69), among others (see ref. 70 for a review of these and many other meta'omics tools). Biomedical researchers wading into the depths of microbial ecology looking to integrate disease metrics, host proteomics, and microbial diversity should be aware of the various databases, their curatorial rigor, and the limitations of the sequencing platform they choose.
Recommended coverage (×) or number of reads in millions (m), by platform:
Application | Illumina | PacBio | Nanopore | Ref. |
---|---|---|---|---|
Whole genome sequencing | >15× | >35× | >40× | 75 and 137 |
Whole exome sequencing | >15× | >35× | >40× | 75 and 137 |
Transcriptome sequencing (mRNA; differential expression analysis) | 10–30m | >30m | >30m | 138 and 139 |
Transcriptome sequencing (alternative splicing; allele specific expression) | 50–100m | 50–100m | 50–100m | 139 |
miRNA sequencing | >30m | >30m | >30m | 138 |
16S metagenomics | >100× | >100× | >100× | |
Shotgun metagenomics | >80m | >80m | >80m | 140 |
Histone ChIP-seq | >20m for narrow peak, >45m for broad peak | >20m for narrow peak, >45m for broad peak | >20m for narrow peak, >45m for broad peak | 141 |
Transcription factor ChIP-seq | >20m for narrow peak, >45m for broad peak | >20m for narrow peak, >45m for broad peak | >20m for narrow peak, >45m for broad peak | 141 |
ATAC-seq | >25m | >25m | >25m | 141 |
DNA methylation sequencing (RRBS per strand) | >15× | >15× | >15× | 142 |
Error rates and read lengths vary between DNA sequencing technologies. Illumina short-read sequencing (e.g. HiSeq, MiniSeq) typically has very low error rates, at about 0.25% per base, but is sensitive to low-diversity libraries, as is the case with applications such as 16S metagenomics and targeted gene approaches. Long-read technologies have higher error rates, ranging from 13–15% for PacBio and 5–20% for Oxford Nanopore instruments.75,76 Reads on Illumina platforms have a maximum length of 600 bases, whereas long-read technologies commonly achieve 10–30 kb for a single read.77 Optimal read length is also dependent on the application. While most sequencing experiments can collect suitable information with 150–300 base pair read lengths, there are exceptions; Illumina's 16S Metagenomics protocol, for example, requires 2 × 300 base pair reads. For whole genome sequencing (WGS), the longest read possible is optimal, but with long-read technologies the error rate increases with read length. Many researchers have combined long-read and short-read sequencing to "fill gaps" in WGS. Due to the fairly recent advent of long-read sequencing technology, information on optimal long-read lengths for applications other than WGS is sorely lacking, whereas Illumina short-read sequencing is rich in optimal read length recommendations.78
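For intuition, per-read expected error counts and their Phred-scale equivalents follow directly from these per-base rates. A minimal sketch, where the long-read figure is an illustrative value within the ranges quoted above rather than a benchmark for any specific instrument:

```python
import math

def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of erroneous bases in a read of the given length."""
    return read_length * error_rate

def phred(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred quality score."""
    return -10 * math.log10(error_rate)

# Representative figures from the text: Illumina ~0.25% per base on a
# 300 b read versus a long-read instrument at ~10% on a 10 kb read.
profiles = {
    "Illumina 300 b": (300, 0.0025),
    "Long read 10 kb": (10_000, 0.10),
}
for name, (length, rate) in profiles.items():
    print(f"{name}: ~{expected_errors(length, rate):.1f} expected errors, "
          f"Q{phred(rate):.0f} per base")
```

The same 0.25% rate that yields under one expected error in a short read produces roughly a thousand in a raw 10 kb long read, which is why long-read workflows lean so heavily on consensus polishing.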
Long-read sequencing technologies such as ONT and PacBio have already ushered in significant improvements in both the amplicon and metagenomic sequencing space, from high-resolution analysis of the full-length 16S gene55 and the entire rRNA operon79 to improving the ability to close entire microbial genomes.80 For an in-depth overview of these long-read sequencing technologies, see Amarasinghe et al.77
Mass spectrometers have also improved in sequencing depth over the past 5–10 years. The technology has advanced from sequencing roughly 3000 proteins in a cell line experiment using older LTQ mass spectrometers to routinely sequencing 8000–10 000 proteins using newer Orbitrap Lumos and Orbitrap Eclipse instruments. Most proteomics experiments are performed in data-dependent acquisition (DDA) mode. In this method, the 20 most abundant peptides in the MS1 scan eluting from a liquid chromatography (LC) column are selected for fragmentation in the Orbitrap in order to generate the peptide sequence MS2 scan. The complexity of the sample mixture strongly influences the sequencing depth and how many proteins will be identified, so understanding the protein abundance and make-up of the samples is critical. If transcription factors are the target molecules, then some method of removing highly abundant proteins prior to mass spectrometry may be necessary. This is especially critical for serum and plasma samples, which contain highly abundant molecules such as albumin and hemoglobin. Otherwise, the mass spectrometer will sequence thousands of molecules of albumin and miss the most interesting low-abundance proteins.81
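A toy sketch of this top-N precursor selection, with a simple dynamic exclusion list, illustrates why abundant species dominate DDA runs. All peak values here are fabricated, and real instrument control involves charge-state filtering and timed exclusion windows omitted for brevity:

```python
# Toy model of DDA top-N precursor selection with dynamic exclusion.
# MS1 peaks are (m/z, intensity) pairs; all values are fabricated.

def select_precursors(ms1_peaks, top_n=20, excluded=None, tol=0.01):
    """Pick the top_n most intense peaks not on the exclusion list."""
    excluded = excluded or []
    candidates = [
        (mz, inten) for mz, inten in ms1_peaks
        if not any(abs(mz - ex) < tol for ex in excluded)
    ]
    # The instrument fragments the most intense precursors first.
    candidates.sort(key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in candidates[:top_n]]

scan = [(445.12, 9.1e6), (612.34, 3.2e5), (702.88, 1.4e7), (815.40, 2.2e4)]
print(select_precursors(scan, top_n=2, excluded=[445.12]))
# -> [702.88, 612.34]: the excluded peak is skipped despite its intensity
```

Without the exclusion list, the same few intense precursors (an albumin peptide, say) would be reselected scan after scan, which is the behavior the depletion strategies above are meant to counteract.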
The latest mass spectrometry technology utilizes data-independent acquisition (DIA) to sequence all of the peptides in the MS1 scan as they elute from the LC column, as opposed to DDA methods that sequence only the most abundant peaks. DIA methods are advantageous over DDA for complex mixtures, such as the serum example above, because they help overcome the dominance of highly abundant proteins.82–84
In addition to shotgun sequencing of host genes and/or proteins, we can also utilize shotgun sequencing for the microbiome. Shotgun metagenomics/metaproteomics may only sample the dominant microbiota when the sequencing depth is very shallow. A major challenge of shotgun sequencing the microbiome is the difficulty of assembling genome fragments from undersampled data; it is just as difficult to piece together peptides for robust protein and taxon identification.
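A rough coverage calculation (c = read count × read length / genome size, the classic Lander–Waterman estimate) shows why shallow runs recover only dominant members. The community composition below is fabricated purely for illustration:

```python
# Expected per-genome coverage of each community member given its
# relative abundance. Abundances and genome sizes are fabricated.

total_reads = 20e6   # a modest shotgun run
read_length = 150    # bases

community = {
    # name: (relative abundance, genome size in bases)
    "dominant sp.": (0.60, 5e6),
    "common sp.":   (0.30, 4e6),
    "rare sp.":     (0.001, 3e6),
}

for name, (abundance, genome_size) in community.items():
    coverage = total_reads * abundance * read_length / genome_size
    print(f"{name}: ~{coverage:.1f}x coverage")
```

In this sketch the dominant organism receives hundreds-fold coverage while the rare one receives roughly 1×, far below what assembly requires, so its genome fragments simply cannot be reconstructed.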
Despite these potential issues, it is possible to sample the microbial proteome in depth from a variety of human body sites and disease contexts, such as saliva, gut/feces, the cervicovaginal environment, or chronic kidney disease.40,85–87 However, the study/sampling design and analytical approaches one must consider can differ greatly between studies. Several sample preparation approaches have been shown to enrich microbial biomass, ranging from differential centrifugation to double-filtering differential separation. These approaches are often followed by a variety of optimized microbial lysis protocols, typically involving mechanical disruption (e.g. bead beating, sonication), complemented with enzymes (e.g. trypsin) and detergents. Upon successful lysis, it is just as important that remaining enzymes, detergents, and salts be removed. For more details see the reviews by Issa Isaac et al. and Lin et al.7,40 and the references therein.
Another complication for metaproteomics experiments is that proteins within the same organism share peptide sequences. In order to have confidence in a protein identification, a unique peptide match for the protein should be identified with high confidence. This is made even more complicated when mapping peptide sequences to hundreds of different species that have conserved protein sequences. Mass spectrometry does not sequence proteins directly, but rather measures the mass-to-charge ratio of peptides and relies on matching mass spectra to a database of protein sequences for protein identification. However, there is hope of making sense of these data (Tables 1 and 2).7,40
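The inference problem can be sketched with a toy example: map each observed peptide to every database protein containing it, and treat only proteins with at least one unique peptide as confidently identified. The sequences below are fabricated, and real pipelines match spectra to enzymatic digests rather than substrings, but the logic is the same:

```python
# Map observed peptides to a toy protein database and flag unique evidence.

database = {
    "ProtA": "MKAILVVLLYTFATANADTLCIGYHANNSTDTV",
    "ProtB": "MKAILVVLLYTQQWEKLPGNDNSTATLCLGHHAV",
}
observed = ["MKAILVVLLYT", "ATANADTLCIG", "QQWEKLPGND"]

# For each peptide, list every protein whose sequence contains it.
peptide_hits = {
    pep: [name for name, seq in database.items() if pep in seq]
    for pep in observed
}

for name in database:
    unique = [pep for pep, hits in peptide_hits.items() if hits == [name]]
    status = "confident" if unique else "ambiguous (shared peptides only)"
    print(f"{name}: {status}, unique peptides = {unique}")
```

Here the first peptide maps to both proteins and contributes no discriminating evidence; each protein is rescued only by its one unique peptide. Across hundreds of related species, such unique peptides become scarce, which is exactly the metaproteomic difficulty described above.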
Curated databases are critical for properly analyzing the nucleotide and protein sequencing data generated by these various sequencing platforms. The ability to align reads to a reference genome is only as good as the sequence and annotation information present in the reference genome. Several resources continually curate and update nucleotide sequence information and annotation, including the University of California Santa Cruz (UCSC) Genomics Institute genomes, National Center for Biotechnology Information (NCBI) GenBank and RefSeq, the Encyclopedia of DNA Elements (ENCODE), and Ensembl, to name a few. The Universal Protein Resource (UniProt) contains both the Swiss-Prot (manually annotated and reviewed) and TrEMBL (automatically annotated and not reviewed) databases for protein sequence information (Table 1).
Unfortunately, data integration and analysis are very complicated, and few user-friendly tools currently exist for researchers with limited bioinformatics backgrounds. Many tools utilize the statistical language R, which requires programming expertise in addition to strong biostatistical knowledge. For example, the R package integrOmics, which combines proteomics, transcriptomics, and pathway analysis on two data sets, uses correlation analysis and partial least squares regression.89 The R package mixOmics uses multivariate analysis for data exploration, dimension reduction, and visualization.90 Micrographite integrates miRNA and gene expression with pathway analysis,91 while iClusterplus92 and LRAcluster93 use clustering to integrate methylation and gene expression data (Table 3; a minimal sketch of the shared latent-component idea follows the table).
An × in the data-type columns marks the omics data each tool handles.
Tool | Purpose | Metabolomics | Proteomics | Transcriptomics | Pathway analysis | miRNA | SNP analysis | Microbiome | DNA methylation | Copy number variants (CNV) | Genomics | Visualization | Pros | Cons | Ref. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Metaboanalyst 4.0 | Metabolomics data analysis, interpretation, and integration with other omics data | × | × | × | × | Relatively easy to use. Has a web interface. Basic computer skills. | File size limit (50 Mb), installed version requires Linux and programming experience. | 143 | |||||||
Paintomics 3.0 (web based) | Joint visualization of transcriptomics and metabolomics data | × | × | × | × | Relatively easy to use. Automatic feature name conversion. Has a web interface. Basic computer skills. | Most web applications have a file size limit. | 97 | |||||||
integrOmics (R package) | Integrative analysis of two types of omics datasets | × | × | × | × | Customizable | Requires programming skills in R | 89 | |||||||
Omics Integrator | Maps protein data to other data sets | × | × | × | × | × | × | Easy to use web interface. Basic computer skills. | Most web applications have a file size limit. Local installation requires advanced computer skills | 144 | |||||
mixOmics (R package) | Data exploration, dimension reduction, and visualization | × | × | × | × | × | Customizable | Requires advanced programming skills, programming in R | 90 | ||||||
PARADIGM | Extraction of disease-perturbed sub-pathways within pathway networks | × | × | × | × | Uses a combination of algorithms to improve accuracy | Requires advanced programming skills, command line interface, programming in R. Pathways are measured independently, and interactions among pathways are not considered. | 145 |
Micrographite (R Package) | Pathway analysis of miRNA and gene expression profiles | × | × | × | Customizable, integrates pathway information with predicted and validated miRNA-target interactions. | Requires advanced computer skills, programming in R. | 91 | ||||||||
iClusterplus | Integrative clustering of multiple data sets | × | × | × | × | Customizable. Incorporates flexible modeling of the associations between different data types | Requires advanced computer skills, computationally intensive, limitations in statistical inference, programming skills in R | 92 | |||||||
LRAcluster | Integrative clustering of multiple data sets | × | × | × | × | Fast and efficient unsupervised clustering | Command line interface, requires advanced computer skills. | 93 | |||||||
GENEASE | Disease ontology exploration, analysis, and visualization of multiple databases | × | × | × | × | × | × | × | Web based interface. Uses multiple databases in real time. | Most web applications have a file size limit. | 146 | ||||
ProteoClade | Annotate taxa to proteomics data | × | × | × | Customizable. Can work with large data sets. Targeted and de novo database searches. Good tutorials. | Requires advanced computer skills, programming in Python. | 147 | ||||||||
QIIME 2 (q2-micom) | Metabolic modeling | × | × | Customizable, highly versatile. Good tutorials. | Steep learning curve. Requires advanced computer skills | 148 |
QIIME 2 (q2-mmvec) | Learning microbiome/metabolite interactions | × | × | Customizable, highly versatile. Good tutorials. | Steep learning curve. Requires advanced computer skills | 149 |
QIIME 2 (q2-metabolomics) | Tool to import metabolomics data into QIIME 2 | × | × | Customizable, highly versatile. Good tutorials. | Steep learning curve. Requires advanced computer skills | 150 |
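The R packages in Table 3 implement variations of latent-component integration. As an illustration of the shared idea, the following minimal Python sketch uses scikit-learn's PLSCanonical as a stand-in for the partial least squares approach of integrOmics/mixOmics; the data are randomly generated and all dimensions are arbitrary assumptions:

```python
import numpy as np
from sklearn.cross_decomposition import PLSCanonical

rng = np.random.default_rng(0)

# Fabricated example: 30 samples, 200 transcript and 50 protein features.
n_samples = 30
transcripts = rng.normal(size=(n_samples, 200))
# Give the proteome a weak dependence on the first 50 transcripts so the
# latent components have real shared structure to find.
proteins = 0.5 * transcripts[:, :50] + rng.normal(size=(n_samples, 50))

pls = PLSCanonical(n_components=2)
x_scores, y_scores = pls.fit_transform(transcripts, proteins)

# The correlation between paired latent components quantifies structure
# shared across the two omics layers, analogous to what the R packages
# report for their PLS variants.
for k in range(2):
    r = np.corrcoef(x_scores[:, k], y_scores[:, k])[0, 1]
    print(f"component {k + 1}: latent correlation r = {r:.2f}")
```

Each sample receives a score on each latent component per omics layer; samples can then be clustered or visualized in this shared low-dimensional space rather than in either feature space alone.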
For both multi-state and meta-dimensional methods, many different algorithms are used, but the most common are clustering, network analysis, data reduction (e.g. PCA), and Bayesian analysis.94 Ray et al. (2014) used Bayesian analysis of gene expression and methylation data in ovarian cancer, using data collected from The Cancer Genome Atlas, and detected a gene, SPON1, that appears to be regulated by methylation of its CpG site.95 Correlation-based analysis can be useful when prior knowledge of biochemical interactions is lacking.96 Regardless of the methodology, appropriate normalization and data filtering are very important, as data are being incorporated from multiple sources.
There also exist web-based tools, such as Paintomics,97 that attempt to make the data analysis easier, but these can still be difficult for the inexperienced user, and the researcher must have a good working knowledge of their data.98 Further, there are databases commonly used in integrated omics analysis, such as the Cancer Cell Line Encyclopedia (CCLE), The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Omics Discovery Index (OmicsDI)94 (Table 1). CCLE and TCGA have characterized thousands of cancer data sets and can be used for data mining and visualization. TARGET utilizes clinical information and has resources for analytical tools on its website. OmicsDI provides a platform for searching public and protected data for a large variety of organisms.
Once research hypotheses are clearly defined, a suitable study design is selected that best addresses them. Several questions need to be evaluated, such as: are one or more intervention groups compared to a control (or to themselves), or is an effect evaluated in the same samples before and after intervention? Is an intervention effect measured over one period of time, or will samples be measured at several different time points? Will biological samples be pooled or analyzed individually, and what is the scientific justification for this choice?101 Which types of omic platforms will provide the most value,101 and how are the multi-omics data going to be integrated? Are samples from the same biological source available for all multi-omics platforms of interest? Ideally, samples for all omic platforms would be collected from the same source. However, this is not always possible due to sample-specific limitations or the accessibility and amount of material.101 For instance, generating multi-omics data from formalin-fixed paraffin-embedded (FFPE) tissue might not be possible for certain omic platforms.101 While there are many questions to consider during the selection of the experimental design, the deciding factor is usually feasibility and financial limitations.10,102
Following the selection of a study design, available resources need to be allocated between the individual omic platforms.101 This allocation should be guided by the cost and contribution of each individual omic platform to the multi-omics study, as well as its statistical power. Omic platforms with a substantial signal-to-background-noise ratio will require fewer samples, allowing more resources to be allocated to platforms with a smaller signal-to-background-noise ratio, as these require more samples to achieve similarly adequate statistical power. In addition, some omic platforms will also require internal distribution of resources. For example, when designing an RNA-Seq study, the trade-off between the number of samples and sequencing depth needs to be balanced.10
The sample and data collection should be guided by the planned data analysis to reduce confounding and technical artifacts, such as batch effects.10 These effects can be introduced during sample collection, preparation, and storage (e.g. multiplexing).10,101 While some ad hoc methods attempt to reduce biases introduced by technical artifacts, they are inferior to a randomized design.98 However, some technical artifacts cannot be avoided, and in these cases it is important to identify and understand such limitations early in the experimental design to mitigate and recognize their impact on the results and conclusions.10,101
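As a minimal sketch of the randomized design recommended here, the following stratifies a hypothetical sample sheet by biological group and deals each group across batches, so that no group is confined to a single processing batch. The group labels and batch count are illustrative assumptions:

```python
import random
from collections import defaultdict

def randomize_batches(samples, n_batches, seed=42):
    """Stratified randomization: shuffle samples within each biological
    group, then deal each group round-robin across batches so that group
    and batch are not confounded."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for sample_id, group in samples:
        by_group[group].append(sample_id)
    batches = [[] for _ in range(n_batches)]
    offset = 0
    for group, members in by_group.items():
        rng.shuffle(members)
        for i, sample_id in enumerate(members):
            batches[(i + offset) % n_batches].append((sample_id, group))
        offset += 1  # stagger groups so batch sizes stay even
    return batches

# Illustrative sample sheet: (sample_id, group).
samples = [(f"S{i:02d}", "case" if i % 2 else "control") for i in range(12)]
for b, batch in enumerate(randomize_batches(samples, n_batches=3)):
    print(f"batch {b + 1}: {batch}")
```

If cases were instead run in early batches and controls in later ones, any instrument drift would be indistinguishable from the biological effect, which is the confounding this design avoids.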
Due to the complexity and large volume of data associated with multi-omics studies, it is crucially important to tailor the statistical analysis to the specific research project.10 A variety of methods for integrating multi-omics data have been proposed, categorized as supervised, semi-supervised, or unsupervised,103 as well as conceptual, statistical, correlation-based, network-based, and model-based integration.102 The integration and statistical analysis of a multi-omics study depend on the selection of omic platforms and their associated data types (e.g. count values, percentages). Nevertheless, each analysis and method has its own underlying assumptions that need to be verified.98
As in any well-designed study, an initial power calculation is crucial to evaluate and estimate a sufficient number of samples and avoid a potential waste of resources, especially in such large-scale studies.10 Power is defined as the probability of correctly rejecting the null hypothesis, i.e. the likelihood of detecting a true signal or effect. An analytical power calculation is usually impossible due to the complex nature of the study design and data; however, estimating statistical power using simulation is a valuable alternative. The evaluation of statistical power involves four major steps. First, data need to be simulated; a pilot study, prior knowledge, literature, or experts can be beneficial for generating realistic data. Next, a true signal needs to be introduced, guided by the expected effect size. The data can then be analyzed, and the statistical power estimated empirically. In the context of a case-control study, the statistical power is the percentage of correctly identified features out of all features with an introduced effect size.
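The four steps above can be sketched directly. The following minimal example assumes normally distributed features, a two-group design, per-feature t-tests, and a Bonferroni threshold; all parameters (feature counts, effect size, group size) are placeholders that a real pilot study would inform:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_power(n_per_group=10, n_features=1000, n_true=100,
                   effect=1.0, alpha=0.05, n_sims=50):
    """Empirical power: fraction of truly affected features detected
    after Bonferroni correction, averaged over simulations."""
    detected = []
    for _ in range(n_sims):
        # Step 1: simulate null data for both groups.
        a = rng.normal(size=(n_per_group, n_features))
        b = rng.normal(size=(n_per_group, n_features))
        # Step 2: introduce a true effect into the first n_true features.
        b[:, :n_true] += effect
        # Step 3: analyze with a per-feature two-sample t-test.
        _, p = stats.ttest_ind(a, b, axis=0)
        # Step 4: estimate power on the truly affected features.
        hits = (p[:n_true] < alpha / n_features).sum()
        detected.append(hits / n_true)
    return float(np.mean(detected))

print(f"estimated power: {simulate_power():.2f}")
```

Rerunning the simulation over a grid of sample sizes and effect sizes yields power curves from which an adequate sample size can be read off; the multiple-testing burden of omics-scale feature counts is exactly why analytical formulas for single tests are insufficient here.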
The statistical power of a study depends on several factors (Fig. 2), some of which can be controlled, while others are fixed by the study and its design. The first is the choice of statistical method used for the analysis: while some tests are more powerful than others, it is important that their assumptions are verified and met. Another factor that influences the statistical power of a study is the number of variables measured, which is usually dictated by the omic platform.101 For example, genomics typically measures millions of variants,104,105 transcriptomics quantifies tens of thousands of molecules,106 and proteomics107 and metabolomics108,109 profile thousands of molecules. Further, statistical power is affected by the magnitude and prevalence of the effect of the phenotype or exposure (the effect size). How distinct is the effect? How substantial is the signal difference between groups? And how many measured variables are affected? Information about the effect size might be available from previous literature or expert knowledge but is often unknown.101 In such cases, a pilot study can assist with estimating the effect size, but these estimates need to be handled with caution due to their instability.101,110 Yet another power-influencing factor is the homogeneity of the measured values, which reflects the natural variance of the sample, the precision of the measurement instrument, and detection limits. With increasing variance, statistical power is reduced. The variance of the samples can result from many aspects, such as the selection of the sample population, the choice of tissue type, or confounding factors.111 In addition to inflating sample variance, confounding factors can also introduce biases in the data; therefore it is important to collect sample meta-data to mitigate some effects of confounding.101 Because most of the factors affecting the statistical power of a study are fixed or dictated by the study design, the factor most commonly used to adjust statistical power is the sample size.
Applying power analyses to microbiome data is still a burgeoning field of inquiry and is replete with difficulties.112–115 The types of power analyses, like those outlined above, differ based on the questions being asked of microbiome data. Typically, power analyses of microbiome data center on measures of alpha and beta diversity, and on differences in the compositional abundances of taxa.114,116,117 Which of these measures to use will depend on the question at hand. How to integrate these into a multi-omics study is still underdeveloped.115
Power and sample size evaluation is a valuable technique during the experimental design of a study to ensure adequate power and sample size. While both under- and overpowered studies unnecessarily deplete resources, the risk of failure is especially prevalent in underpowered studies. Underpowered studies and studies with improper experimental design are more likely to miss true signals and to produce biased results, false positives (type I errors), and false negatives (type II errors), which lead to misinterpretations.10,101,111 Such incorrect inferences impact reproducibility, scientific progress, and the cost of science.98,118
It is also critical to know which population of regulatory features was captured for sequencing and can be integrated. For example, membrane-bound proteins cannot be integrated with gene expression data if membrane proteins were not solubilized during sample preparation prior to mass spectrometry. A caveat with mass spectrometry data is that a missing value does not necessarily mean a protein is not expressed, only that the protein is below the detection limits of the instrument. The biological questions should be a driving force in the methodology used for multi-omics data integration.
Though multi-omics datasets can provide a greater depth of understanding in certain scenarios, this does not come without cost. Omics studies often rely on large numbers of comparisons, the correct data types, appropriate statistical analyses, and a considerable investment of time, skilled personnel, and money. When constructing an experiment, one must be wary of which types of omics data can and should be integrated to achieve the greatest understanding of the system being studied.98 High-throughput omics platforms are not always necessary to answer the research question. Traditional techniques, such as enzyme-linked immunosorbent assays (ELISA), immunohistochemistry (IHC), and quantitative polymerase chain reaction (qPCR), may be all that is needed to validate a particular biological mechanism. In fact, these techniques are often required to validate the findings from a larger omics study, in order to verify that a significant molecule identified from omics data is a true positive result.
For the most part, current tools utilize clustering, network analysis, data reduction, and Bayesian analysis. Because data acquisition continues to accelerate, producing ever larger and more numerous datasets, machine learning will become more and more necessary for effective analysis and data mining. There is a need for accessible and well-documented methods, tools, and algorithms.96 As with all scientific endeavors, the easy questions will be answered first and "low hanging fruit" will become less prevalent; thus, there is a need for more effective algorithms and computing resources.88 Because of the variety of platforms used to generate multi-omics data, standardization of data formats would make integration easier.94
Future multi-omics data integration algorithms should take advantage of "big data" resources (Table 1) and the advent of machine learning and artificial intelligence.10,119,120 Machine learning has played an increasingly important role in allowing scientists to integrate multi-omics datasets. By utilizing a machine's ability to compare and identify patterns in large quantities of biological data, we allow for far more accurate and efficient methods of elucidating complex cellular mechanisms, and in some cases predicting clinical outcomes. This is achieved through the computer's ability to observe multiple layers of omics data simultaneously, providing a more holistic view of the systems at play, rather than observing each omic layer individually and drawing simple conclusions based on visible correlations.4,120
New data integration methods should include variables related to each omic platform's weaknesses and limitations. Each method is limited by its statistical power, sample size, technical variables, batch effects, sequencing depth, sample preparation, and a multitude of other factors. These factors are important to keep in mind when designing, conducting, and analyzing a study and interpreting the results. It is therefore highly recommended to involve a biostatistician/bioinformatician from the very beginning of any study; their expert knowledge can be valuable at any stage to prevent errors, avoid wasting resources, and optimize the study. The need for training programs in this rapidly evolving field has been recognized by many institutes, such as The Jackson Laboratory, Bioinformatics.org, UC Davis, and Johns Hopkins, and many bioinformatics training programs are available online, free of charge or for a fee. Lastly, researchers should always remember to validate significant findings using traditional wet lab techniques to unmask false positive results.