Adriaan-Alexander
Ludl
and
Tom
Michoel
*
Computational Biology Unit, Department of Informatics, University of Bergen, PO Box 7803, 5020 Bergen, Norway. E-mail: tom.michoel@uib.no
First published on 17th December 2020
Causal gene networks model the flow of information within a cell. Reconstructing causal networks from omics data is challenging because correlation does not imply causation. When genomics and transcriptomics data from a segregating population are combined, genomic variants can be used to orient the direction of causality between gene expression traits. Instrumental variable methods use a local expression quantitative trait locus (eQTL) as a randomized instrument for a gene's expression level, and assign target genes based on distal eQTL associations. Mediation-based methods additionally require that distal eQTL associations are mediated by the source gene. A detailed comparison between these methods has not yet been conducted, due to the lack of a standardized implementation of different methods, the limited sample size of most multi-omics datasets, and the absence of ground-truth networks for most organisms. Here we used Findr, a software package providing uniform implementations of instrumental variable, mediation, and coexpression-based methods, a recent dataset of 1012 segregants from a cross between two budding yeast strains, and the YEASTRACT database of known transcriptional interactions to compare causal gene network inference methods. We found that causal inference methods result in a significant overlap with the ground-truth, whereas coexpression did not perform better than random. A subsampling analysis revealed that the performance of mediation saturates at large sample sizes, due to a loss of sensitivity when residual correlations become significant. Instrumental variable methods on the other hand contain false positive predictions, due to genomic linkage between eQTL instruments. Instrumental variable and mediation-based methods also have complementary roles for identifying causal genes underlying transcriptional hotspots. Instrumental variable methods correctly predicted STB5 targets for a hotspot centred on the transcription factor STB5, whereas mediation failed due to Stb5p auto-regulating its own expression. Mediation suggests a new candidate gene, DNM1, for a hotspot on Chr XII, whereas instrumental variable methods could not distinguish between multiple genes located within the hotspot. In conclusion, causal inference from genomics and transcriptomics data is a powerful approach for reconstructing causal gene networks, which could be further improved by the development of methods to control for residual correlations in mediation analyses, and for genomic linkage and pleiotropic effects from transcriptional hotspots in instrumental variable analyses.
These principles are combined in different ways in two classes of causal inference methods that use genomic variants as causal anchors: instrumental variable (known as Mendelian randomization in genetic epidemiology) or mediation-based.8Mediation infers the direction of causality between two traits that are statistically associated to the same genomic variant by testing whether the association between the variant and one of the traits is mediated by the other trait, in which case there must be a causal relation from the mediating trait to the other one.9,10 Mediation does not require that one of the traits has a “preferential” relation to the genomic variant (as in cis or trans). However, mediation fails in the presence of large measurement noise or hidden confounders, such as common upstream factors coregulating both traits, where it rejects true interactions (i.e. reports false negatives).11
Instrumental variable or Mendelian randomization methods assume that the genomic variant acts as a randomized “instrument” for one of the traits, similar to the random assignment of individuals to treatment groups in randomized controlled trials, such that a statistical association between the variant and the second trait is evidence for a causal relation from the first to the second trait. The random group assignment, in genetics the random segregation of alleles, ensures that causal effects can be detected even in the presence of confounding. However, instrumental variable methods fail if there are pathways from the variant to the second trait other than through the first trait (pleiotropic effects).12–14
A detailed comparison between these two approaches requires a standardized implementation where pre-processing (e.g. data normalization) and post-processing (e.g. multiple testing correction) are handled uniformly. Previously, we developed Findr, a computationally efficient software package implementing six likelihood ratio tests that can be combined in multiple ways to reconstruct instrumental variable as well as mediation-based causal gene networks.11Findr expresses the result of each test as a posterior probability (one minus the local false discovery rate), allowing tests to be combined by the usual rules of probability theory.10 This results in causal network inference methods that are representative for the broader field. For instance, the implementation of the mediation-based method in Findr is identical to the method of Chen et al.,10 which had its roots in the “likelihood-based causal model selection” (LCMS) procedure of Schadt et al.9 The Causal Inference Test (CIT) software15,16 is another implementation of an LCMS-based mediation method, which combines statistical tests using omnibus p-values and FDR estimates. We found previously that it results in similar inferences as the mediation-based method implemented in Findr.11 Instrumental variable methods on the other hand are based on genetic associations, for which Findr uses categorical regression of gene expression profiles on genotype values, similar to for instance the ANOVA option in Matrix-eQTL.17
Using simulated data from the DREAM5 Systems Genetics challenge,18,19 we found previously that instrumental variable methods generally outperformed mediation-based methods in terms of area under the precision–recall curve, and that the performance of mediation-based methods decreased with increasing sample size, due to increased statistical significance of confounding effects.11 However, at that time, no real-world dataset with sufficient sample size as well as an accurate ground-truth network of causal interactions was available to test these predictions in a real biological system.
Fortuitously, a dataset has now become available which includes genomic variation and gene expression data in more than 1000 segregants from a cross between two strains of budding yeast, a popular eukaryotic model organism.20 By learning networks from these data, and comparing against the wealth of transcriptional regulatory interactions and other functional validation data available for budding yeast,21 a thorough benchmarking of methods for reconstructing causal gene networks has become possible.
AUPR-ratio or fold-change is the AUPR divided by the theoretical value for random predictions on a given ground truth. The latter is obtained as the precision for random predictions given by precrandom = NE/(NR × NT) where NR is the number of regulating genes, NT is the number of target genes, NE is the number of edges, i.e. the number of ones in the ground-truth adjacency matrix.
Findr posterior probability matrices were thresholded to obtain discrete networks with an expected FDR target value as described previously:10,11 because for each interaction the local false discovery rate is given by fdr = 1 − p, where p is the posterior probability value obtained by the test, the expected FDR of a network consisting of all interactions with p ≥ pth is the average of the local fdr of the retained interactions. We determined pth as the threshold that gave the greatest expected FDR below the target value (5 or 10%). We counted the number of targets for each source gene with p > pth.
For validation we used networks of known transcriptional regulatory interactions in yeast (S. cerevisiae) from YEASTRACT.21 Regulation matrices were obtained from http://www.yeastract.com/formregmatrix.php. We retrieved the full ground-truth matrices containing all reported interactions of the following types from the YEASTRACT website: DNA binding evidence was used as the “Binding”, expression evidence including TFs acting as activators and those acting as inhibitors was used as the “Expression”, DNA binding and expression evidence was used as the “Binding & Expression”. Self regulation was removed from all ground truths. The numbers of regulators, targets and interactions for these three ground-truth networks are shown in Table 1.
Annotations of the yeast genome were used to map gene names to their identifiers and order them according to the position of their causal anchor (eQTL) along the full genome, first by chromosome and then by position along the chromosome. The sorting algorithm places mitochondrial genes first (when present) and orders the chromosomes according to the numerical value of the roman numerals. We used the gff3 file (Saccharomyces_cerevisiae.R64-1-1.83.gff3.gz) from the Ensembl database (release 83, December 2015),24 which is the version used by ref. 20. The file is accessible at http://ftp://ftp.ensembl.org/pub/release-83/gff3/saccharomyces_cerevisiae/.
Fig. 1 (A) Likelihood ratio (LLR) tests implemented in Findr. E is a causal anchor of gene A. Arrows in a hypothesis indicate directed regulatory relations. Genes A and B each follow a normal distribution, whose mean depends additively on its regulator(s), as determined in the corresponding hypothesis. The dependency is categorical on genotypes and linear on gene expression levels. The undirected line represents a multi-variate normal distribution between the relevant variables. In each test, either the null or the alternative hypothesis is selected, as shown. Figure ©2017 Wang, Michoel, reproduced by permission from ref. 11 under Creative Commons Attribution License. (B) Causal model selection with Findr. By combining the posterior probabilities Pi of the selected hypothesis for test i being true, Findr determines whether coexpressed genes A and B are connected by a causal A → B relation. (Row 1) In the absence of hidden confounders (H), mediation-based causal inference, combining Findr tests 2 and 3, correctly identifies true positive (TP; correlation due to causal A → B relation) and true negative (TN; correlation without causal A → B relation) models. However, it reports a false negative (FN) if the causal relation is affected by a hidden confounder. (Row 2) If the causal anchor is “exclusive” to gene A, then the instrumental variable method based on Findr test 2 correctly identifies TP and TN models, even in the presence of hidden confounding. However, it reports a false positive (FP) if the association between E and B is due to other paths than through A (pleiotropy). (Row 3) An instrumental variable method that combines Findr tests 2 and 5 correctly identifies a true negative if the correlation between A and B is entirely due to a pleitotropic effect of E, but will still report a false positive if there is an additional effect from a hidden confounder. (Row 4) An instrumental variable method based on the compound hypothesis that test 4 is true, or test 2 and test 5 are true, reports a TP for causal relations where E → A → B is not the only path from E to B, with or without confounding, but will report a FP if the true causal relation is B → A (or absent). |
• Mediation. Mediation-based approaches infer a causal interaction A → B if gene B is statistically associated to the causal anchor E, and the association is abolished after conditioning on gene A.9,10,16 In Findr this is accomplished by the compound hypothesis that test 2 and 3 are both true, i.e. by the posterior probability P2 × P3. Mediation can distinguish true positive (TP) from true negative (TN) causal interactions in the absence of hidden confounders, but will report a false negative (FN) if a real causal interaction is confounded by a hidden factor (Fig. 1B, row 1), due to a collider effect.10,11
• Instrumental variables without pleiotropy. Instrumental variable approaches assume that the causal anchor E acts as a randomized instrument for gene A, and, in their simplest form, infer a causal interaction A → B if gene B is statistically associated to the causal anchor E, i.e. by the posterior probability P2 that test 2 is true. Instrumental variables can distinguish true positive from true negative causal interactions even in the presence of hidden confounders, but will report a false positive (FP) if there are other pathways than through A that cause a statistical association between E and B (pleiotropy) (Fig. 1B, row 2).
• Instrumental variables with perfect pleiotropy. To address the problem of pleiotropy, we can additionally require that genes A and B are not independent after conditioning on E, accomplished by the compound hypothesis that test 2 and 5 are both true, i.e. by the posterior probability P2 × P5. This correctly identifies a true negative if E explains all of the correlation between A and B, but will still result in a false positive if there is a hidden confounder (Fig. 1B, row 3).
• Instrumental variables with partial pleiotropy. To overcome the problem of FP predictions in the “confounded pleiotropy” situation, we introduced test 4 in Findr, which tests whether gene B is not independent of E and A simultaneously, and found empirically that the combination performs better than P2 × P5 alone.11 In particular, it identifies a TP for causal A → B relations even in the presence of alternative E → B paths and hidden confounding, at the expense of FP predictions when the relation is reversed or absent (Fig. 1B, row 4).
• Coexpression. As a basic reference, we reconstructed a gene network based on coexpression alone, using Findr test 0. Note that the posterior probability P0 is not symmetric (P0(A → B) ≠ P0(B → A)), because it is estimated from the observed distribution of LLR test statistics for each A separately.11
To illustrate the differences between coexpression, instrumental variable, and mediation-based gene networks, we considered the sub-networks inferred between the 2884 genes that had a causal anchor (i.e. the sub-network where the probability of an edge can be estimated for both edge directions). As expected, the coexpression network (P0) is largely symmetric (Fig. 2, left), whereas the causal instrumental variable (P2, Fig. 2, center) and mediation-based (P2P3, Fig. 2, right) networks show a clear asymmetric structure with some genes having a very large number of high-confidence targets. These genes correspond to transcriptional hotspots, regions of the genome with a large, genome-wide effect on gene expression.20 The overall structure of the causal networks appears consistent with the general considerations above. The overall signal (strength of posterior probabilities) is weaker in the mediation-based network, consistent with an increased false negative rate (Fig. 2, right). On the other hand, the instrumental variable network appears to have a genomic structure, where nearby genes are mutually connected and have a similar target profile (Fig. 2, middle). This could be due to genomic linkage between causal anchors: if two genes A and A′ share the same or highly correlated instruments E and E′, then their predicted target sets would also be very similar, and probably include a large proportion of false positive predictions for either gene.
Fig. 2 Matrices of predicted gene interactions. These square matrices represent the interactions between 2884 genes with causal anchors (eQTLs), posterior probability values are color coded. Vertical bands correspond to hotspots. Left: The correlation based test P0. Center: The instrumental variable test P2. Right: The mediation test P2P3. The genes are ordered according to the position of their causal anchor in the full yeast genome. See ESI,† Fig. S1 for the instrumental variable tests P2P5 and P. |
Ground-truth network | N R | N T | N E | N sE |
---|---|---|---|---|
Binding | 90 | 5151 | 19099 | 28 |
Binding & Expression | 80 | 3394 | 5680 | 24 |
Expression | 113 | 5369 | 92646 | 77 |
These results are consistent with previous work on simulated data, where we observed a decrease in performance with increasing sample size for mediation-based methods.11 There we showed that hidden confounders and measurement noise can lead to a residual correlation between the causal anchor E and a target gene B after adjusting for the regulatory gene A (cf.Fig. 1). At sufficiently large sample size, this residual correlation becomes significantly different from zero and thereby leads to a false negative prediction.
Sample size showed little effect on the coexpression method P0 for sample sizes larger than 400 for all ground truths. This is consistent with the notion that estimates of correlations will stabilize around their true values at smaller sample sizes than estimates of causal effects.
To perform the analysis, we first truncated the posterior probability values in order to obtain discrete, directed networks. Thresholds were determined to obtain networks with an expected FDR ≤5% (for the instrumental variable methods) or ≤10% (for the mediation-based method) (see Methods section). The larger FDR value for mediation was chosen to counterbalance its increased false negative rate, and resulted in posterior probability thresholds that were comparable between all methods (Table 2).
Test | p th | FDR | N R | N T | N E | ρ |
---|---|---|---|---|---|---|
P 2 P 3 | 0.8175 | 0.09953 | 1808 | 5628 | 144091 | 0.014 |
P 2 | 0.825 | 0.04974 | 2884 | 5720 | 2319854 | 0.141 |
P 2 P 5 | 0.8375 | 0.04994 | 2884 | 5719 | 1740251 | 0.106 |
P | 0.8575 | 0.04982 | 2884 | 5720 | 2428039 | 0.147 |
Despite the similar posterior probability thresholds, the instrumental variable networks are around ten times more densely connected than those obtained by the mediation-based method (Table 2); a difference that cannot be explained by the lower sensitivity of the latter alone. We show that in the instrumental variable networks, high interaction counts occur in blocks that roughly follow the structure of the causal-anchor genotype covariance, whereas they occur more in spikes in the mediation network (Fig. 5). This becomes apparent when plotting the number of targets for each regulatory gene (i.e. each gene with a significant cis-eQTL) versus its position on the genome. In instrumental variable methods, the genomic causal anchor is used as a “proxy” for the regulatory gene. Hence, if the causal anchor genotypes of two genes within the same locus are correlated due to genomic linkage, then their target sets will unavoidably be similar as well, resulting in the pattern observed in Fig. 5. In mediation-based methods, the expression profile of the regulatory gene is used as the mediator (in test 3, cf.Fig. 1A), and hence a target set will be specific to a regulator, even when its causal anchor is correlated or shared with other genes.
Fig. 5 Hotspots and genotype covariance. (A) The counts of significant interactions for the mediation-based method P2P3 at FDR below 10%, with annotations for eight regulatory genes with more than 1000 targets. (B) The counts of significant interactions for the instrumental variable method P2 at FDR below 5% (in grey), and the number of non-zero effects for 102 hotspot markers from ref. 20 (in black); the subset of these hotspots that are also a causal anchor (i.e. the strongest local eQTL for at least one gene) for the Findr analysis are marked in green and are also indicated by diamonds at the top of the panel. Interaction count plots for the other instrumental variable methods are in ESI,† Fig. S3. (C) The diagonal of the genotype covariance matrix for the 2884 causal anchor eQTLs. Genes are ordered along the horizontal axis according to the position of their causal anchor in the yeast genome. |
STB5 is a transcription activator of multidrug resistance genes,27 and the only gene located in one hotspot region on chromosome VIII. The hotspot marker, chrVIII:459310_C/G, is located 11 bp upstream from STB5, and is the causal anchor for STB5 and for no other genes (Fig. 5B and Fig. 6, left). The instrumental variable method P2 predicted 131 targets at FDR below 5% for STB5, which are strongly enriched for STB5 targets in the Binding (hypergeometric p-value 2.3 × 10−12) and Binding & Expression (hypergeometric p-value 1.9 × 10−10) ground-truth networks. This suggests that when a hotspot can be confidently mapped to a single gene, instrumental variable methods predict biologically plausible target sets confirming the candidate causal hotspot gene. In contrast, the mediation-based method P2P3 predicted only nine STB5 targets at FDR below 10%, with no enrichment in the ground-truth networks. A possible mechanism that could explain the loss of sensitivity of mediation in this case was already suggested by Albert et al.:20STB5 does not show allele-specific expression, but carries protein-altering variants between the two yeast strains that were crossed, suggesting that the causal variants in this hotspot act by directly altering Stb5p protein activity; moreover Stb5p is predicted to target its own promoter in YEASTRACT. Taken together, this leads to a model where transcription of STB5 is a noisy measurement of Stb5p level, that does not block the path from the protein-altering variants to STB5 target genes via Stb5p protein level (Fig. S4, ESI†). Hence conditioning on STB5 transcription in P2P3 does not remove the association between these variants and the target genes completely, resulting in false negative predictions by a process similar to the measurement noise model studied in ref. 11.
DNM1 is a gene located near a hotspot region on chromosome XII, and is among the genes with highest target count in the mediation-based network (Fig. 5A and Fig. 6, right). The hotspot marker is also the causal anchor of DNM1, which is located 11363 bp downstream of this marker and outside the hotspot region mapped by Albert et al. Comparison with the target counts in the instrumental variable network, which closely follow the genotype covariance pattern, shows that DNM1 is the gene in this region that retains the most targets by far in the mediation network (P2, 2910 targets; P2P3, 1421 targets; Fig. 6, right). This is particularly true when compared with two of the four candidate causal genes of Albert et al. that also have a local eQTL within the hotspot region, YLL007C (also known as LMO1; P2, 2846 targets; P2P3, 139 targets) and MMM1 (P2, 3610 targets; P2P3, 8 targets). Based on the high specificity of the P2P3 test, we conjecture that DNM1 is a more likely causal gene for this hotspot than LMO1 or MMM1. Functional analysis in this case does not help to distinguish between these candidates, because Dnm1p and Mmm1p are both essential proteins for the maintenance of mitochondrial morphology,28 and Lmo1p is a signaling protein involved in mitophagy.29 However, deletion of DNM1 and MMM1 results in distinct mitochondrial phenotypes,28 and hence this prediction is experimentally testable in principle.
Causal inference is designed to reconstruct truly directed networks, by integrating genomics and transcriptomics data based on general principles of quantitative trait locus analysis.1 The publication of a dataset of more than 1000 yeast segregants has allowed us to demonstrate that causal inference indeed results in directed networks with strong, non-random overlap with networks of known transcriptional interactions. Moreover, the overlap was highest for the most reliable ground-truth that combined two sources of experimental evidence (DNA binding and response to perturbation). Although 1012 samples for an organism with around 6000 genes may seem a large number, our analysis also shows that there is no sign yet that performance is saturating as a function of sample size. Causal inference indeed requires larger sample sizes than coexpression-based methods, because it relies on more subtle patterns in the data, something that was already apparent in early considerations of causal inference in this context.5
Although the integration of genomics and transcriptomics data addresses the key shortcoming pertaining to lack of directionality in network inference when using transcriptomics data alone, important limitations remain. Apart from those already discussed at length in this paper—low sensitivity due to hidden confounders for mediation-based methods, and increased false positive rate due to genomic linkage for instrumental variable methods—another fundamental problem remains: transcriptional regulation is, for the most part, carried out by proteins. Hence, causal interactions inferred from genomics and transcriptomics data are by definition indirect. If an intermediate, unmeasured protein C (e.g., the protein product of A) lies on the path from gene A to gene B, that is, A also mediates associations between C and local eQTLs for A, then this does not affect the causal inference for the interaction A → B (Fig. S2, ESI†). However, variation in transcription level of a transcription factor (or other regulatory protein) does not always translate to equal variation in protein level, and vice versa. For instance, Albert et al.20 found several protein-altering variants in candidate causal genes mapped to hotspot regions that did not have any local eQTLs. In such cases, our methods would wrongly assign the trans-associated target genes to a gene with local eQTL (if one exists), and miss the non-varying (at transcription level) causal gene. This limitation can only be addressed by integrating another layer of information—proteomics data, which is not yet available in comparable sample sizes.
The methods implemented in Findr and analyzed in this paper are broadly representative of the current state-of-the-art for causal inference from genomics and transcriptomics data. Nevertheless, some ideas have been proposed recently that we did not evaluate here. For instance, in addition to the statistical tests implemented in Findr, Badsha and Fu33 propose to also use a causal anchor for the target gene B to obtain evidence for a causal interaction A → B. However, including this test requires limiting the analysis to interactions where both source and target genes have a significant eQTL. Yang et al.34 on the other hand propose to address the hidden confounder problem in mediation by adjusting for selected surrogate variables (e.g. principal components). However, such variables are necessarily composed of combinations of genes and it is challenging to ensure that they only represent common parents and no common children of an A → B interaction (which would introduce false positives if conditioned upon, see Fig. S2, ESI†). It will be of interest to include these developments in future comparisons.
The hypothesis-driven nature of causal inference lies in between the use of biophysical models of gene regulation and the application of unsupervised machine learning methods for reconstructing gene regulatory networks. Biophysical approaches attempt to include quantitative models of TF-DNA interactions into the network inference process,35,36 but are hampered by a lack of resolution in omics data (due to both noise within a sample, and limited sampling density). Unsupervised machine learning approaches search for non-random patterns within the data, but without specifying the type of pattern that corresponds to a real biological interaction, they lack the ability to identify truly directed associations. Supervised or semi-supervised methods, which use data of known regulatory interactions to label all or some of the gene pairs as interacting or not,37 could potentially overcome this limitation, but such labelling data is sparse for non-model organisms.
The agreement between theoretical predictions and empirical results indicates that we have a correct understanding of how causal gene network inference algorithms work. Moreover, we are able to interpret results in terms of which type of interactions these algorithms do and do not identify, albeit without any reference to the underlying biophysical mechanisms.
In general, we recommend instrumental variable over mediation-based methods, as their increased sensitivity tends to outweigh the higher specificity of mediation-based methods. The saturation of performance of mediation-based methods with increasing sample sizes is particularly worrying, although for most current datasets the point where performance saturates is probably not yet reached.
We found limited differences between instrumental variable methods. In the absence of any ground-truth data to evaluate results, we would generally recommend to use the P2P5 method, because it will remove at least the most obvious cases of pleiotropy from P2, while having an easier interpretation than the P method.
The main weakness of instrumental variable methods is their susceptibility to false positive predictions due to genomic linkage. This is a particular concern in data from experimental crosses or inbred organisms, where linkage blocks are large. However, in human data as well it has been found that around 10% of non-redundant local eQTLs are associated to the expression of multiple nearby genes.38 As illustrated, mediation-based causal inference and manual analysis of gene function can sometimes be used to resolve linkage of causal anchors.
In conclusion, causal inference from genomics and transcriptomics data is a more powerful approach for reconstructing causal gene networks than using transcriptomics data alone. This could be further improved upon by the inclusion of additional layers of omics data and by the development of methods to control for or find signal in residual correlations among genes in mediation analyses, and to resolve genomic linkage and pleiotropic effects from transcriptional hotspots in instrumental variable analyses.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d0mo00140f |
This journal is © The Royal Society of Chemistry 2021 |