Nitay Itzhacky and Roded Sharan*
School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel. E-mail: roded@tauex.tau.ac.il
First published on 2nd November 2020
Detecting cancer dependencies is key to disease treatment. Recent efforts have mapped gene dependencies and drug sensitivities in hundreds of cancer cell lines. These data allow us, for the first time, to learn models of tumor vulnerabilities and apply them to suggest novel drug targets. Here we devise novel deep learning methods for predicting gene dependencies and drug sensitivities from gene expression measurements. By combining dimensionality reduction strategies, we are able to learn accurate models that outperform simpler neural networks or linear models.
In spite of the tremendous promise of sgRNA-based screens, it has been shown that genes within highly amplified regions yield a great deal of false-positive results.1 In addition, the on-target activity and off-target effects of individual sgRNAs can vary widely, which is another source of inaccuracy.5 Several computational methods were developed to address these problems.5,6 In particular, CERES7 significantly decreases the false-positive rate in copy-number-amplified regions.
The cancer dependency map project (http://depmap.org) aims to systematically catalog and identify biomarkers of genetic vulnerabilities and drug sensitivities in hundreds of cancer models and tumors, in order to accelerate the development of precision treatments. In particular, it reports the Achilles data set, which contains gene dependency values obtained with CRISPR-Cas9 screening and processed with the CERES algorithm for data cleaning and normalization. These values represent the effects of gene knockdowns on the survival of cancer cell lines, where large effects point to cancer vulnerabilities. A detailed description of the dataset is provided in Methods.
A fundamental question concerning this important resource is whether one can automatically learn probabilistic models that predict gene dependencies or drug sensitivities for new samples or drugs. Previous work in this domain focused mainly on small sets of target cell lines or drugs.8–11 Related work on drug sensitivity prediction focused on inferring the sensitivity of new cell lines to a fixed set of drugs, leaving open the reverse prediction of cell line sensitivity to new, previously unobserved, drugs.12–15
Here we aim to address this question of the predictive power of transcriptomics data in estimating gene dependencies and drug sensitivities. We design a neural network regression model to predict the dependency data of new genes in various cell lines based on the genes' expression profiles. Furthermore, given gene expression measurements of a cell line, we design a neural network that combines an encoder–decoder component and a prediction component to predict the sample's gene dependencies. Finally, we construct a neural network to predict drug sensitivities based on the expression and dependency profiles of their target genes. By combining dimensionality reduction strategies, we are able to learn accurate models that outperform simpler neural networks or linear models. Our models and implementation are publicly available at: https://github.com/cstaii2020/cancer_dependency_prediction.
First, we aim to predict the dependencies of a gene (across cell lines) from its expression profile. Before tackling this problem, we wish to assess the correlation between gene expression and gene dependency. To this end, we use a common multivariate statistical method: canonical correlation analysis (CCA, see Methods). CCA considers the expression and dependency information of a gene as two views of the same entity, aiming to project these views into a common subspace in which their cross-correlation is maximized. The analysis over the 630 available cell lines shows significant correlations between the two views of the data, the highest being 0.74 (for the first CCA component, see Fig. 1a for the full results), implying that the dependency information may be predictable from the expression data.
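As a rough illustration, this kind of analysis can be run with scikit-learn as sketched below; the array contents, the number of genes, and the number of components are placeholders rather than the exact settings used in this work.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.cross_decomposition import CCA

# expr, dep: genes x cell-line matrices holding the expression and dependency
# profiles of the same genes (random placeholders standing in for the real data).
rng = np.random.default_rng(0)
expr = rng.normal(size=(1000, 630))
dep = rng.normal(size=(1000, 630))

cca = CCA(n_components=10)                # number of canonical components to extract
expr_c, dep_c = cca.fit_transform(expr, dep)

# Canonical correlations: Pearson correlation between each pair of projected components.
canonical_corrs = [pearsonr(expr_c[:, i], dep_c[:, i])[0] for i in range(10)]
print(canonical_corrs)
```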
In order to predict gene dependencies across cell lines, we need to solve a multivariate regression task in which every cell line is a dependent variable. We note that while previous work approached this problem, it focused on a very small number of specific cell lines and used an obsolete version of the dependency data,8 hence we could not readily compare to it.
We attempted to predict the dependency data using various models. First, we tested a simple linear regression model, which achieved a cross-validation score of 0.55 ACC (average correlation coefficient across cell lines and folds, see Methods; Spearman correlation of 0.37). This confirms that a gene's expression profile is a very good predictor of its dependencies. Next, we attempted to improve this result by adding L2 regularization to the linear model (ridge regression). This yielded a slight improvement to an ACC of 0.57 (Spearman correlation of 0.38). Last, we attempted to improve performance further with a non-linear model. To this end, we built a feed-forward neural network model with a non-linear activation function. Our cross-validation test tunes the hyperparameters that define the network architecture (i.e., the number of hidden layers and the number of neurons per layer, see Methods). This yielded a further improvement to an ACC of 0.6 (Spearman correlation of 0.385).
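A minimal sketch of these baselines using scikit-learn is given below; the placeholder arrays, regularization strength and layer sizes are illustrative assumptions, not the hyperparameters tuned in this work.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.neural_network import MLPRegressor

# X: genes x cell lines, expression profile of each gene across cell lines
# Y: genes x cell lines, dependency of each gene in each cell line (placeholders)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 630))
Y = rng.normal(size=(1000, 630))

linear = LinearRegression().fit(X, Y)                # plain linear regression
ridge = Ridge(alpha=1.0).fit(X, Y)                   # L2-regularized linear model
mlp = MLPRegressor(hidden_layer_sizes=(300, 300),    # non-linear feed-forward network
                   activation="relu", max_iter=500).fit(X, Y)

# Held-out predictions from each model are then scored with the ACC measure (see Methods).
Y_pred = mlp.predict(X)
```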
In order to further improve performance, we explored the impact of dimensionality reduction of the gene expression data using the CCA results. We observed that the correlations of the different CCA components are high (Fig. 1a), and so we fed the CCA-reduced expression data as features to the neural network model, with the reduced dimension as a hyperparameter. Altogether, this improved scheme led to an ACC score of 0.61 (Spearman correlation of 0.39). The full results are summarized in Fig. 1b.
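A minimal sketch of this reduction step, reusing the regression network from the sketch above; in practice the CCA projection should be fit on the training fold only, and the number of components is a hyperparameter tuned by cross-validation.

```python
from sklearn.cross_decomposition import CCA
from sklearn.neural_network import MLPRegressor

def fit_on_cca_features(X_expr, Y_dep, n_components=50):
    """Project the expression data onto its CCA components and fit the regression network."""
    cca = CCA(n_components=n_components).fit(X_expr, Y_dep)
    X_reduced = cca.transform(X_expr)                    # genes x n_components
    model = MLPRegressor(hidden_layer_sizes=(300, 300), activation="relu",
                         max_iter=500).fit(X_reduced, Y_dep)
    return cca, model
```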
The second prediction problem we consider is the prediction of gene dependencies in yet unobserved cell lines (or samples). First, we study the distribution of the gene dependency data. A plot of gene dependencies is given in Fig. 2. From the plot, it is evident that genes can be roughly classified into ∼13000 "constant" genes, i.e., close to 0 or close to 1 with low standard deviation (<0.1), and ∼5000 "variable" genes, i.e., between 0 and 1 with high standard deviation (>0.1).
Fig. 2 Scatter plot of gene dependencies. Shown are the mean and standard deviation of the dependencies of each gene across cell lines.
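As an illustration, this split can be computed directly from the per-gene statistics; the thresholds below follow the definitions given in Methods, and the array name is a placeholder.

```python
import numpy as np

# dep: cell lines x genes matrix of dependency scores (placeholder name)
means, stds = dep.mean(axis=0), dep.std(axis=0)
variable = stds > 0.1                          # ~5000 "variable" genes
near_zero = (stds < 0.1) & (means < 0.5)       # "constant" genes close to 0
near_one = (stds < 0.1) & (means > 0.5)        # "constant" genes close to 1
```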
We focus here on the prediction of dependencies for the variable group of genes. Such a prediction problem is challenging since a direct prediction of the dependency values (e.g., using linear regression) requires learning about 90 M parameters, while the number of data points is around 30 M. Similarly, for a simple neural network architecture, the relatively small number of data points for training greatly limits the number and size of the hidden layers (see Methods). Indeed, a neural network model without dimensionality reduction provided poor results. Specifically, it performed best with 3 hidden layers of 100 neurons each, yielding an ACC across all genes of 0.14, with 1172 genes having a correlation higher than 0.2.
To tackle the dimensionality challenge, we aimed to reduce the data dimension. However, the dimensions of both the expression data and the dependency data are much larger than the sample size (the number of cell lines). In such a setting, the canonical correlations can be extremely misleading as they are generally substantially overestimated.16 Indeed, when we applied CCA to these data, the resulting canonical correlations were always one, implying that they do not carry any information about the true population canonical correlations.
Therefore, we chose to reduce the dimension of the independent variables (the expression data) and of the dependent variables (the dependency values) using autoencoders.17 An autoencoder consists of an encoder and a decoder. Given an n-dimensional input data point x, the encoder first maps x to its latent d-dimensional representation z = E(x), where typically d < n. The decoder then maps z to a reconstruction x′ = D(z) (also n-dimensional), with the reconstruction error measuring the deviation between x and x′. The encoder consists of an input layer, one hidden layer, and an encoded layer. The decoder starts at the encoded layer, uses one hidden layer and finally outputs the reconstruction layer (see Fig. 3). The parameters of the network are updated via backpropagation with the goal of minimizing the reconstruction error (see Methods). By restricting the latent space to a lower dimensionality than the input space (d < n), the trained autoencoder parametrizes a low-dimensional nonlinear manifold that captures the data structure.
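A minimal sketch of such an autoencoder (one hidden layer on each side), written here with Keras; the hidden-layer width and activation choices are assumptions for illustration, not the exact configuration used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_autoencoder(n_inputs, hidden_dim, latent_dim):
    """Encoder: input -> hidden -> latent; decoder: latent -> hidden -> reconstruction."""
    x_in = tf.keras.Input(shape=(n_inputs,))
    h_enc = layers.Dense(hidden_dim, activation="relu")(x_in)              # encoder hidden layer
    z = layers.Dense(latent_dim, activation="relu", name="latent")(h_enc)  # encoded layer
    h_dec = layers.Dense(hidden_dim, activation="relu")(z)                 # decoder hidden layer
    x_out = layers.Dense(n_inputs, activation="linear")(h_dec)             # reconstruction layer
    autoencoder = Model(x_in, x_out)
    encoder = Model(x_in, z)
    autoencoder.compile(optimizer="adam", loss="mse")                      # minimize reconstruction error
    return autoencoder, encoder

# e.g., reduce 17040-dimensional expression profiles to a 500-dimensional latent space
# (the hidden width of 2000 is an arbitrary illustrative choice):
# ae, enc = build_autoencoder(17040, 2000, 500)
# ae.fit(X_expr, X_expr, epochs=50, batch_size=32)
```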
In more detail, the dependency data spans 630 cell lines, each with expression measurements of 18239 genes and 4902 dependency values of variable genes. We filtered out genes whose expression data has more than 80% zero values, to ensure sufficient information through the cross-validation iterations. This resulted in a final set of 17040 genes. Autoencoders were applied to reduce the dimension of the expression profiles to 500 and of the dependency profiles to 300 (these values were chosen to respect the volume of the data available for training, see Methods). We evaluated the autoencoders in a 5-fold cross-validation setting. We found that the average Pearson correlation between the expression data and its reconstruction was 0.94, while the same measure for the dependency data autoencoder was 0.82. These high correlations indicate that the dimension-reduced profiles represent the original data well.
When combining the autoencoders into the neural network regression scheme, we could explore larger layers (Methods), achieving the best results with 6 hidden layers of 600 neurons each. The average Pearson correlation across all genes was 0.18, while 1639 genes had a correlation higher than 0.2. These results and the comparison to a neural network that does not use the autoencoders are summarized in Fig. 4.
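A sketch of how the trained components could be chained at prediction time, assuming an expression encoder, a regression network mapping the 500-dimensional expression embedding to the 300-dimensional dependency embedding, and the decoder half of the dependency autoencoder; the model names below are hypothetical.

```python
def predict_dependencies(X_expr_new, expr_encoder, dep_regressor, dep_decoder):
    """Predict variable-gene dependencies for unseen cell lines (illustrative pipeline)."""
    z_expr = expr_encoder.predict(X_expr_new)   # 500-d encoded expression profiles
    z_dep = dep_regressor.predict(z_expr)       # 300-d predicted dependency embedding
    return dep_decoder.predict(z_dep)           # decode back to the 4902 variable genes
```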
The last prediction task we consider is drug sensitivity prediction, for which we explored two sources of information. First, as above, we used gene expression data. Specifically, for every drug, we obtained its set of known targets (from DepMap) and formed mean expression vectors for those targets across the 524 cell lines. For this task, we first explored ridge regression as a linear model, which achieves a cross-validation score of 0.36 ACC. Second, we used a neural network regression model, which achieves a cross-validation score of 0.39 ACC.
Next, we wished to explore the prediction of drug sensitivities based on the corresponding gene dependencies. We formed mean vectors from the dependency data of the drug's targets, similarly to the mean expression vectors above. Applying ridge regression to these data yielded an ACC score of 0.37. A neural network regression model performed better, with a cross-validation score of 0.39 ACC.
Finally, the combination of the two data sources used above can be exploited for predicting drug sensitivities. To this end, we concatenated the expression and dependency data and tested both a linear model and a neural network model. The linear model's performance improved to an ACC of 0.39. The neural network model also benefited from utilizing both data sources, achieving an ACC score of 0.4. The results are summarized in Fig. 5.
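As an illustration of how the drug-level feature vectors might be assembled, the sketch below averages the expression and dependency profiles of a drug's targets and concatenates the two; the data frames (genes as rows, cell lines as columns) and the target mapping are hypothetical names.

```python
import numpy as np

def drug_features(targets, expr_df, dep_df):
    """Mean target expression and dependency across cell lines, concatenated per drug."""
    genes = [g for g in targets if g in expr_df.index and g in dep_df.index]
    mean_expr = expr_df.loc[genes].mean(axis=0).to_numpy()   # one value per cell line
    mean_dep = dep_df.loc[genes].mean(axis=0).to_numpy()
    return np.concatenate([mean_expr, mean_dep])

# X = np.vstack([drug_features(drug_targets[d], expr_df, dep_df) for d in drugs])
# Y holds the measured drug sensitivities; fit ridge regression or the network as before.
```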
Our work focused on predicting genetic dependencies from transcriptomics data as this remains the most abundant data type to date. Yet, our framework can be strengthened by combining other data types such as DNA methylation data, proteomics, interactome data, etc. In addition, while our algorithms achieved promising results, their performance in a real clinical setting where samples come from real patients will need to be assessed when such dependency data becomes available.
Overall, 4902 genes were variable, 12500 were near 0 (mean <0.5 and std <0.1) and 837 were near 1 (mean >0.5 and std <0.1). When learning the variable gene dependencies, we used the 630 cell lines for which the dependency data for all these genes were available.
Evaluation of a regression task is typically performed using the Pearson correlation coefficient. It is preferable to measures like the mean squared error, whose value can be on an arbitrary scale. As the regression problems we solve have multiple dependent variables, we compute the correlation for every variable separately and then average the results across all of them;23 we call this measure ACC (average correlation coefficient). The score of a cross-validation test is thus the average ACC across all test folds.
\[ \mathrm{ACC} = \frac{1}{m}\sum_{j=1}^{m} r\left(y_j, \hat{y}_j\right) \qquad (1) \]
where m is the number of dependent variables, y_j is the vector of observed values of the j-th variable across the test samples, ŷ_j is the corresponding vector of predictions, and r denotes the Pearson correlation coefficient.
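A minimal sketch of this measure, assuming matrices with samples in rows and dependent variables in columns:

```python
import numpy as np
from scipy.stats import pearsonr

def acc(Y_true, Y_pred):
    """Average correlation coefficient (eq. (1)): mean Pearson r over the dependent variables."""
    m = Y_true.shape[1]
    corrs = [pearsonr(Y_true[:, j], Y_pred[:, j])[0] for j in range(m)]
    return float(np.mean(corrs))
```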
In order to choose the number and size of layers in the deep network, we used cross-validation. We explored 1–10 hidden layers and layer sizes of 100 to 1000. We limited the search to architectures in which the number of weights to learn is at most the number of available data points to train on. In addition, we limited the search so that the number of neurons of a hidden layer is roughly between the sizes of the input and output layers.
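To make these constraints concrete, the sketch below enumerates candidate architectures under the two restrictions; the parameter counting ignores bias terms, and the exact counting used in this work may differ.

```python
def candidate_architectures(n_in, n_out, n_samples,
                            depths=range(1, 11), widths=range(100, 1100, 100)):
    """Yield (depth, width) pairs satisfying the architecture-search constraints described above."""
    lo, hi = sorted((n_in, n_out))
    for depth in depths:
        for width in widths:
            if not (lo <= width <= hi):      # width roughly between input and output layer sizes
                continue
            n_weights = n_in * width + (depth - 1) * width * width + width * n_out
            if n_weights <= n_samples:       # at most one weight per available data point
                yield depth, width
```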