Xiaoying Li,
Yaping Lin* and
Changlong Gu
College of Information Science and Engineering, Hunan University, Changsha, Hunan 410012, China
First published on 23rd June 2017
Increasing evidence indicates that microRNAs (miRNAs) regulate gene expression at the post-transcriptional level. Aberrant miRNA expression has been associated with many types of human disease, including cancers, and these associations can be used to understand the pathogenesis of diseases. However, identifying associations between diseases and miRNAs experimentally is time consuming and costly. Computational methods can find the most promising miRNA-disease associations in a short time, thereby significantly reducing experimental time and cost. This paper presents a network similarity integration method (NSIM) for predicting potential miRNA-disease associations, based on the consideration that diseases associated with highly related miRNAs are more similar (and vice versa). The NSIM is built on 5425 experimentally verified human miRNA-disease associations involving 495 miRNAs and 381 diseases. The NSIM integrates the disease similarity network, the miRNA similarity network, and the known miRNA-disease association network on the basis of cosine similarity to predict novel miRNA-disease associations. We evaluate the NSIM using leave-one-out cross validation. The area under the curve (AUC) of the method is 0.9475, indicating strong performance. Case studies on prostate, breast, and colon neoplasms further demonstrate that the NSIM can predict related miRNAs not only for diseases with known associations but also for isolated diseases (diseases without any known related miRNAs).
In the last few years, considerable effort has been devoted to identifying potential miRNA-disease associations. Biological experiments have established a large number of miRNA-disease associations. Databases such as HMDD,13 miR2Disease,14 dbDEMC,15 and miRCancer16 have been built to provide platforms for searching experimentally verified miRNA-disease associations. HMDD and miR2Disease are collections of experimentally supported human miRNA-disease associations, manually retrieved from the literature. The miRCancer database stores miRNA-cancer associations, which are extracted using a rule-based text mining method. In addition, dbDEMC stores differentially expressed miRNAs in 14 human cancers, using significance analysis of microarrays to retrieve the miRNAs whose expression levels in cancers differ from those in normal tissues. These databases serve as a solid data foundation for predictive research on miRNAs in human diseases.
Considering that the experimental identification of disease-related miRNAs is time consuming and expensive, researchers proposed computational methods as important complementary ways to predict miRNA-disease associations. Computational methods mainly aim to select the most promising disease-related miRNAs for further experimental examination to reduce experimental time and cost. The key problem in miRNA-disease association inference is similarity calculation. These computational methods are divided into two categories:17 network-based methods18–26 and machine-learning-based methods.26–30
Network-based methods predict miRNA-disease associations on the basis of the hypothesis that functionally related miRNAs are usually associated with phenotypically similar diseases.13 This hypothesis was proposed by Lu et al.13 when they analyzed the human miRNA-disease association data in HMDD. Based on this hypothesis, Jiang et al.18 constructed a functionally associated miRNA network, i.e., a human phenome-miRNAome network. For a given disease, they computed the similarity score of all human miRNAs in these networks and then prioritized the miRNAs according to score. The top-ranked miRNAs were expected to be the potential disease miRNAs. However, this model uses only the neighboring information of each miRNA and relies strongly on predicted miRNA-target interactions, which contain false positives and false negatives that can reduce the final prediction accuracy. Shi et al.21 presented a computational framework to identify miRNA-disease associations and further constructed a bipartite miRNA-disease network for systematically analyzing the global properties of miRNA regulation of disease genes. From these analyses, they found that most diseases in the same co-regulated module belong to the same category. Their work extended the previous hypothesis. However, this method is limited in application because of the low accuracy of target prediction and the fact that many disease–gene associations and miRNA-target interactions are unknown. On the basis of the weighted k most similar neighbours, HDMP22 was proposed to predict disease-related miRNAs. HDMP evaluates the functional similarity between miRNAs by considering disease terms and the phenotype similarity between diseases, and assigns higher weight to members of the same miRNA family or cluster. However, HDMP considers only a local network similarity measure and cannot handle diseases without any known related miRNA.
Recently, Zou et al.26 presented the KATZ method, which uses a functional similarity score to denote associations on the basis of walks of different lengths between the miRNA and disease nodes. However, the performance of KATZ is relatively poor when the known associations are sparse.
Machine-learning-based methods have been used to solve the problem by improving the classification accuracy and prediction performance. Jiang et al.29 proposed a Naïve Bayes model to rank candidate disease-related miRNAs through genomic data integration. This method strongly relies on datasets of disease–gene associations and miRNA-target interactions, but over half of human diseases are still unknown. To distinguish positive miRNA-disease associations from negative ones, Jiang et al.27 proposed a support vector machine approach by extracting the features based on miRNA-target data and phenotype similarity data. Considering the assumption that miRNAs implicated in a specific tumor phenotype show aberrant regulation of their target genes, Xu et al.30 prioritized novel disease miRNAs on the basis of the miRNA target-dysregulated network method. The common problem of the two aforementioned methods is that the negative training samples consisting of non-association between miRNAs and diseases do not demonstrate sufficient statistical confidence; the lack of a miRNA-disease association during observation in a biological experiment does not directly indicate absence of such an association. Chen et al.28 developed regularized least squares for miRNA-disease association (RLSMDA) to find potential miRNA candidates for a specific disease. RLSMDA is a semi-supervised method that integrates known disease-miRNA associations, disease–disease similarity dataset, and miRNA–miRNA functional similarity network. Despite its good prediction performance for diseases with or without related miRNAs, RLSMDA does not consider the topology information of the miRNA network.
The aforementioned methods have three main limitations. First, some methods are inefficient at cross-validation. Second, some approaches are unable to predict miRNAs related to isolated diseases. Third, negative samples are difficult to obtain for some machine learning methods. Consequently, we propose a network similarity integration method (NSIM) to address these limitations. The NSIM integrates miRNA similarities, disease similarities, and known miRNA-disease association information to predict potential miRNA-disease associations. The advantages of the NSIM are as follows. First, the method is easy to understand and can be implemented efficiently: cross validations and global predictions for all 381 diseases are run simultaneously. Second, case studies on prostate, breast, and colon neoplasms demonstrate that the NSIM has good predictive performance. Third, the NSIM can also make predictions for isolated diseases.
A disease A can be represented as a graph, DAG(A) = (A, TA, EA), where TA is the set of all ancestor nodes of A including A itself and EA is the set of corresponding links of A. The contribution of ancestor node t to A is defined as follows:
(1) |
(2) |
The semantic similarity score of disease A and disease B is defined as:
(3) |
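Equations (1)–(3) are not reproduced in this copy, so the following is only a minimal sketch of one common way to compute a DAG-based semantic similarity. It assumes that a disease contributes 1 to itself, that each ancestor receives the maximum over its children of a decay factor times the child's contribution, and that the similarity is the shared contribution divided by the total semantic value of both diseases. The decay factor of 0.5, the toy `PARENTS` graph, and all function names are illustrative assumptions, not values taken from the paper.

```python
from typing import Dict, List

# Hypothetical toy MeSH-like DAG: each disease maps to its direct parents.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "neoplasms": ["root"],
    "digestive_neoplasms": ["neoplasms"],
    "colon_neoplasms": ["digestive_neoplasms"],
    "stomach_neoplasms": ["digestive_neoplasms"],
}

DECAY = 0.5  # assumed semantic contribution factor; not stated in the text


def contributions(disease, parents=PARENTS, decay=DECAY):
    """Return {t: contribution of t to `disease`} for every node t in T_A."""
    contrib = {disease: 1.0}          # the disease contributes 1 to itself
    frontier = [disease]
    while frontier:
        node = frontier.pop()
        for parent in parents[node]:
            value = decay * contrib[node]
            # keep the best (maximum) contribution over all paths to the parent
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_similarity(a, b, parents=PARENTS, decay=DECAY):
    """Shared ancestor contributions divided by the summed semantic values."""
    ca = contributions(a, parents, decay)
    cb = contributions(b, parents, decay)
    shared = set(ca) & set(cb)
    numerator = sum(ca[t] + cb[t] for t in shared)
    return numerator / (sum(ca.values()) + sum(cb.values()))
```

Under these assumptions a disease has similarity 1 with itself, and two sibling diseases share only the contributions of their common ancestors.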
(4) |
Here, we define the maximum similarity of disease d and diseases in DSj as the related score between disease d and miRNA j.
We define matrix MM as the miRNA–miRNA function similarity matrix, where MM(i,j) in row i and column j expresses the functional similarity score between miRNA i and miRNA j. By considering the contribution of the similarity diseases, the functional similarity of MM(i,j) is calculated as follows:
(5) |
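Since eqns (4) and (5) are not reproduced in this copy, the sketch below shows one plausible reading of the surrounding text. It assumes that the related score of a disease to a disease set is the maximum pairwise similarity, and that MM(i,j) averages these best-match scores in both directions over the two disease sets DSi and DSj. The averaging form, the function names, and the toy matrix are assumptions for illustration.

```python
import numpy as np


def set_score(d, disease_set, dd):
    """Related score of disease d to a set of diseases: the maximum similarity."""
    return max(dd[d, k] for k in disease_set)


def mirna_functional_similarity(ds_i, ds_j, dd):
    """Assumed form of MM(i, j): symmetric average of best-match scores.

    ds_i, ds_j: lists of disease indices associated with miRNA i and miRNA j.
    dd: disease-disease semantic similarity matrix.
    """
    if not ds_i or not ds_j:
        return 0.0  # no associated diseases, so no evidence of similarity
    forward = sum(set_score(d, ds_j, dd) for d in ds_i)
    backward = sum(set_score(d, ds_i, dd) for d in ds_j)
    return (forward + backward) / (len(ds_i) + len(ds_j))


# Hypothetical 3-disease similarity matrix used only to exercise the function.
dd_toy = np.array([[1.0, 0.5, 0.2],
                   [0.5, 1.0, 0.4],
                   [0.2, 0.4, 1.0]])
mm_01 = mirna_functional_similarity([0, 1], [1, 2], dd_toy)
```

With this form, two miRNAs that share exactly the same disease set obtain a similarity of 1.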
(6) |
Considering disease i and disease j in matrix AS, we count the total number of miRNAs associated with both disease i and disease j and define it as M11. Similarly, M01 represents the total number of miRNAs that are associated only with disease i, and M10 represents the total number of miRNAs that are associated only with disease j. The total number of miRNAs that are associated with neither i nor j is disregarded. For a given disease pair, the similarity value is set to 0 when the total number of miRNAs associated with the two diseases is zero.
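The counts described above suggest a Jaccard-type ratio. As eqn (6) is not reproduced in this copy, the following sketch assumes DAS(i,j) = M11 / (M01 + M10 + M11), with AS stored as a binary miRNA × disease matrix; the function name and the toy matrix are illustrative.

```python
import numpy as np


def association_similarity(AS):
    """Disease-disease similarity from shared miRNAs (assumed Jaccard form).

    AS: binary matrix with miRNAs as rows and diseases as columns.
    """
    A = (np.asarray(AS) > 0).astype(float)
    m11 = A.T @ A                                      # miRNAs shared by each disease pair
    counts = A.sum(axis=0)                             # miRNAs associated with each disease
    union = counts[:, None] + counts[None, :] - m11    # M01 + M10 + M11
    das = np.zeros_like(m11)
    # pairs with no associated miRNAs at all keep the value 0
    np.divide(m11, union, out=das, where=union > 0)
    return das


# Hypothetical example: 3 miRNAs, 3 diseases, the third disease is isolated.
AS_toy = np.array([[1, 1, 0],
                   [1, 0, 0],
                   [0, 1, 0]])
DAS_toy = association_similarity(AS_toy)
```

In this toy case the first two diseases share one of three miRNAs in total, and every entry involving the isolated disease is 0.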
We reconstruct the disease similarity network as:
(7) |
In the miRNA vector space, the similarity between miRNA i and all miRNAs is described as a vector VMMi, and MMi, (the ith row of matrix MM) is used to represent it. Likewise, the similarity between the associations of disease j and all miRNAs is described as a vector VDj, and ASj (the jth column of matrix AS) is used to represent it.
VMMi = MMi,
VDj = ASj
The miRNA space score is defined as
(8) |
The more similar miRNA i is to the miRNAs known to be associated with disease j, that is, the closer the similarity profile of miRNA i in the miRNA–miRNA similarity network is to the association profile of disease j in the known miRNA-disease network, the stronger the predicted association between miRNA i and disease j.
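A minimal sketch of the miRNA space score follows, assuming that eqn (8), which is not reproduced in this copy, is the cosine of the angle between VMMi and VDj. The helper names and toy matrices are illustrative.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity; defined as 0 when either vector is the null vector."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return float(u @ v / (nu * nv))


def nsim_m(MM, AS, i, j):
    """miRNA space score: row i of MM against column j of AS."""
    return cosine(MM[i, :], AS[:, j])


# Hypothetical data: 2 miRNAs, 2 diseases; disease 1 has no known miRNAs.
MM_toy = np.array([[1.0, 0.5],
                   [0.5, 1.0]])
AS_toy = np.array([[1.0, 0.0],
                   [0.0, 0.0]])
score_known = nsim_m(MM_toy, AS_toy, 1, 0)     # miRNA 1 resembles miRNA 0
score_isolated = nsim_m(MM_toy, AS_toy, 0, 1)  # null association column
```

Treating a null vector as giving a score of 0 matches the later statement that NSIM_M(i,j) = 0 for an isolated disease.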
In the disease vector space, the similarity between the associations of miRNA i and all diseases is described as a vector VMi. We could use ASi, (the ith row of matrix AS) to represent it. Similarly, the similarity between disease j and all diseases is described as vector VSDj, and we could use SDj (the jth column of matrix SD) to represent it.
VMi = ASi,
VSDj = SDj
The disease space score is defined as
(9) |
Likewise, the more similar disease j is to the diseases known to be associated with miRNA i, that is, the closer the association profile of miRNA i in the known miRNA-disease network is to the similarity profile of disease j in the disease similarity network, the stronger the predicted association between miRNA i and disease j.
Finally, the miRNA space score and disease space score are integrated together as
NSIM(i,j) = α × NSIM_M(i,j) + (1 − α) × NSIM_D(i,j)    (10)
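The integration in eqn (10) can be sketched for all miRNA-disease pairs at once. The sketch assumes that both space scores are cosine similarities (eqns (8) and (9) are not reproduced in this copy) and that AS is stored with miRNAs as rows and diseases as columns; `_unit` and `nsim_scores` are hypothetical helper names.

```python
import numpy as np


def _unit(X, axis):
    """Scale vectors along `axis` to unit length; null vectors stay zero."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=axis, keepdims=True)
    return np.divide(X, norms, out=np.zeros_like(X), where=norms > 0)


def nsim_scores(MM, AS, SD, alpha=0.5):
    """Combine the miRNA space and disease space scores as in eqn (10).

    MM: miRNA x miRNA similarity, AS: miRNA x disease associations,
    SD: disease x disease similarity.
    """
    score_m = _unit(MM, axis=1) @ _unit(AS, axis=0)   # rows of MM vs columns of AS
    score_d = _unit(AS, axis=1) @ _unit(SD, axis=0)   # rows of AS vs columns of SD
    return alpha * score_m + (1 - alpha) * score_d


# Hypothetical toy inputs (2 miRNAs, 2 diseases).
MM_toy = [[1.0, 0.5], [0.5, 1.0]]
AS_toy = [[1, 0], [0, 0]]
SD_toy = [[1.0, 0.3], [0.3, 1.0]]
scores_toy = nsim_scores(MM_toy, AS_toy, SD_toy, alpha=0.5)
```

Normalizing rows and columns first lets a single matrix product yield every pairwise cosine, which is one way to run predictions for all diseases simultaneously.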
To our knowledge, HDMP,22 RLSMDA,28 KATZ,26 and the global network algorithm developed by Shi et al.21 are the state-of-the-art computational approaches for predicting miRNA-disease associations. We compared the NSIM with RLSMDA and KATZ only, for two reasons: HDMP cannot make predictions for diseases without known associated miRNAs, and the method developed by Shi et al. integrates datasets of disease–gene associations, miRNA-target interactions, and protein interactions, which differ from the dataset used by the NSIM.
We also implemented LOOCV for RLSMDA and KATZ. In the present study, the NSIM achieved an AUC value of 0.9475 with α = 0.5. For RLSMDA, with the optimal parameters selected as described in the literature, the AUC value was 0.8870. For KATZ, the AUC value was 0.9202. The overall AUC comparison between the NSIM, RLSMDA, and KATZ is shown in Fig. 4.
Fig. 4 Comparison of the NSIM, RLSMDA, and KATZ, showing the superior performance of the NSIM over the other two methods.
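The evaluation protocol can be illustrated with a simplified leave-one-out loop. This is a sketch under stated assumptions, not the exact procedure used for the reported AUC values: each known association is hidden in turn, the scores are recomputed by a caller-supplied `score_fn` (a hypothetical placeholder for any predictor), the hidden pair is compared with all pairs that have no known association, and the per-fold results are averaged.

```python
import numpy as np


def loocv_auc(AS, score_fn):
    """Average per-fold AUC over all known associations.

    AS: binary miRNA x disease matrix.
    score_fn: callable mapping an association matrix to a score matrix.
    """
    AS = np.asarray(AS, dtype=float)
    unknown = AS == 0                      # candidate pairs without known association
    fold_aucs = []
    for i, j in zip(*np.nonzero(AS)):
        held_out = AS.copy()
        held_out[i, j] = 0.0               # hide one known association
        scores = score_fn(held_out)
        candidates = scores[unknown]
        target = scores[i, j]
        # fraction of candidates ranked below the hidden pair (ties count half)
        wins = np.sum(target > candidates) + 0.5 * np.sum(target == candidates)
        fold_aucs.append(wins / candidates.size)
    return float(np.mean(fold_aucs))


# Hypothetical predictors used only to exercise the loop.
AS_toy = np.array([[1, 0], [0, 1]])
perfect = lambda A: np.array([[0.9, 0.1], [0.2, 0.8]])
uninformative = lambda A: np.zeros((2, 2))
```

A predictor that always ranks the hidden pair above every candidate yields 1.0, and a constant predictor yields 0.5. The sketch assumes that at least one pair has no known association.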
To obtain a reliable assessment, we tested 19 human diseases, each of which is associated with at least 70 miRNAs. As shown in Table 1, the NSIM achieved its highest AUC of 0.9446 for lung neoplasms and its lowest AUC of 0.8813 for esophageal neoplasms. The average AUC value of the NSIM for the 19 diseases was 0.9125 (Table 1). For RLSMDA, the average AUC value for the 19 diseases was 0.8450, so the average AUC of the NSIM was 0.0675 higher (6.75 percentage points). For KATZ, the average AUC value for the 19 diseases was 0.8945, so the average AUC of the NSIM was 0.0180 higher (1.8 percentage points). For neoplasms and ovarian neoplasms, the NSIM did not achieve the highest AUC among the three methods; for the 17 other diseases, its AUC values were higher than those of both RLSMDA and KATZ. Overall, the NSIM predicted more accurately than RLSMDA and KATZ.
Disease name | Number of associated microRNAs | AUC (NSIM) | AUC (RLSMDA) | AUC (KATZ) |
---|---|---|---|---|
Breast neoplasms | 202 | 0.9353 | 0.8951 | 0.9296 |
Carcinoma, hepatocellular | 214 | 0.9119 | 0.8631 | 0.9012 |
Carcinoma, non-small-cell lung | 95 | 0.9031 | 0.8342 | 0.8800 |
Carcinoma, renal cell | 107 | 0.8926 | 0.8172 | 0.875 |
Carcinoma, squamous cell | 80 | 0.9048 | 0.8386 | 0.8895 |
Colonic neoplasms | 78 | 0.8834 | 0.8232 | 0.8728 |
Colorectal neoplasms | 147 | 0.8845 | 0.8461 | 0.8819 |
Esophageal neoplasms | 74 | 0.8813 | 0.7747 | 0.8466 |
Glioblastoma | 96 | 0.9006 | 0.7934 | 0.8595 |
Glioma | 71 | 0.9131 | 0.8704 | 0.9114 |
Heart failure | 120 | 0.9071 | 0.8454 | 0.8636 |
Lung neoplasms | 132 | 0.9446 | 0.7844 | 0.9249 |
Melanoma | 141 | 0.9185 | 0.8850 | 0.8903 |
Neoplasms | 110 | 0.9436 | 0.8339 | 0.9751 |
Ovarian neoplasms | 114 | 0.9286 | 0.9630 | 0.9271 |
Pancreatic neoplasms | 99 | 0.9312 | 0.8991 | 0.9126 |
Prostatic neoplasms | 118 | 0.9209 | 0.8665 | 0.883 |
Stomach neoplasms | 174 | 0.9104 | 0.8217 | 0.8984 |
Urinary bladder neoplasms | 92 | 0.9227 | 0.8493 | 0.8732 |
Prostate neoplasm is the most common cancer among males in 84 countries,35 especially in developed countries. It is the second most common type of cancer and the fifth leading cause of cancer-related death among men worldwide.37 MiRNAs are aberrantly expressed during the progression of prostate neoplasms and are therefore promising diagnostic or prognostic biomarkers. For example, miR-409-3p, miR-361-3p, miR-133b, miR-221, and miR-128 are underexpressed, whereas miR-375, miR-141, miR-378*, and miR-203 are upregulated in prostate cancer.38–40 Candidate miRNAs were ranked by the scores obtained from the NSIM. The top 20 potential miRNAs associated with prostate neoplasms and the evidence for these associations are listed in Table 2. Among the top 20 predicted prostate-related miRNAs, 18 have been confirmed by dbDEMC or miRCancer. The highest-ranked miRNA not confirmed by either database is hsa-mir-17 (ranked 4th). However, the literature41,42 reports that the miR-17 family is overexpressed in prostate neoplasms, targets the p300/CBP-associated factor, and modulates androgen receptor transcriptional activity in cultured prostate neoplasm cells.
Rank | miRNA | Evidence |
---|---|---|
1 | Hsa-mir-182 | dbDEMC, miRCancer |
2 | Hsa-mir-143 | dbDEMC, miRCancer |
3 | Hsa-mir-21 | dbDEMC, miRCancer |
4 | Hsa-mir-17 | PMID: 27650539 |
5 | Hsa-mir-34a | dbDEMC, miRCancer |
6 | Hsa-mir-100 | dbDEMC, miRCancer |
7 | Hsa-mir-126 | dbDEMC |
8 | Hsa-mir-150 | dbDEMC |
9 | Hsa-mir-20a | miRCancer |
10 | Hsa-mir-142 | Unconfirmed |
11 | Hsa-mir-200a | dbDEMC |
12 | Hsa-mir-203 | miRCancer |
13 | Hsa-mir-141 | miRCancer |
14 | Hsa-mir-31 | dbDEMC, miRCancer |
15 | Hsa-mir-146a | miRCancer |
16 | Hsa-mir-96 | dbDEMC, miRCancer |
17 | Hsa-mir-200c | dbDEMC |
18 | Hsa-mir-200b | miRCancer |
19 | Hsa-mir-223 | dbDEMC, miRCancer |
20 | Hsa-mir-9 | dbDEMC |
Breast neoplasm is the most common invasive cancer among women, especially in developed countries, accounting for 25% of cancer cases among women. MiRNAs play regulatory roles in the invasion and metastasis of breast neoplasms. For example, miR-182 and miR-21 are overexpressed in breast neoplasms,12,35 whereas miR-205, miR-200c, miR-141, and miR-429 are downregulated in breast cancer.43 The top 20 potential miRNAs associated with breast neoplasms and the evidence for these associations are listed in Table 3. Among these candidate miRNAs, only 4 were not confirmed by the dbDEMC or miRCancer dataset. Of these, hsa-mir-542 has been reported in the literature44 to induce angiogenic inhibition in breast neoplasms.
Rank | miRNA | Evidence |
---|---|---|
1 | Hsa-mir-99a | dbDEMC, miRCancer |
2 | Hsa-mir-138 | dbDEMC |
3 | Hsa-mir-142 | miRCancer |
4 | Hsa-mir-106a | dbDEMC |
5 | Hsa-mir-130a | dbDEMC, miRCancer |
6 | Hsa-mir-378a | Unconfirmed |
7 | Hsa-mir-150 | dbDEMC, miRCancer |
8 | Hsa-mir-185 | dbDEMC, miRCancer |
9 | Hsa-mir-15b | dbDEMC |
10 | Hsa-mir-98 | dbDEMC, miRCancer |
11 | Hsa-mir-192 | dbDEMC |
12 | Hsa-mir-542 | PMID: 26272182 |
13 | Hsa-mir-196b | dbDEMC |
14 | Hsa-mir-92b | dbDEMC |
15 | Hsa-mir-186 | dbDEMC |
16 | Hsa-mir-30e | Unconfirmed |
17 | Hsa-mir-372 | dbDEMC |
18 | Hsa-mir-130b | dbDEMC |
19 | Hsa-mir-370 | dbDEMC |
20 | Hsa-mir-449a | Unconfirmed |
Colon neoplasm is the third most common cancer in the digestive tract worldwide. MiRNAs can serve as accurate diagnostic biomarkers of colon neoplasms and can help predict them.45,46 Differential miRNA expression is therefore promising for the early diagnosis and screening of colon neoplasms. For example, miR-21, miR-155, miR-31, miR-92a, and miR-17 are involved in the development of colon neoplasms.47 The top 20 potential miRNAs associated with colon neoplasms and the evidence for these associations are listed in Table 4. Among these candidate miRNAs, 5 were not confirmed by the dbDEMC or miRCancer dataset. Nevertheless, all 5 have been identified in the literature; the corresponding PMIDs are given in Table 4.
Rank | miRNA | Evidence |
---|---|---|
1 | Hsa-mir-20a | dbDEMC |
2 | Hsa-mir-18a | dbDEMC, miRCancer |
3 | Hsa-mir-19b | dbDEMC |
4 | Hsa-mir-21 | dbDEMC, miRCancer |
5 | Hsa-mir-143 | dbDEMC, miRCancer |
6 | Hsa-mir-19a | dbDEMC |
7 | Hsa-mir-155 | dbDEMC, miRCancer |
8 | Hsa-mir-92a | PMID: 26463716 |
9 | Hsa-mir-125b | PMID: 24774301 |
10 | Hsa-mir-29b | PMID: 26466603 |
11 | Hsa-mir-34a | dbDEMC, miRCancer |
12 | Hsa-mir-146a | dbDEMC |
13 | Hsa-mir-16 | PMID: 22049153 |
14 | Hsa-mir-106b | dbDEMC |
15 | Hsa-let-7a | miRCancer |
16 | Hsa-mir-181a | dbDEMC, miRCancer |
17 | Hsa-mir-31 | dbDEMC, miRCancer |
18 | Hsa-mir-15a | dbDEMC |
19 | Hsa-mir-150 | PMID: 24705249 |
20 | Hsa-mir-221 | dbDEMC |
The above results demonstrate that the NSIM performs well in predicting potential disease-associated miRNA candidates.
Take an isolated disease j as an example. Here VDj = ASj is the null vector, so NSIM_M(i,j) = 0, and the prediction score between miRNA i and disease j is given by NSIM_D(i,j) alone. The disease similarity consists of the disease semantic similarity (DD, eqn (3)) and the disease similarity derived from known miRNA-disease associations (DAS, eqn (6)). When disease j is isolated, DAS(i,j) is zero for every disease i, whereas DD is calculated from the MeSH disease DAG and does not depend on the associated miRNAs. An isolated disease is a disease without any known related miRNAs; its relations to other diseases still exist. We therefore use DD(i,j) as SD(i,j) to calculate NSIM_D(i,j), so the method can be applied to predict miRNAs related to isolated diseases.
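The isolated-disease case can be sketched as follows, assuming that the disease space score is the cosine between each miRNA's association row and the semantic similarity column of the isolated disease. The function name and the toy matrices are hypothetical.

```python
import numpy as np


def isolated_disease_scores(AS, DD, j):
    """Score every miRNA against isolated disease j using DD in place of SD.

    AS: miRNA x disease association matrix whose column j is all zeros.
    DD: disease x disease semantic similarity matrix.
    """
    AS = np.asarray(AS, dtype=float)
    sd_j = np.asarray(DD, dtype=float)[:, j]      # similarity of disease j to all diseases
    denom = np.linalg.norm(AS, axis=1) * np.linalg.norm(sd_j)
    scores = np.zeros(AS.shape[0])
    # miRNAs without any known disease keep a score of 0
    np.divide(AS @ sd_j, denom, out=scores, where=denom > 0)
    return scores


# Hypothetical example: disease 2 is isolated and semantically close to disease 0.
AS_toy = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0]])
DD_toy = np.array([[1.0, 0.2, 0.8],
                   [0.2, 1.0, 0.1],
                   [0.8, 0.1, 1.0]])
iso_scores = isolated_disease_scores(AS_toy, DD_toy, 2)
ranking = list(np.argsort(-iso_scores))
```

In this toy case the miRNA linked only to the semantically closest disease ranks first, which is the behaviour the paragraph above describes.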
The average AUC of NSIM to predict isolated diseases is 0.8146. The predicted results of breast neoplasms are listed in Table 5.
Rank | miRNA name | Evidence |
---|---|---|
1 | Hsa-mir-99a | dbDEMC, miRCancer |
2 | Hsa-mir-663b | Unconfirmed |
3 | Hsa-mir-138 | dbDEMC |
4 | Hsa-mir-331 | dbDEMC |
5 | Hsa-mir-185 | dbDEMC, miRCancer |
6 | Hsa-mir-372 | dbDEMC |
7 | Hsa-mir-378a | Unconfirmed |
8 | Hsa-mir-1224 | Unconfirmed |
9 | Hsa-mir-130a | dbDEMC, miRCancer |
10 | Hsa-mir-98 | dbDEMC, miRCancer |
11 | Hsa-mir-532 | dbDEMC |
12 | Hsa-mir-370 | dbDEMC |
13 | Hsa-mir-542 | Unconfirmed |
14 | Hsa-mir-498 | dbDEMC |
15 | Hsa-mir-371a | Unconfirmed |
16 | Hsa-mir-142 | miRCancer |
17 | Hsa-mir-130b | dbDEMC |
18 | Hsa-mir-150 | dbDEMC, miRCancer |
19 | Hsa-mir-449a | Unconfirmed |
20 | Hsa-mir-15b | dbDEMC |
The current version of the NSIM has limitations. Despite its good performance, the NSIM is constructed on the basis of known miRNA-disease associations, and the number of known associations affects the prediction accuracy: the more associations are available, the more accurate the prediction is. Hence, the performance of the NSIM could be improved as more miRNA-disease associations are obtained. Furthermore, the method considers only semantic relations when calculating the disease similarity score. Information on gene-disease, miRNA-lncRNA, and miRNA-target gene interactions could further improve the similarity measures for miRNAs and diseases.
Footnote |
† Electronic supplementary information (ESI) available: A supplemental table is available as a single excel file. See DOI: 10.1039/c7ra05348g |
This journal is © The Royal Society of Chemistry 2017 |