Genetic variant effect prediction by supervised nonnegative matrix tri-factorization
Abstract
Discriminating between deleterious and neutral mutations among the numerous non-synonymous single nucleotide variants (nsSNVs) observed through whole exome sequencing (WES) remains a major challenge. Many machine learning methods have been developed to predict variant consequences from protein amino acid sequences, from protein structures, or from their integration with features extracted from various gene-level data and phenotype information. Because the features are numerous and come from heterogeneous sources, the choice of integration method plays an important role in predictive models. In this study, we propose a novel supervised nonnegative matrix tri-factorization (sNMTF) algorithm that integrates current variant prediction scores with gene-level data and disease networks. sNMTF integrates all input data into a new feature space, which provides the inputs for training a classifier. To assess the proposed model, we used two benchmark datasets. The first contained 11 207 deleterious and 19 839 neutral nsSNPs, and the second contained 4416 deleterious and 4960 neutral nsSNPs. On both datasets, our method outperformed existing nsSNV effect prediction approaches, whether ensemble-based or not, with a prediction accuracy on average 15% higher than that of other ensemble scores. In addition, excluding any one of the data types integrated into the final model led to a substantial decrease in deleterious variant prediction performance. The proposed model can be used as an extensible framework for integrating more heterogeneous sources.
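
For readers unfamiliar with matrix tri-factorization, the following is a minimal sketch of plain, unsupervised NMTF, not the supervised sNMTF algorithm proposed here. It approximates a nonnegative matrix X by the product F S Gᵀ using standard multiplicative updates for the squared Frobenius error. The function name `nmtf`, the toy matrix, and the interpretation of rows as variants and columns as prediction scores are illustrative assumptions only.

```python
import numpy as np

def nmtf(X, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative X (n x m) as F @ S @ G.T.

    Generic unsupervised NMTF with multiplicative updates; a hypothetical
    helper for illustration, not the paper's supervised variant.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random starts keep every update well defined.
    F = rng.random((n, k1)) + eps   # row (e.g. variant) factors
    S = rng.random((k1, k2)) + eps  # compact core linking the two factor sets
    G = rng.random((m, k2)) + eps   # column (e.g. score) factors
    errors = []
    for _ in range(n_iter):
        # Each update rescales one factor elementwise by a ratio of
        # nonnegative terms, so nonnegativity is preserved.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        errors.append(np.linalg.norm(X - F @ S @ G.T))
    return F, S, G, errors

# Toy data: 30 hypothetical variants described by 12 hypothetical scores.
X_demo = np.random.default_rng(1).random((30, 12))
F_demo, S_demo, G_demo, err_demo = nmtf(X_demo, k1=4, k2=3)
print(F_demo.shape, S_demo.shape, G_demo.shape)
```

In a pipeline of the kind described above, the rows of `F_demo` would play the role of the low-dimensional feature vectors passed to a classifier. How class labels and additional data matrices enter the factorization is specific to sNMTF and is not captured by this sketch.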