Application of KL Divergence for Estimation of Each Metabolic Pathway Genes

Developing methods to estimate gene function is
an important task in bioinformatics. One approach to such
annotation is to identify the metabolic pathway in which a gene is
involved. Because gene expression data reflect various intracellular
phenomena, they are considered to be related to gene
function. However, estimating gene function from expression data
with high accuracy has been difficult. The low accuracy is thought
to stem from the difficulty of measuring gene expression
precisely: even when measured under the same condition,
expression values usually vary. In this study, we proposed a
feature extraction method that focuses on the variability of gene
expression to estimate the metabolic pathway of each gene accurately. First,
we estimated the distribution of each gene's expression from replicate
data. Next, we calculated the similarity between all gene pairs using the
Kullback-Leibler (KL) divergence, a measure of how much one probability
distribution differs from another. Finally, we used the resulting
similarity vectors as feature vectors to train a multiclass support
vector machine (SVM) that identifies the metabolic pathway of each gene.
To evaluate the method, we applied it to budding yeast and trained the
multiclass SVM to distinguish seven metabolic pathways. The accuracy
obtained with our method was higher than that obtained from the raw
gene expression data. Thus, our method, based on KL divergence, is
useful for identifying the metabolic pathways of genes.
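
The feature extraction steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the study: the abstract does not say how each gene's distribution was estimated, so the sketch assumes a univariate Gaussian fitted to the replicates, for which the KL divergence has a closed form. The function names and the toy expression values are hypothetical.

```python
import math
from statistics import mean, variance


def fit_gaussian(replicates):
    """Return (mean, variance) for one gene's replicate measurements.

    Assumption: expression is modelled as a univariate Gaussian.
    """
    mu = mean(replicates)
    var = variance(replicates)  # unbiased sample variance, needs >= 2 replicates
    return mu, max(var, 1e-12)  # small floor avoids division by zero


def kl_gaussian(p, q):
    """KL(p || q) for univariate Gaussians given as (mean, variance) pairs."""
    mu_p, var_p = p
    mu_q, var_q = q
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)


def kl_feature_vectors(expression):
    """Map {gene: [replicates]} to {gene: [KL(gene || other), ...]}.

    Each gene's feature vector holds its divergence to every gene,
    listed in sorted gene-name order.
    """
    genes = sorted(expression)
    params = {g: fit_gaussian(expression[g]) for g in genes}
    return {g: [kl_gaussian(params[g], params[h]) for h in genes]
            for g in genes}


# Made-up replicate values, for illustration only.
expression = {
    "geneA": [5.0, 5.2, 4.8],
    "geneB": [5.1, 5.3, 4.9],
    "geneC": [9.0, 7.0, 11.0],
}
features = kl_feature_vectors(expression)
```

Each row of `features` would then serve as the input vector for a multiclass SVM, with the gene's known pathway as its label. Note that the KL divergence is asymmetric, so KL(A || B) generally differs from KL(B || A); the sketch keeps the directed value rather than symmetrizing it.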




