MIBiClus: Mutual Information based Biclustering Algorithm

Most of the biclustering/projected clustering algorithms are based either on the Euclidean distance or correlation coefficient which capture only linear relationships. However, in many applications, like gene expression data and word-document data, non linear relationships may exist between the objects. Mutual Information between two variables provides a more general criterion to investigate dependencies amongst variables. In this paper, we improve upon our previous algorithm that uses mutual information for biclustering in terms of computation time and also the type of clusters identified. The algorithm is able to find biclusters with mixed relationships and is faster than the previous one. To the best of our knowledge, none of the other existing algorithms for biclustering have used mutual information as a similarity measure. We present the experimental results on synthetic data as well as on the yeast expression data. Biclusters on the yeast data were found to be biologically and statistically significant using GO Tool Box and FuncAssociate.





References:
[1] Kevin Y. Yip, David W. Cheung, and Micheal K. Ng:HARP: A Practical
Projected Clustering Algorithm. IEEE Transactions On Knowledge And
Data Engineering, volume 16, IEEE Computer Society, 2004.
[2] Charu C. Aggarwal, Cecilia Procopiuc, Joel L. Wolf, Philip S. Yu, and
Jong Soo Park: Fast Algorithms For Projected Clustering. ACM SIGMOD
Int-l Conf. Management of Data, pages 61-72, (1999).
[3] Charu C. Aggarwal and Philip S. Yu:Finding generalized projected clusters
in high dimensional spaces. ACM SIGMOD Int-l Conf. Management
of Data, pages 70-81, 2000.
[4] Y. Cheng and G. M. Church: Biclustering Of Gene Expression Data.
System Molecular Biology, volume 8, pages 1-93. System Molecular
Biology, 2000.
[5] H. Wang, W. Wang, J. Yang, and P. S. Yu:Clustering by Pattern Similarity
in Large Data Sets. Bull, Math. Biol. 46, pages 515 -527. ACM Press,
1984.
[6] Gad Getz, Erel Levine, and Eytan Domany: Coupled Two-Way Clustering
Analysis Of Gene Microarray Data. Cell Biology, volume 97, pages
12079-12084. PNAS, 2000.
[7] Yuval Kluger, Rones Basri, Joseph T. Chang, and Mark Gerstein: Spectral
Biclustering Of Microarray Data: Coclustering Genes And Conditions.
Genome Research, Cold Spring Harbor Laboratory Press, 2003.
[8] J. Ihmels, G. Friedlander, S. Bergmann, Y. Ziv O. Sarig, and N. Barkai:
Revealing Modular Organization In The Yeast Transcription Network.
Nature Genetics, volume 31, pages 1-370. Nature Publishing Group,
2002.
[9] S. Bergmann, J. Ihmels, and N. Barkai:Iterative signature algorithm
for the analysis of large-scale gene expression data. Physical Review,
volume 67, pages 1-18. American Physical Society, 2003.
[10] M. Kloster, C. Tang, and N.S. Wingreen:Finding regulatory modules
through large-scale gene-expression data analysis. Bioinformatics, volume
21, pages 1172-1179, 2005.
[11] Xiaowen Liu and Lusheng Wang: Computing the maximum similarity
bi-clusters of gene expression data. bioinformatics, volume 23, pages 50-
56, 2007.
[12] R.Steuer and J.Kurths and C.O.Daub and J.Weiseand and J.Selbig:The
mutual information:Detecting and evaluating depencies between variables.
Bioinformatics,volume 18 Suppl 2, pages S231-S240, 2002.
[13] O.Maimon I. Priness and I.Ben-Gal: Evaluation of gene expression
clustering via mutual information distance measure. BMC Bioinformatics,
2007.
[14] A.J. Butte and I.S. Kohane: Mutual Information relevance networks,
Functional Genomic clustering using pairwise entropy measurements.
PSB,volume 5, pages 415-426, 2000.
[15] X.Zhou, X. Wang, E.R. Dougherty, D.Russ and E. Suh: Gene Clustering
Based on Clusterwide Mutual Information. Jouranl of Computational
Biology, volume 11, Number 1, pages 147-161, 2004.
[16] N.SLonim, G.S.Atwal,G.Tkacik and W.Bialek: Information based clustering.
PNAS, volume 102, pages 18297-18302, 2005.
[17] MacQueen: Some methods for classification and analysis of multivariate
observations. Proc.of the 5th Berkeley Symposium om Mathematical
Statistics and Probability, 1965.
[18] T. Kohonen: Self Organizing Maps, Springer, 1997.
[19] R. Sharan and R. Shamir: CLICK: a clustering algorithm with applications
to gene expression analysis Proceedings of the 8th International
Conference on Intelligent Systems for Molecular Biology, 2000.
[20] N. Slonim: The information Bottleneck: Theory and applications. In
Ph.D. Thesis Tel Aviv University, Computer Science Department, 2002.
[21] Neelima Gupta and Seema Aggarwal: MIB: Using Mutual Information
for Biclustering .
[22] G.S.Michaels,D.B.Carr,M.Askenazi,S.Fuhrman,X.Wen and R.Somogyi:
Cluster Analysis and Data Visualization of Large Scale Gene Expression
Data. PSB, volume 3, pages 42-53, 1998.
[23] S.Mallela I.S.Dhillon and D.S.Modha: Information Theoritic COClustering.
Proceedings of SIGKDD-03, 2003.
[24] N.Slonim and N.Tishby: Document clustering using word clusters via
the information bottleneck method. Proceedings of the 23rd annual
international ACM SIGIR conference on Research and development in
information retrieval, p.208-215, July 24-28, 2000, Athens, Greece
[25] Saccharomyces Genome Database, http://www.yeastgenome.org/.
[26] Funcassociate: The Gene Set Functionator,
http://llama.med.harvard.edu/cgi/func1/funcassociate.
[27] Simon Haykin.: Neural Networks-A comprehensive Foundation 2nd Ed.,
Prentice Hall of India, 2007.