Evaluation of Clustering Based on Preprocessing in Gene Expression Data

Microarrays have become effective, widely used tools in biological and medical research for addressing a wide range of problems, including the classification of disease subtypes and tumors. Many statistical methods are available for analyzing and organizing these complex data into meaningful information, and one of the main goals in analyzing gene expression data is to detect samples or genes with similar expression patterns. In this paper, we describe and compare the performance of several clustering methods under different data preprocessing strategies, including normalization and noise reduction. We also evaluate each of these clustering methods with validation measures on both simulated data and real gene expression data. The results indicate that clustering methods commonly used in microarray data analysis are affected by normalization and by the degree of noise in the datasets.
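The kind of comparison described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the simulated design (two sample groups, an additive per-array offset as the artifact), per-sample median centering as the normalization step, a plain k-means with random restarts as the clustering method, and the Rand index against the known labels as the validation measure.

```python
import random
from statistics import median

def simulate(rng, n_per_group=10, n_genes=30, n_signal=10):
    """Two sample groups; group 1 is shifted by +2 on the first n_signal genes.
    Each sample also gets an additive array offset (a hypothetical artifact)."""
    data, labels = [], []
    for group in (0, 1):
        for _ in range(n_per_group):
            offset = rng.uniform(-3.0, 3.0)
            row = [offset
                   + (2.0 if group == 1 and g < n_signal else 0.0)
                   + rng.gauss(0.0, 0.3)
                   for g in range(n_genes)]
            data.append(row)
            labels.append(group)
    return data, labels

def median_center(data):
    """Assumed normalization: subtract each sample's median expression."""
    centered = []
    for row in data:
        m = median(row)
        centered.append([x - m for x in row])
    return centered

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(data, k, rng, n_starts=10, n_iter=50):
    """Lloyd's k-means with random restarts; keeps the lowest-SSE solution."""
    best_sse, best_assign = float("inf"), None
    for _ in range(n_starts):
        centers = [list(c) for c in rng.sample(data, k)]
        assign = None
        for _ in range(n_iter):
            new_assign = [min(range(k), key=lambda j: sq_dist(row, centers[j]))
                          for row in data]
            if new_assign == assign:
                break
            assign = new_assign
            for j in range(k):
                members = [row for row, a in zip(data, assign) if a == j]
                if members:  # an empty cluster keeps its previous center
                    centers[j] = [sum(col) / len(members) for col in zip(*members)]
        sse = sum(sq_dist(row, centers[a]) for row, a in zip(data, assign))
        if sse < best_sse:
            best_sse, best_assign = sse, assign
    return best_assign

def rand_index(a, b):
    """Fraction of sample pairs on which two partitions agree."""
    n = len(a)
    agree = sum((a[i] == a[j]) == (b[i] == b[j])
                for i in range(n) for j in range(i + 1, n))
    return agree / (n * (n - 1) / 2)

def compare(seed=0):
    """Cluster the same simulated samples before and after normalization."""
    rng = random.Random(seed)
    data, labels = simulate(rng)
    ri_raw = rand_index(labels, kmeans(data, 2, rng))
    ri_norm = rand_index(labels, kmeans(median_center(data), 2, rng))
    return ri_raw, ri_norm

if __name__ == "__main__":
    ri_raw, ri_norm = compare()
    print(f"Rand index without normalization: {ri_raw:.2f}")
    print(f"Rand index with median centering: {ri_norm:.2f}")
```

In this toy setting the array offset dominates the distances between raw samples, so the clusters found without normalization tend to follow the artifact and not the biological groups. Real microarray normalization is usually more involved than median centering, so the sketch only conveys the structure of such an evaluation.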




