Enhancing K-Means Algorithm with Initial Cluster Centers Derived from Data Partitioning along the Data Axis with the Highest Variance

In this paper, we propose an algorithm to compute initial cluster centers for K-means clustering. Data in a cell is partitioned using a cutting plane that divides cell in two smaller cells. The plane is perpendicular to the data axis with the highest variance and is designed to reduce the sum squared errors of the two cells as much as possible, while at the same time keep the two cells far apart as possible. Cells are partitioned one at a time until the number of cells equals to the predefined number of clusters, K. The centers of the K cells become the initial cluster centers for K-means. The experimental results suggest that the proposed algorithm is effective, converge to better clustering results than those of the random initialization method. The research also indicated the proposed algorithm would greatly improve the likelihood of every cluster containing some data in it.




References:
[1] A. Jain, and R. Dubes, Algorithms for Clustering Data, Prentice-Hall,
Englewood Cliffs. NJ, 1998.
[2] A. Likas, N. Vlassis and J.J. Verbeek, "The Global k-means Clustering
algorithm", Pattern Recognition , Volume 36, Issue 2, 2003, pp. 451-
461.
[3] P.S. Bradley and U.M. Fayyad, "Refining initial points for K-means
Clustering", Proceeding of The Fifteenth International Conference on
Machine Learning, Morgan Kaufmann, San Francisco, CA, 1998, pp.
91-99.
[4] C.L. Blake, C.J. Merz. UCI Repository of machine learning databases.
University of California, Irvine, Department of Information and
Computer Science, 1998.
[5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan
Kaufmann Publishers, San Diego, 2001.
[6] P. Mitra, C.A. Murthy, S.K. Pal, "Density based multiscale data
condensation", IEEE Trans, Pattern Anal, Machine Intell, 24 (6), 2002,
pp. 734-747.
[7] S. S. Khan and A. Ahmad, "Cluster Center Initialization for K-mean
Clustering", Pattern Recognition Letters, Volume 25, Issue 11,
2004, pp. 1293-1302
[8] Y. Sirisathitkul, S. Auwatanamongkol and B. Uyyanonvara, "Color
image quantization using distances between adjacent colors along the
color axis with highest color variance", Pattern Recognition
Letters, Volume 25, Issue 9, 2004, pp. 1025-1043.