K-Means for Spherical Clusters with Large Variance in Sizes

Data clustering is an important data exploration technique with many applications in data mining. The k-means algorithm is well known for its efficiency in clustering large data sets. However, this algorithm is suitable for spherical shaped clusters of similar sizes and densities. The quality of the resulting clusters decreases when the data set contains spherical shaped with large variance in sizes. In this paper, we introduce a competent procedure to overcome this problem. The proposed method is based on shifting the center of the large cluster toward the small cluster, and recomputing the membership of small cluster points, the experimental results reveal that the proposed algorithm produces satisfactory results.




References:
[1] R. Agrawal , J. Gehrke , D. Gunopulos, and P. Raghavan, "Automatic
SubspaceClustering of High Dimensional Data for Data Mining
Applications," Proc.ACM SIGMOD Int. Conf. on Management of Data,
Seattle, WA, 1998, pp. 94-105.
[2] M R. Anderberg Cluster Analysis for Applications, Academic Press,
1973.
[3] M. Ankerst , M. Breunig , H.P. Kriegel , and J. Sander, "OPTICS:
Ordering Points to Identify the Clustering Structure," Proc. ACM
SIGMOD Int. Con. Management of Data Mining, 1999, pp. 49-60.
[4] L. Bottou, and Y. Bengio, "Convergence properties of the k-mean
algorithm," The MIT press, Cambridge, MA, 1995, pp. 585-592.
[5] P. S. Bradley , and U. M. Fayyad, "Refining Initial Points for K-Means
Clustering," Proc. of the 15th International Conference on Machine
Learning (ICML98), J. Shavlik (ed.). Morgan Kaufmann, San Francisco,
1998, pp. 91-99.
[6] S. Deelers, and S. Auwatanamongkol, "Enhancing K-Means Algorithm
with Initial Cluster Centers Derived from Data Partitioning along the
Data Axis with the Highest Variance," PWASET vol 26, 2007, pp. 323-
328.
[7] R.O. Duda , and P.E. Hart, Pattern Classification and Scene Analysis.
John Wiley Sons, New York, 1973.
[8] M. Ester, H.P. Kriegel, J. Sander, and X. Xu, "A Density-based
Algorithm for Discovering Clusters in Large Spatial Databases with
Noise," Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining,
AAAI Press, Portland, 1996, pp.226-231.
[9] A. M. Fahim, A. M. Salem ,F. A. Torkey, and M. Ramadan, "An
efficient enhanced k-means clustering algorithm," Journal of Zhejiang
University SCIENCE A, vol 7(10), 2006, pp. 1626-1633.
[10] U.M. Fayyad , G. Piatetsky-Shapiro , P. Smyth , and R. Uthurusamy
Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press,
1996.
[11] A. Gersho, and R.M. Gray Vector Quantization and Signal
Compression, Kluwer Academic, Boston, 1992.
[12] S. Guha , R. Rastogi, and K. Shim, "CURE: An Efficient Clustering
Algorithms for Large Databases," Proc. ACM SIGMOD Int. Conf. on
Management of Data, Seattle, WA, 1998, pp. 73-84.
[13] V. Hautamaeki , S. Cherednichenko , I. Kaerkkaeinen , T. Kinnunen, and
P. Fraenti, "Improving K-Means by Outlier Removal," SCIA 2005,
LNCS 3540, 2005, pp. 978-987.
[14] V. Hautamaeki , I. Kaerkkaeinen, and P. Fraenti, "Outlier detection
using k-nearest neighbourgraph," In: 17th International Conference on
Pattern Recognition (ICPR 2004), Cambridge, United Kingdom, 2004,
pp. 430-433.
[15] A. Hinneburg, and D. Keim, "An Efficient Approach to Clustering in
Large Multimedia Databases with Noise," Proc. 4th Int. Conf. on
Knowledge Discovery and Data Mining, New York City, NY. 1998.
[16] Z. Huang, "A fast clustering algorithm to cluster very large categorical
data sets in data mining," Proceedings of the SIGMOD Workshop on
Research Issues on Data Mining and Knowledge Discovery, Dept. of
Computer Science, The University of British Columbia, Canada, 1997,
pp. 1-8.
[17] A. K. Jain, and R. C. Dubes, Algorithms for Clustering Data, Prentice
Hall, 1988.
[18] L. Kaufman, and P. Rousseeuw, "Finding Groups in Data: An
Introduction to Cluster Analysis," Wiley, 1990.
[19] J.B. MacQueen, "Some methods for classification and analysis of
multivariate observations. Proc. 5th Symp. Mathematical Statistics and
Probability, Berkelely, CA, Vol(1), 1967, pp. 281297.
[20] R.T. Ng, and J. Han, "Efficient and Effective Clustering Methods for
Spatial Data Mining," Proc. 20th Int. Conf. on Very Large Data Bases.
Morgan Kaufmann Publishers, San Francisco, CA, 1994, pp. 144-155.
[21] D. T. Pham , S. S. Dimov, and C. D. Nguyen, "Selection of k in K-means
clustering," Mechanical Engineering Science, vol(219), 2004, pp. 103-
119.
[22] S. Ray, and R. H. Turi, "Determination of number of clusters in k-means
clustering and application in colour image segmentation," In
Proceedings of the 4th International Conference on Advances in Pattern
Recognition and Digital Techniques, 1999, pp.137-143.
[23] G. Sheikholeslami, S. Chatterjee, and A. Zhang, "Wave-Cluster: A
Multi-Resolution Clustering Approach for Very Large Spatial
Databases," Proc. 24th Int. Conf. on Very Large Data Bases. New York,
1998, pp. 428-439.
[24] R. Sibson, "SLINK: an optimally efficient algorithm for the single-link
cluster method," The Comp. Journal, 16(1), 1973, pp. 30-34.
[25] T. Zhang, R. Ramakrishnan, and M. Linvy, "BIRCH: An Efficient Data
Clustering Method for Very Large Databases," Proc. ACM SIGMOD
Int. Conf. on Management of Data. ACM Press, New York, 1996, pp.
103-114.