Improved K-Modes for Categorical Clustering Using Weighted Dissimilarity Measure

K-Modes is an extension of the K-Means clustering algorithm, developed to cluster categorical data, in which the mean is replaced by the mode. The dissimilarity measure proposed by Huang is simple matching, which counts the attributes on which two objects mismatch. Because the weights of attribute values contribute substantially to clustering, in this paper we propose a new weighted dissimilarity measure for K-Modes, based on the ratio of the frequency of attribute values in the cluster to their frequency in the data set. The new weighted measure is evaluated on data sets obtained from the UCI data repository. The results are compared with those of K-Modes and K-representative and show that the new measure generates clusters with high purity.
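As an illustration only, the sketch below contrasts simple matching with one possible frequency-ratio weighting. The abstract does not state the exact formula, so the weighting used here (a matched attribute contributes 1 minus the ratio of the value's count in the cluster to its count in the data set) is an assumption. The function names and the toy data are likewise hypothetical.

```python
from collections import Counter

def simple_matching(x, mode):
    """Huang-style simple matching: number of mismatched attributes."""
    return sum(1 for a, b in zip(x, mode) if a != b)

def cluster_mode(rows):
    """Most frequent value of each attribute among the rows of a cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def weighted_dissimilarity(x, mode, cluster_rows, all_rows):
    """Hypothetical weighted measure (an assumption, not the paper's formula).

    A mismatch costs 1. A match costs 1 - (count in cluster / count in data set),
    so a match on a value concentrated in this cluster costs close to 0.
    Assumes cluster_rows is a subset of all_rows.
    """
    total = 0.0
    for j, (a, b) in enumerate(zip(x, mode)):
        if a != b:
            total += 1.0
        else:
            in_cluster = sum(1 for r in cluster_rows if r[j] == a)
            in_data = sum(1 for r in all_rows if r[j] == a)
            total += 1.0 - in_cluster / in_data
    return total

# Toy data: (colour, size)
data = [("red", "small"), ("red", "large"), ("blue", "small"),
        ("blue", "large"), ("red", "small")]
cluster = [data[0], data[1], data[4]]
mode = cluster_mode(cluster)

d_simple = simple_matching(("red", "small"), mode)
d_weighted = weighted_dissimilarity(("red", "small"), mode, cluster, data)
```

In this toy case simple matching gives the object a distance of 0 from the mode. The weighted sketch gives a small positive distance, because "small" also occurs outside the cluster and is therefore a weaker indicator of membership than "red".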





References:
[1] Arun K. Pujari, "Data Mining Techniques", Universities Press, 2001.
[2] Daniel Barbara, Julia Couto, Yi Li, "COOLCAT: An entropy-based algorithm
for categorical clustering", Proceedings of the Eleventh International
Conference on Information and Knowledge Management, 2002, 582-589.
[3] Dae-Won Kim, Kwang H. Lee, Doheon Lee, "Fuzzy clustering of categorical
data using centroids", Pattern Recognition Letters 25, Elsevier, 2004,
1263-1271.
[4] George Karypis, Eui-Hong (Sam) Han, Vipin Kumar, "CHAMELEON:
A hierarchical clustering algorithm using dynamic modeling", IEEE
Computer, 1999.
[5] Jiawei Han, Micheline Kamber, "Data Mining Concepts and Techniques",
Harcourt India Private Limited, 2001.
[6] Ohn Mar San, Van-Nam Huynh, Yoshiteru Nakamori, "An Alternative
Extension of The K-Means algorithm For Clustering Categorical Data",
J. Appl. Math. Comput. Sci, Vol. 14, No. 2, 2004, 241-247.
[7] Pavel Berkhin, "Survey of Clustering Data Mining Techniques", Technical
report, Accrue Software, 2002.
[8] Periklis Andritsos, "Clustering Categorical Data Based on Information
Loss Minimization", EDBT 2004, 123-146.
[9] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, "ROCK: A Robust Clustering
Algorithm for Categorical Attributes", ICDE '99: Proceedings
of the 15th International Conference on Data Engineering, 512, IEEE
Computer Society, Washington, DC, USA, 1999.
[10] Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan, "CACTUS:
Clustering Categorical Data Using Summaries", In Proc. of ACM
SIGKDD, International Conference on Knowledge Discovery and Data
Mining, 1999, San Diego, CA, USA.
[11] www.ics.uci.edu/~mlearn/MLRepository.html
[12] Zengyou He, Xiaofei Xu, Shengchun Deng, "Squeezer: An Efficient
algorithm for clustering categorical data", Journal of Computer Science
and Technology, Volume 17 Issue 5, Editorial Universitaria de Buenos
Aires, 2002.
[13] Zengyou He, Xiaofei Xu, Shengchun Deng, Bin Dong, "K-Histograms:
An Efficient Algorithm for Categorical Data set", www.citebase.org.
[14] Zhexue Huang, "A Fast Clustering Algorithm to Cluster Very Large
Categorical Datasets in Data Mining", In Proc. SIGMOD Workshop on
Research Issues on Data Mining and Knowledge Discovery, 1997.
[15] Zhexue Huang, "Extensions to the K-means algorithm for clustering
Large Data sets with categorical value", Data Mining and Knowledge
Discovery 2, Kluwer Academic Publishers, 1998, 283-304.