Incremental Algorithm to Cluster the Categorical Data with Frequency Based Similarity Measure

Clustering categorical data is more complicated than clustering numerical data because of its special properties. Scalability and memory constraints are the main challenges in clustering large data sets. This paper presents an incremental algorithm for clustering categorical data. The frequencies of attribute values contribute substantially to grouping similar categorical objects, so we propose new similarity measures based on the frequencies of attribute values and their cardinalities. The proposed measures and the algorithm are evaluated on data sets from the UCI data repository. The results show that the proposed method generates better clusters than the existing method.
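
To make the idea of incremental, frequency-based clustering concrete, the following is a minimal Python sketch. It is not the similarity measure or the algorithm proposed in this paper, whose definitions are not given in the abstract. It assumes a simple stand-in measure (the average relative frequency of an object's attribute values within a cluster) and a hypothetical user-chosen `threshold`; the function names `similarity` and `incremental_cluster` are illustrative only.

```python
from collections import Counter


def similarity(obj, cluster):
    """Assumed stand-in measure: the mean relative frequency, within the
    cluster, of each attribute value carried by the object."""
    size = cluster["size"]
    total = sum(cluster["counts"][j][value] / size for j, value in enumerate(obj))
    return total / len(obj)


def incremental_cluster(objects, threshold):
    """Single pass over the data: each object joins its most similar cluster,
    or starts a new cluster when no similarity reaches the threshold."""
    clusters = []  # each cluster keeps only its size and per-attribute value counts
    labels = []
    for obj in objects:
        best, best_sim = None, -1.0
        for idx, cluster in enumerate(clusters):
            s = similarity(obj, cluster)
            if s > best_sim:  # ties keep the earlier cluster
                best, best_sim = idx, s
        if best is None or best_sim < threshold:
            clusters.append({"size": 0, "counts": [Counter() for _ in obj]})
            best = len(clusters) - 1
        # Update the summary of the chosen cluster; raw objects are not stored.
        chosen = clusters[best]
        chosen["size"] += 1
        for j, value in enumerate(obj):
            chosen["counts"][j][value] += 1
        labels.append(best)
    return labels, clusters


if __name__ == "__main__":
    data = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "y")]
    labels, _ = incremental_cluster(data, threshold=0.5)
    print(labels)
```

Because each cluster is summarized by value counts alone, the data can be processed in one pass without holding all objects in memory, which is the property that makes an incremental scheme attractive for large data sets. The result depends on the order of the objects and on the chosen threshold.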



