Categorical Clustering By Converting Associated Information

The lack of an inherent, "natural" dissimilarity measure between objects in a categorical dataset presents special difficulties for clustering analysis. However, each categorical attribute of a given dataset induces a natural probability distribution and therefore carries information in the sense of Shannon. In this paper, we propose a novel method that heuristically converts categorical attributes to numerical values by exploiting this associated information. We conduct an experimental study on a real-life categorical dataset, and the experiment demonstrates the effectiveness of our approach.
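
To make the idea concrete, the following is a minimal sketch, not the paper's exact algorithm: it assumes the conversion maps each categorical value to its Shannon self-information, -log2 p(value), estimated from the attribute's empirical frequencies, and then clusters the resulting numerical matrix with standard k-means. The function name information_encode and the toy dataset are hypothetical and only illustrate the general approach.

    import numpy as np
    from collections import Counter
    from sklearn.cluster import KMeans

    def information_encode(rows):
        """Replace each categorical value with -log2 of its empirical
        probability, estimated per attribute (column).
        `rows` is a list of equal-length tuples of category labels."""
        data = np.array(rows, dtype=object)
        n, m = data.shape
        encoded = np.zeros((n, m))
        for j in range(m):
            counts = Counter(data[:, j])          # frequency of each category in column j
            for i in range(n):
                p = counts[data[i, j]] / n        # empirical probability of the observed value
                encoded[i, j] = -np.log2(p)       # its Shannon self-information
        return encoded

    # Hypothetical toy dataset with two categorical attributes.
    rows = [("red", "small"), ("red", "small"), ("blue", "large"),
            ("blue", "large"), ("green", "small")]
    X = information_encode(rows)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)

Under this assumed encoding, rare categories receive larger numerical values than common ones, so any standard numerical clustering algorithm can be applied to the transformed data.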



