Abstract: Clustering categorical data is more complicated than
clustering numerical data because of the special properties of
categorical attributes. Scalability and memory constraints are the
main challenges in clustering large data sets. This paper presents an
incremental algorithm for clustering categorical data. The frequencies
of attribute values contribute substantially to grouping similar
categorical objects. We propose new similarity measures based on the
frequencies of attribute values and their cardinalities. The proposed
measures and the algorithm are evaluated on data sets from the UCI
data repository. The results indicate that the proposed method
generates better clusters than the existing one.
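
As an illustration of the quantities this abstract refers to, the following minimal sketch counts how often each attribute value occurs and how many distinct values each attribute takes. It is not the paper's similarity measure, which the abstract does not specify. The function name attribute_statistics, the reading of "cardinality" as the number of distinct values per attribute, and the toy records are all assumptions made for this example.

```python
from collections import Counter


def attribute_statistics(records):
    """Return per-attribute value frequencies and cardinalities.

    Hypothetical helper: 'cardinality' is assumed here to mean the
    number of distinct values observed for an attribute.
    """
    n_attrs = len(records[0])
    # One Counter per attribute (column), counting each value's occurrences.
    freqs = [Counter(rec[j] for rec in records) for j in range(n_attrs)]
    # Number of distinct values seen in each attribute.
    cards = [len(f) for f in freqs]
    return freqs, cards


# Toy categorical records (made up for illustration): (colour, size).
records = [("red", "small"), ("red", "large"), ("blue", "small")]
freqs, cards = attribute_statistics(records)
```

Because the counts are kept in per-attribute counters, they can be updated one record at a time, which is the kind of bookkeeping an incremental method would need.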
Abstract: K-Modes is an extension of the K-Means clustering algorithm, developed to cluster categorical data, in which the mean is replaced by the mode. The dissimilarity measure proposed by Huang is the simple matching measure, which counts the attributes on which two objects disagree. The weights of attribute values contribute substantially to clustering; in this paper we therefore propose a new weighted dissimilarity measure for K-Modes, based on the ratio of the frequency of attribute values in the cluster to their frequency in the data set. The new weighted measure is evaluated on data sets obtained from the UCI data repository. The results are compared with those of K-Modes and K-representative and show that the new measure generates clusters with high purity.
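
The building blocks named in this abstract can be sketched as follows: the simple matching dissimilarity, the mode of a cluster, and the ratio of a value's frequency in a cluster to its frequency in the data set. This is only a sketch under stated assumptions. The abstract does not give the formula for the weighted measure, so the code computes the ratio and stops there. The function names and the toy data are hypothetical.

```python
from collections import Counter


def simple_matching(x, y):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(a != b for a, b in zip(x, y))


def cluster_mode(members):
    """Mode of a cluster: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))


def frequency_ratio(value, j, members, dataset):
    """Frequency of `value` for attribute j in the cluster, divided by its
    frequency in the whole data set (hypothetical helper; how the ratio
    enters the weighted measure is not stated in the abstract)."""
    in_cluster = sum(1 for rec in members if rec[j] == value)
    in_dataset = sum(1 for rec in dataset if rec[j] == value)
    return in_cluster / in_dataset if in_dataset else 0.0


# Toy data (made up for illustration): one cluster inside a small data set.
members = [("red", "small"), ("red", "large"), ("blue", "small")]
dataset = members + [("blue", "large"), ("red", "small")]

mode = cluster_mode(members)
distance = simple_matching(("blue", "small"), mode)
ratio = frequency_ratio("red", 0, members, dataset)
```

In this toy example, two of the three "red" records in the data set fall inside the cluster, so the ratio for "red" is 2/3. A value concentrated in one cluster gets a ratio close to 1.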