Clustering Categorical Data Using Hierarchies (CLUCDUH)

Clustering large populations is an important problem, particularly when the data contain noise and clusters of different shapes. A good clustering algorithm should be both efficient and sensitive enough to detect such clusters. As the size of the data grows, time complexity becomes as important as space complexity. Using hierarchies, we developed a new algorithm that splits the data according to the values its attributes take, choosing the dimension for each split so that the database is divided into parts that are as nearly equal in size as possible. At each node we calculate certain descriptive statistics of the data residing there, and by pruning we generate the natural clusters with a complexity of O(n).
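
The splitting step described above can be illustrated with a minimal sketch. The abstract does not specify how balance is measured, which statistics are kept at each node, or how pruning is done, so everything here is an assumption made for illustration: the `imbalance` score (size gap between the largest and smallest group), the function names, and the use of a plain record count as the only node statistic are hypothetical, and pruning is left out entirely.

```python
from collections import Counter, defaultdict


def partition(records, attr):
    """Group records by the value they hold for `attr`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[attr]].append(rec)
    return dict(groups)


def imbalance(records, attr):
    """Assumed balance score: size gap between the largest and smallest group."""
    sizes = Counter(rec[attr] for rec in records).values()
    return max(sizes) - min(sizes)


def choose_split_attribute(records, attrs):
    """Pick the attribute (with at least two values) whose split is most even."""
    candidates = [a for a in attrs if len({rec[a] for rec in records}) > 1]
    if not candidates:
        return None  # nothing left that can divide these records
    return min(candidates, key=lambda a: imbalance(records, a))


def build_tree(records, attrs):
    """Recursively split records; each node stores a simple statistic (count)."""
    node = {"count": len(records), "split_on": None, "children": {}}
    attr = choose_split_attribute(records, attrs)
    if attr is None:
        return node  # leaf: no attribute separates the records any further
    node["split_on"] = attr
    remaining = [a for a in attrs if a != attr]
    for value, group in partition(records, attr).items():
        node["children"][value] = build_tree(group, remaining)
    return node


# Toy categorical data, invented for this example.
records = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "L"},
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "L"},
]
tree = build_tree(records, ["color", "size"])
```

In this toy data, `size` divides the four records 2/2 while `color` divides them 3/1, so the root splits on `size`. A fuller version would attach richer descriptive statistics to each node and prune the tree to obtain clusters, as the abstract describes.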



