A Similarity Measure for Clustering and its Applications

This paper introduces a measure of similarity between two clusterings of the same dataset, whether produced by two different algorithms or by the same algorithm (K-means, for instance, usually produces different results on the same dataset when run with different initializations). We then apply the measure to pairs of clusterings, with special interest in comparing various machine clusterings with a human clustering of the same dataset. The measure can thus be used to identify the best clustering algorithm for a specific problem, where "best" means most similar to the human clustering. Experimental results are presented for text categorization of a Portuguese corpus (using a translation-into-English approach) and for the well-known IRIS benchmark dataset. The significance and other potential applications of the proposed measure are discussed.
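To make the idea of comparing two clusterings concrete, the sketch below scores pairs of labelings with the classical Rand index. This is a stand-in chosen for illustration, not the measure proposed in this paper, which is not defined in this section. The function name `rand_index` and the toy label lists (`human`, `machine_1`, `machine_2`) are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree.

    Illustrative stand-in (classical Rand index), NOT the paper's measure.
    A pair counts as an agreement when both clusterings put the two items
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one item: the clusterings trivially agree
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)


# Hypothetical toy data: one human clustering and two machine clusterings.
human = [0, 0, 1, 1, 2, 2]
machine_1 = [1, 1, 0, 0, 2, 2]  # same partition as human, labels permuted
machine_2 = [0, 0, 0, 1, 2, 2]  # one item placed in a different cluster

scores = {
    "machine_1": rand_index(human, machine_1),
    "machine_2": rand_index(human, machine_2),
}
best = max(scores, key=scores.get)  # clustering most similar to human
```

Under these assumptions, `machine_1` would be selected as most similar to the human clustering, because a pair-based score ignores how the cluster labels themselves are named.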



