A Similarity Measure for Clustering and its Applications
This paper introduces a measure of similarity between
two clusterings of the same dataset produced by two different
algorithms, or even by the same algorithm (K-means, for instance,
usually produces different results on the same dataset when it is
started from different initializations). We then apply the measure to
calculate the similarity between pairs of clusterings, with special
interest in comparing various machine clusterings with
human clusterings of datasets. The similarity measure can thus be used
to identify the best clustering algorithm for a specific problem at
hand, where "best" means most similar to the human clustering.
Experimental results on text categorization of a Portuguese corpus
(wherein a translation-into-English approach is used) are presented,
as well as results on the well-known benchmark IRIS dataset. The
significance and other potential applications of the proposed measure
are discussed.
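The abstract does not state the formula of the proposed measure, so the following is only a minimal sketch of how two clusterings of the same items might be compared, not the authors' definition. It assumes a stand-in score: the Jaccard overlap of every pair of clusters (one from each clustering), summed and divided by the larger number of clusters. The function names and the toy label lists are hypothetical.

```python
from itertools import product


def clusters_from_labels(labels):
    """Group item indices by their cluster label."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index)
    return list(groups.values())


def clustering_similarity(labels_a, labels_b):
    """Stand-in score (an assumption, not the paper's formula):
    sum of pairwise Jaccard overlaps divided by the larger cluster count."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    if not labels_a:
        raise ValueError("clusterings must not be empty")
    clusters_a = clusters_from_labels(labels_a)
    clusters_b = clusters_from_labels(labels_b)
    total = sum(
        len(a & b) / len(a | b) for a, b in product(clusters_a, clusters_b)
    )
    return total / max(len(clusters_a), len(clusters_b))


# Hypothetical labelings of six items: one human, two machine clusterings.
human = ["a", "a", "b", "b", "c", "c"]
machine_1 = [0, 0, 1, 1, 2, 2]  # same partition, different label names
machine_2 = [0, 0, 0, 1, 1, 1]  # two clusters instead of three

print(clustering_similarity(human, machine_1))            # 1.0
print(round(clustering_similarity(human, machine_2), 4))  # 0.6111
```

Under this stand-in score, identical partitions score 1.0 regardless of how the clusters are named, so the algorithm whose output scores highest against the human labeling would be the one selected.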
@article{"International Journal of Information, Control and Computer Sciences:58383",
  author = "Guadalupe J. Torres and Ram B. Basnet and Andrew H. Sung and Srinivas Mukkamala and Bernardete M. Ribeiro",
  title = "A Similarity Measure for Clustering and its Applications",
  abstract = "This paper introduces a measure of similarity between two clusterings of the same dataset produced by two different algorithms, or even by the same algorithm (K-means, for instance, usually produces different results on the same dataset when it is started from different initializations). We then apply the measure to calculate the similarity between pairs of clusterings, with special interest in comparing various machine clusterings with human clusterings of datasets. The similarity measure can thus be used to identify the best clustering algorithm for a specific problem at hand, where 'best' means most similar to the human clustering. Experimental results on text categorization of a Portuguese corpus (wherein a translation-into-English approach is used) are presented, as well as results on the well-known benchmark IRIS dataset. The significance and other potential applications of the proposed measure are discussed.",
  keywords = "Clustering Algorithms, Clustering Applications, Similarity Measures, Text Clustering",
  volume = "2",
  number = "5",
  pages = "1587-7",
}