Using Suffix Tree Document Representation in Hierarchical Agglomerative Clustering

Scholarly

Volume: 5, Issue: 11, 2011, Page No: 1315-1320

International Journal of Information, Control and Computer Sciences

ISSN: 2517-9942

DOI: 10.5281/zenodo.1334383

In text categorization, the most widely used document representation is the Vector Space Model (VSM), which is based on word-frequency vectors. Because this representation relies only on the words that occur in a document, it loses any “word context” information found in the document. In this article we compare this classical representation with the Suffix Tree Document Model (STDM), which represents documents in suffix tree format. For the STDM we propose a new approach to document representation and a new formula for computing the similarity between two documents. Specifically, we propose building the suffix tree for only two documents at a time. This approach is faster, consumes less memory, and uses the entire document representation without requiring any method for discarding nodes. The proposed similarity formula substantially improves clustering quality. The representation was validated using Hierarchical Agglomerative Clustering (HAC). In this context we also examine the influence of stemming in the document preprocessing step and highlight the difference between similarity and dissimilarity measures for finding “closer” documents.
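
As an illustration of the “word context” loss mentioned above, the following minimal Python sketch compares two sentences that contain the same words in a different order. It is not the similarity formula proposed in the article, which the abstract does not state. The function names (`cosine_vsm`, `shared_bigram_ratio`), the whitespace tokenizer, and the word-bigram overlap used as a simple order-sensitive stand-in for shared suffix-tree phrases are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def cosine_vsm(doc_a, doc_b):
    """Cosine similarity between raw word-frequency vectors (VSM)."""
    freq_a, freq_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    # Only words present in both documents contribute to the dot product.
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = sqrt(sum(c * c for c in freq_a.values()))
    norm_b = sqrt(sum(c * c for c in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_bigrams(text):
    """Set of adjacent word pairs, which preserves local word order."""
    tokens = tokenize(text)
    return set(zip(tokens, tokens[1:]))


def shared_bigram_ratio(doc_a, doc_b):
    """Jaccard overlap of word bigrams.

    Hypothetical stand-in for an order-sensitive measure; it is NOT the
    suffix-tree similarity formula from the article.
    """
    a, b = word_bigrams(doc_a), word_bigrams(doc_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_1 = "the dog bit the man"
doc_2 = "the man bit the dog"

vsm_score = cosine_vsm(doc_1, doc_2)
phrase_score = shared_bigram_ratio(doc_1, doc_2)
print(vsm_score, phrase_score)
```

Under these assumptions the word-frequency vectors of the two sentences are identical, so the VSM cosine similarity is about 1.0, while the order-sensitive overlap is lower because only three of the five distinct word pairs are shared.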
