Using Suffix Tree Document Representation in Hierarchical Agglomerative Clustering
In text categorization, the most widely used method for document
representation is the Vector Space Model (VSM), which is based on word
frequency vectors. This representation relies only on the words in a
document and therefore loses any "word context" information found in
the document. In this article we compare the classical method of
document representation with the Suffix Tree Document Model (STDM),
which represents documents in suffix tree format. For the STDM we
propose a new approach to document representation and a new formula
for computing the similarity between two documents. Specifically, we
propose building the suffix tree for only two documents at a time.
This approach is faster, has lower memory consumption, and uses the
entire document representation without resorting to methods for
discarding nodes. The proposed similarity formula substantially
improves the clustering quality. The representation method was
validated using Hierarchical Agglomerative Clustering (HAC). In this
context we also examine the influence of stemming in the document
preprocessing step and highlight the difference between using
similarity and dissimilarity measures to find "closer" documents.
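
The pairwise idea in the abstract can be illustrated with a small Python
sketch. It is not the authors' implementation: the abstract does not state
their similarity formula, so the Jaccard-style node overlap used here is an
assumption, an uncompressed word-level suffix trie stands in for a real
suffix tree, and average linkage is an arbitrary choice. The names
build_pair_trie, pair_similarity and hac are hypothetical.

```python
from itertools import combinations


def build_pair_trie(words_a, words_b):
    """Word-level suffix trie built for exactly two documents.

    Each node is {"docs": ids of documents passing through it,
                  "children": {word: node}}.
    """
    root = {"docs": set(), "children": {}}
    for doc_id, words in ((0, words_a), (1, words_b)):
        for start in range(len(words)):
            node = root
            # Insert the suffix that begins at position `start`.
            for word in words[start:]:
                node = node["children"].setdefault(
                    word, {"docs": set(), "children": {}}
                )
                node["docs"].add(doc_id)
    return root


def pair_similarity(text_a, text_b):
    """Assumed measure: nodes visited by both documents / all nodes."""
    root = build_pair_trie(text_a.lower().split(), text_b.lower().split())
    total = shared = 0
    stack = list(root["children"].values())
    while stack:
        node = stack.pop()
        total += 1
        if len(node["docs"]) == 2:
            shared += 1
        stack.extend(node["children"].values())
    return shared / total if total else 0.0


def hac(docs, n_clusters):
    """Naive average-link agglomerative clustering over pairwise similarities."""
    sim = {
        (i, j): pair_similarity(docs[i], docs[j])
        for i, j in combinations(range(len(docs)), 2)
    }

    def cluster_sim(c1, c2):
        values = [sim[min(i, j), max(i, j)] for i in c1 for j in c2]
        return sum(values) / len(values)

    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > n_clusters:
        # Merge the two most similar clusters (a < b by construction).
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


docs = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock markets fell sharply today",
    "stock markets rose sharply today",
]
print(hac(docs, 2))
```

Because the trie stores word sequences rather than isolated words, two
documents with the same words in a different order score lower than two
identical documents, whereas a bag-of-words VSM would give them identical
vectors. Building one trie per document pair keeps each structure small,
at the cost of rebuilding it for every pair.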
[1] Chakrabarti, S., Mining the Web: Discovering Knowledge from
Hypertext Data, Morgan Kaufmann Press, 2003.
[2] Kaufman, L. and Rousseeuw, P.J. Finding Groups in Data: An
Introduction to Cluster Analysis, Wiley-Interscience, New York (Series
in Applied Probability and Statistics), 1990
[3] Manning, C., Raghavan, P., Schütze, H. Introduction to Information
Retrieval, Cambridge University Press, ISBN 978-0-521-86571, 2008
[4] Meyer, S., Stein, B., Potthast, M., The Suffix Tree Document Model
Revisited, Proceedings of the I-KNOW 05, 5th International Conference
on Knowledge Management, Journal of Universal Computer Science,
pp.596-603, Graz, 2005
[5] http://feeds.bbci.co.uk/news/rss.xml
[6] http://www.reuters.com/tools/rss
[7] Salton, G., Wong, A., Yang, C. S., A vector space model for information
retrieval. Communications of the ACM, 18(11), 1975.
[8] Janruang, J., Guha, S., Semantic Suffix Tree Clustering, In Proceedings
of 2011 International Conference on Data Engineering and Internet
Technology (DEIT 2011), Bali, Indonesia, 2011.
[9] Morariu, D., Text Mining Methods based on Support Vector Machine,
MatrixRom, Bucharest, 2008.
@article{Morariu:56846,
  author = {Daniel I. Morariu and Radu G. Cretulescu and Lucian N. Vintan},
  title = {Using Suffix Tree Document Representation in Hierarchical Agglomerative Clustering},
  journal = {International Journal of Information, Control and Computer Sciences},
  abstract = {In text categorization, the most widely used method for document
representation is the Vector Space Model (VSM), which is based on word
frequency vectors. This representation relies only on the words in a
document and therefore loses any "word context" information found in
the document. In this article we compare the classical method of
document representation with the Suffix Tree Document Model (STDM),
which represents documents in suffix tree format. For the STDM we
propose a new approach to document representation and a new formula
for computing the similarity between two documents. Specifically, we
propose building the suffix tree for only two documents at a time.
This approach is faster, has lower memory consumption, and uses the
entire document representation without resorting to methods for
discarding nodes. The proposed similarity formula substantially
improves the clustering quality. The representation method was
validated using Hierarchical Agglomerative Clustering (HAC). In this
context we also examine the influence of stemming in the document
preprocessing step and highlight the difference between using
similarity and dissimilarity measures to find "closer" documents.},
  keywords = {Text Clustering, Suffix tree document representation, Hierarchical Agglomerative Clustering},
  volume = {5},
  number = {11},
  pages = {1315-6},
}