Graph-Based Text Similarity Measurement by Exploiting Wikipedia as Background Knowledge

Text similarity measurement is fundamental to many text applications, such as document clustering, classification, summarization, and question answering. However, prevailing approaches based on the Vector Space Model (VSM) suffer, to varying degrees, from the limitation of the Bag of Words (BOW) representation, which ignores semantic relationships among words. Enriching document representations with background knowledge from Wikipedia has proven to be an effective remedy, but most existing methods still inherit BOW-like flaws in the new vector space. In this paper, we propose a novel text similarity measure that goes beyond VSM and can capture semantic affinity between documents. Specifically, it is a unified graph model that exploits Wikipedia as background knowledge and integrates document representation and similarity computation. Experimental results on two different datasets show that our approach significantly outperforms VSM-based methods in both text clustering and classification.
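To make the BOW limitation concrete, the following is a minimal sketch of plain term-frequency cosine similarity, the kind of VSM baseline the abstract refers to. It is an illustration only, not the graph model proposed in this paper; the function name `bow_cosine`, the whitespace tokenization, and the example sentences are assumptions introduced here.

```python
import math
from collections import Counter


def bow_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between raw term-frequency (BOW) vectors.

    Hypothetical helper for illustration: tokens are lowercased,
    whitespace-separated words with no stemming or weighting.
    """
    vec_a = Counter(doc_a.lower().split())
    vec_b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Semantically related documents that share no surface terms.
related = bow_cosine("physician prescribes medication", "doctor gives drugs")
# Documents that share two of three surface terms.
overlap = bow_cosine("doctor gives drugs", "doctor gives advice")
```

Under these assumptions, the first pair scores zero because the two documents share no terms, even though they describe the same event, while the second pair scores well above zero purely on lexical overlap. This is the gap that background knowledge is meant to close.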



