A Hybrid Ontology Based Approach for Ranking Documents

The rapid growth in the volume of information on the Internet creates an increasing need for new (semi-)automatic methods that retrieve documents and rank them by their relevance to the user's query. In this paper, after a brief review of ranking models, a new ontology-based approach for ranking HTML documents is proposed and evaluated under various conditions. Our approach combines conceptual, statistical and linguistic methods. This combination preserves ranking precision without sacrificing speed. The approach uses natural language processing techniques to extract phrases from the documents and the query and to stem their words. An ontology-based conceptual method is then used to annotate the documents and expand the query. To expand a query, the spreading activation algorithm is improved so that the expansion is flexible and can follow several kinds of conceptual relationships. The annotated documents and the expanded query are then processed with statistical methods to compute the degree of relevance. The distinguishing features of our approach are (1) combining conceptual, statistical and linguistic features of documents, (2) expanding the query with its related concepts before comparing it to documents, (3) extracting and using both words and phrases to compute the relevance degree, (4) improving the spreading activation algorithm so that expansion is based on a weighted combination of different conceptual relationships, and (5) allowing document vectors of variable dimension. A ranking system called ORank has been developed to implement and test the proposed model. Test results are reported at the end of the paper.
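
The query expansion and relevance steps described above can be illustrated with a minimal sketch. It is not ORank's implementation: the toy ontology, the relation weights, the `decay` and `threshold` parameters, and the function names are all hypothetical. Cosine similarity is also an assumption, since the abstract says only that statistical methods are used. Sparse dictionaries stand in for document vectors of variable dimension.

```python
from math import sqrt

# Hypothetical toy ontology: concept -> list of (related concept, relation type).
ONTOLOGY = {
    "car": [("vehicle", "is_a"), ("engine", "part_of")],
    "vehicle": [("transport", "is_a")],
    "engine": [],
    "transport": [],
}

# Assumed per-relation weights; the paper's actual weighting scheme is not given here.
RELATION_WEIGHTS = {"is_a": 0.8, "part_of": 0.5}


def expand_query(seeds, ontology, relation_weights, decay=0.5, threshold=0.1):
    """Spread activation from the seed concepts through the ontology.

    Each hop multiplies the activation by the relation weight and by `decay`.
    Propagation along a path stops once the value falls below `threshold`.
    """
    activation = dict(seeds)
    frontier = list(seeds.items())
    while frontier:
        concept, value = frontier.pop()
        for neighbour, relation in ontology.get(concept, []):
            spread = value * relation_weights.get(relation, 0.0) * decay
            # Keep only the strongest activation that reaches each concept.
            if spread >= threshold and spread > activation.get(neighbour, 0.0):
                activation[neighbour] = spread
                frontier.append((neighbour, spread))
    return activation


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vector, documents):
    """Return document names sorted by decreasing similarity to the query."""
    return sorted(
        documents,
        key=lambda name: cosine(query_vector, documents[name]),
        reverse=True,
    )


if __name__ == "__main__":
    expanded = expand_query({"car": 1.0}, ONTOLOGY, RELATION_WEIGHTS)
    docs = {
        "d1": {"vehicle": 2.0, "transport": 1.0},  # related concepts only
        "d2": {"banana": 3.0},                     # unrelated
    }
    print(expanded)
    print(rank(expanded, docs))
```

In this sketch the document `d1` never mentions "car", yet it ranks above `d2` because the expanded query carries activation for "vehicle" and "transport". That is the effect query expansion is meant to achieve before the statistical comparison.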



