ORank: An Ontology Based System for Ranking Documents

The rapid growth of information on the Internet creates an increasing need for new (semi-)automatic methods that retrieve documents and rank them by their relevance to the user's query. In this paper, after a brief review of ranking models, we propose a new ontology-based approach for ranking HTML documents and evaluate it under various conditions. Our approach combines conceptual, statistical, and linguistic methods. This combination preserves the precision of ranking without losing speed. The approach uses natural language processing techniques to extract phrases and stem words. An ontology-based conceptual method is then used to annotate documents and expand the query. To expand a query, the spreading activation algorithm is improved so that the expansion can be performed along various aspects. The annotated documents and the expanded query are then processed with statistical methods to compute the degree of relevance. The outstanding features of our approach are (1) combining conceptual, statistical, and linguistic features of documents, (2) expanding the query with its related concepts before comparing it to documents, (3) extracting and using both words and phrases to compute the degree of relevance, (4) improving the spreading activation algorithm so that expansion is based on a weighted combination of different conceptual relationships, and (5) allowing variable document vector dimensions. A ranking system called ORank has been developed to implement and test the proposed model. The test results are presented at the end of the paper.
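The abstract gives no formulas, so the following is only a minimal sketch of the general idea: query expansion by spreading activation over weighted relationship types, followed by a statistical comparison with sparse document vectors. The toy ontology, the relation weights, the decay and threshold values, and the use of cosine similarity are all assumptions made for illustration; they are not taken from ORank itself.

```python
import math

# Hypothetical weights: how much activation each relationship type passes on.
RELATION_WEIGHTS = {"subclass_of": 0.8, "part_of": 0.6, "related_to": 0.4}

# Toy ontology (illustrative only): concept -> list of (relation, neighbor).
# Edges are directed, so "engine" is not reachable from "car" here.
ONTOLOGY = {
    "car": [("subclass_of", "vehicle"), ("related_to", "road")],
    "engine": [("part_of", "car")],
    "vehicle": [("related_to", "transport")],
}


def expand_query(query_terms, ontology, weights,
                 decay=0.5, threshold=0.1, max_hops=2):
    """Spread activation from the query concepts.

    Returns a dict mapping each reached concept to its activation level.
    Activation passed along an edge is scaled by the relation weight and a
    decay factor, and is dropped once it falls below the threshold.
    """
    activation = {term: 1.0 for term in query_terms}
    frontier = dict(activation)
    for _ in range(max_hops):
        next_frontier = {}
        for concept, value in frontier.items():
            for relation, neighbor in ontology.get(concept, []):
                passed = value * weights.get(relation, 0.0) * decay
                if passed < threshold:
                    continue
                # Keep the strongest activation seen for each concept.
                if passed > activation.get(neighbor, 0.0):
                    activation[neighbor] = passed
                    next_frontier[neighbor] = passed
        frontier = next_frontier
    return activation


def cosine(query_vec, doc_vec):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(query_vec) & set(doc_vec)
    dot = sum(query_vec[k] * doc_vec[k] for k in shared)
    norm_q = math.sqrt(sum(v * v for v in query_vec.values()))
    norm_d = math.sqrt(sum(v * v for v in doc_vec.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


def rank(query_terms, documents):
    """Rank documents (name -> sparse term-weight dict) against the query."""
    expanded = expand_query(query_terms, ONTOLOGY, RELATION_WEIGHTS)
    scores = {name: cosine(expanded, vec) for name, vec in documents.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "d1": {"vehicle": 2.0, "transport": 1.0},  # never mentions "car"
        "d2": {"banana": 3.0},
    }
    for name, score in rank(["car"], docs):
        print(name, round(score, 3))
```

In this sketch the document `d1` never mentions "car" but still receives a non-zero score, because the expanded query also carries the related concept "vehicle". Storing each document as a dictionary lets the vector dimensions vary from one document to the next.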



