Language and Retrieval Accuracy

One of the major challenges in information retrieval is handling the massive amount of information available to Internet users. Existing ranking techniques and strategies that govern the retrieval process fall short of the expected accuracy: relevant documents are often buried deep in the list returned by the search engine. To improve retrieval accuracy, we examine the effect of language on the retrieval process. We then propose a solution that biases relevance toward the individual user, giving a more user-centric ranking of retrieved data. The results demonstrate that using indices based on variations of the same language enhances the accuracy of search engines for individual users.



