Word Stemming Algorithms and Retrieval Effectiveness in Malay and Arabic Documents Retrieval Systems
Document retrieval in Information Retrieval
Systems (IRS) is generally about understanding the
information in the documents concerned. The better a system
understands the contents of documents, the more
effective its retrieval outcomes will be. Understanding document
contents, however, is a very complex task. Conventional IRS apply algorithms
that can only approximate the meaning of document contents through
a keyword approach based on the vector space model. Keywords may be
unstemmed or stemmed. When keywords are stemmed and conflated
during retrieval, we move a step closer to applying semantic
technology in IRS. Word stemming is a process in morphological
analysis, the stage of natural language processing that precedes syntactic and
semantic analysis. We have developed stemming algorithms for Malay and
Arabic and incorporated them into our experimental systems in
order to measure retrieval effectiveness. The results show that
retrieval effectiveness increases when stemming is used in
the systems.
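
As a rough illustration of how stemming and conflation can affect matching in a vector space model, the sketch below compares cosine similarity with and without stemming. It is not the algorithm developed in this work: the root list, the affix lists, and the function names (`stem`, `vectorize`, `cosine`) are all hypothetical and chosen only to keep the example small.

```python
import math
from collections import Counter

# Hypothetical root-word list and affixes, for illustration only.
ROOTS = {"makan", "minum", "baca"}
PREFIXES = ("di",)
SUFFIXES = ("kan", "an", "i")


def stem(word):
    """Strip at most one prefix and one suffix; accept only known roots."""
    word = word.lower()
    if word in ROOTS:
        return word
    # Candidates after optional prefix removal.
    candidates = [word]
    candidates += [word[len(p):] for p in PREFIXES if word.startswith(p)]
    # Candidates after optional suffix removal on each form so far.
    for c in list(candidates):
        candidates += [c[:-len(s)] for s in SUFFIXES if c.endswith(s)]
    for c in candidates:
        if c in ROOTS:
            return c
    # No known root found: leave the word unchanged.
    return word


def vectorize(text, use_stemming):
    """Build a term-frequency vector, optionally conflating words to stems."""
    tokens = text.lower().split()
    if use_stemming:
        tokens = [stem(t) for t in tokens]
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


query = "makanan"
document = "dimakan makan"
unstemmed = cosine(vectorize(query, False), vectorize(document, False))
stemmed = cosine(vectorize(query, True), vectorize(document, True))
print(unstemmed, stemmed)
```

In this toy case the query and the document share no surface form, so the unstemmed similarity is zero, while conflating all three word forms to the root `makan` makes them match. Checking candidates against a root list is one simple way to avoid over-stripping a root such as `makan`, which itself ends in `an`.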
@article{Sembok:59319,
  author = "Tengku Mohd T. Sembok",
  title = "Word Stemming Algorithms and Retrieval Effectiveness in Malay and Arabic Documents Retrieval Systems",
  journal = "International Journal of Information, Control and Computer Sciences",
  abstract = "Document retrieval in Information Retrieval
Systems (IRS) is generally about understanding the
information in the documents concerned. The better a system
understands the contents of documents, the more
effective its retrieval outcomes will be. Understanding document
contents, however, is a very complex task. Conventional IRS apply algorithms
that can only approximate the meaning of document contents through
a keyword approach based on the vector space model. Keywords may be
unstemmed or stemmed. When keywords are stemmed and conflated
during retrieval, we move a step closer to applying semantic
technology in IRS. Word stemming is a process in morphological
analysis, the stage of natural language processing that precedes syntactic and
semantic analysis. We have developed stemming algorithms for Malay and
Arabic and incorporated them into our experimental systems in
order to measure retrieval effectiveness. The results show that
retrieval effectiveness increases when stemming is used in
the systems.",
  keywords = "Information Retrieval, Natural Language Processing, Artificial Intelligence",
  volume = "1",
  number = "10",
  pages = "3197-3",
}