Automatic Building of an Extensive Arabic FA Terms Dictionary

Field Association (FA) terms are a limited set of discriminating terms that identify document fields and are effective in document classification, similar file retrieval and passage retrieval. However, no effective method exists for automatically extracting relevant Arabic FA terms to build a comprehensive dictionary. Moreover, all previous studies have addressed FA terms in English and Japanese, and extending FA terms to other languages such as Arabic could strengthen further research. This paper presents a new method to extract Arabic FA terms from domain-specific corpora using part-of-speech (POS) pattern rules and corpora comparison. Experimental evaluation is carried out for 14 different fields using 251 MB of domain-specific corpora obtained from Arabic Wikipedia dumps and Alhyah news; the method selected an average of 2,825 FA terms (single and compound) per field. In the experiments, recall and precision are 84% and 79%, respectively. The method therefore selects a large number of relevant Arabic FA terms with high precision and recall.
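The two ingredients named above, POS pattern rules and corpora comparison, can be illustrated with a minimal sketch. It is not the paper's implementation: the tag names, the three patterns, the relative-frequency ratio, the add-one smoothing and the `min_ratio` threshold are all assumptions chosen for illustration, and the input is assumed to be already tokenized and POS-tagged by an external tagger. The last function shows how precision and recall would be computed against a manually judged set of relevant terms.

```python
from collections import Counter

# Hypothetical POS patterns for candidate terms (not taken from the paper):
# a single noun, noun + noun, or noun + adjective.
PATTERNS = {("NOUN",), ("NOUN", "NOUN"), ("NOUN", "ADJ")}


def candidates(tagged):
    """Yield every 1- or 2-token span whose POS tag sequence matches a pattern."""
    for size in (1, 2):
        for i in range(len(tagged) - size + 1):
            window = tagged[i:i + size]
            if tuple(tag for _, tag in window) in PATTERNS:
                yield " ".join(word for word, _ in window)


def select_terms(field_docs, other_docs, min_ratio=1.5):
    """Keep candidates that are relatively more frequent in the field corpus.

    The ratio test and threshold are illustrative assumptions, standing in
    for whatever corpora-comparison statistic is actually used.
    """
    field = Counter(t for doc in field_docs for t in candidates(doc))
    other = Counter(t for doc in other_docs for t in candidates(doc))
    field_total = sum(field.values()) or 1
    other_total = sum(other.values()) or 1
    selected = set()
    for term, count in field.items():
        field_rate = count / field_total
        # Add-one smoothing so terms unseen elsewhere do not divide by zero.
        other_rate = (other[term] + 1) / (other_total + 1)
        if field_rate / other_rate >= min_ratio:
            selected.add(term)
    return selected


def precision_recall(selected, relevant):
    """Precision and recall of the selected terms against a judged set."""
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Each document is a list of `(word, tag)` pairs; in practice these would come from an Arabic POS tagger, and the comparison corpus would be the union of the other fields.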




