Estimating Word Translation Probabilities for Thai – English Machine Translation using EM Algorithm

Selecting, from a set of candidate target-language words, the translation that conveys the correct sense of the source word and yields fluent target-language output is one of the core problems in machine translation. In this paper we compare three methods of estimating word translation probabilities for selecting the translation word in Thai – English machine translation: (1) a method based on the frequency of word translations, (2) a method based on the collocation of word translations, and (3) a method based on the Expectation Maximization (EM) algorithm. For evaluation we used Thai – English parallel sentences generated by NECTEC. The EM-based method performs best of the three and gives satisfactory results.
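To make the third method concrete, the following is a minimal sketch of EM estimation of word translation probabilities in the style of IBM Model 1. It is an illustration under stated assumptions, not the formulation evaluated in this paper, which the abstract does not spell out. The function name em_translation_probs, the uniform initialisation, the fixed iteration count, and the romanized toy sentence pairs are all hypothetical.

```python
from collections import defaultdict


def em_translation_probs(pairs, iterations=10):
    """Estimate t[source][target] from tokenised sentence pairs.

    Assumption: an IBM Model 1-style model in which every target word is
    generated by exactly one source word of the paired sentence.
    """
    src_vocab = {s for src, _ in pairs for s in src}
    tgt_vocab = {w for _, tgt in pairs for w in tgt}

    # Uniform initialisation: every target word is equally likely.
    t = {s: {w: 1.0 / len(tgt_vocab) for w in tgt_vocab} for s in src_vocab}

    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)

        # E-step: collect expected (fractional) co-occurrence counts.
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[s][w] for s in src)
                for s in src:
                    frac = t[s][w] / norm
                    count[s][w] += frac
                    total[s] += frac

        # M-step: renormalise the expected counts into probabilities.
        for s in src_vocab:
            if total[s] > 0:
                for w in tgt_vocab:
                    t[s][w] = count[s][w] / total[s]
    return t


# Hypothetical toy data (romanized Thai tokens, not taken from the NECTEC corpus).
pairs = [
    (["baan", "lek"], ["small", "house"]),
    (["baan", "yai"], ["big", "house"]),
    (["nangsue", "lek"], ["small", "book"]),
]

probs = em_translation_probs(pairs)
for source_word in sorted(probs):
    best = max(probs[source_word], key=probs[source_word].get)
    print(source_word, "->", best)
```

Choosing the translation word then amounts to taking, for each source word, the target word with the highest estimated probability. A frequency-based baseline would instead use raw co-occurrence counts without the iterative re-estimation.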



