Performance Analysis of MT Evaluation Measures and Test Suites

Many measures have been proposed for machine translation evaluation (MTE), yet little research has examined the performance of the MTE methods themselves. This paper is an effort toward such an analysis. A general framework is proposed for describing an MTE measure and its test suite, covering questions such as whether the automatic measure is consistent with human evaluation, whether results from different measures or test suites agree with one another, whether the content of the test suite is suitable for performance evaluation, how the difficulty of the test suite influences the MTE result, and how the significance of an MTE result relates to the size of the test suite. To clarify the framework, several experimental results are analyzed, relating human evaluation, BLEU evaluation, and typological MTE. A visualization method is introduced to present the results more clearly. The study aims to aid test suite construction and method selection in MTE practice.
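As a rough illustration of two of the questions listed above, consistency of an automatic measure with human evaluation and the dependence of result significance on test-suite size, the following Python sketch may help. It is not taken from the paper; every score in it is an invented placeholder, and the "automatic" scores merely stand in for a BLEU-like measure.

"""
Illustrative sketch (not the paper's code): checking whether an automatic
MT evaluation measure agrees with human judgements, and how the size of
the test suite affects the stability of the comparison.
All data below are made-up placeholders.
"""
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors (no tie handling)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, idx in enumerate(order, start=1):
            r[idx] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))


# Hypothetical system-level scores for five MT systems on one test suite:
# a human adequacy score and an automatic score (BLEU-like) per system.
human = [3.1, 2.4, 3.8, 2.9, 3.5]
automatic = [0.21, 0.17, 0.30, 0.15, 0.26]

print("system-level rank correlation:", round(spearman(human, automatic), 3))

# Stability of a segment-level metric as the test suite grows: resample
# suites of increasing size and watch the spread of the mean score shrink.
random.seed(0)
segment_scores = [random.gauss(0.25, 0.10) for _ in range(2000)]  # placeholder scores

for size in (50, 200, 800):
    means = [statistics.mean(random.sample(segment_scores, size)) for _ in range(1000)]
    print(f"suite size {size:4d}: spread of resampled means (stdev) = {statistics.stdev(means):.4f}")

The shrinking spread of the resampled means in the last loop is the intuition behind relating the significance of an MTE result to the size of the test suite: a small suite leaves the measure's score too unstable to separate systems reliably.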



