Determining the Gender of Korean Names for Pronoun Generation

Correctly classifying the gender of names is an important task in Korean-English machine translation. When a sentence consists of two or more clauses and its only explicit subject is a proper noun, the gender of that proper noun must be determined to translate the sentence correctly, because a singular pronoun is marked for gender in English but not in Korean. More generally, the task can be extended to gender classification of Korean names in general. This paper proposes a statistical method for this problem. By treating a name simply as a sequence of syllables, statistics for each name can be derived from a collection of names. In an evaluation, the proposed method improves accuracy over simple lookup in the collection: the lookup method achieves 64.11% accuracy, while the proposed method achieves 81.49%. This suggests that the proposed method is better suited to gender classification of Korean names.
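The contrast between looking a name up in a collection and scoring it as a sequence of syllables can be illustrated with a minimal sketch. It is not the model evaluated in this paper: it assumes a position-aware naive Bayes scorer with add-one smoothing, and the romanized names, the `TRAIN` list, and the `SyllableClassifier` class are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy collection: given names split into romanized syllables,
# each paired with a gender label ("M" or "F").
TRAIN = [
    (("min", "su"), "M"),
    (("jun", "ho"), "M"),
    (("sung", "ho"), "M"),
    (("ji", "hye"), "F"),
    (("eun", "ji"), "F"),
    (("su", "jin"), "F"),
]


def lookup_baseline(name, train):
    """Return the label stored for an exact name match, or None if unseen."""
    return dict(train).get(name)


class SyllableClassifier:
    """Naive Bayes over (position, syllable) features with add-one smoothing.

    This is an assumed stand-in for a syllable-based statistical model,
    not the method described in the paper.
    """

    def __init__(self, train):
        self.class_counts = Counter()
        self.syll_counts = defaultdict(Counter)
        self.vocab = set()
        for sylls, gender in train:
            self.class_counts[gender] += 1
            for pos, syll in enumerate(sylls):
                self.syll_counts[gender][(pos, syll)] += 1
                self.vocab.add((pos, syll))

    def log_score(self, sylls, gender):
        # Log prior of the class plus smoothed log likelihood of each syllable.
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[gender] / total)
        # The extra 1 in the denominator reserves mass for unseen features.
        denom = sum(self.syll_counts[gender].values()) + len(self.vocab) + 1
        for pos, syll in enumerate(sylls):
            count = self.syll_counts[gender][(pos, syll)]
            score += math.log((count + 1) / denom)
        return score

    def predict(self, sylls):
        return max(self.class_counts, key=lambda g: self.log_score(sylls, g))


if __name__ == "__main__":
    clf = SyllableClassifier(TRAIN)
    unseen = ("min", "ho")  # not in TRAIN as a whole name
    print(lookup_baseline(unseen, TRAIN))  # lookup has no answer for it
    print(clf.predict(unseen))             # syllable statistics still apply
```

The point of the sketch is that a lookup table can only answer for names it has already seen, whereas syllable-level statistics generalize to unseen combinations of familiar syllables.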



