Distributional Semantics Approach to Thai Word Sense Disambiguation

Word sense disambiguation is one of the most important open problems in natural language processing applications such as information retrieval and machine translation. Several strategies can resolve word ambiguity with a reasonable degree of accuracy: knowledge-based, corpus-based, and hybrid. This paper focuses on the corpus-based strategy, using an unsupervised learning method for disambiguation. We investigate the application of Latent Semantic Indexing (LSI), an unsupervised technique from information retrieval, to the task of disambiguating Thai nouns and verbs. LSI has been shown to be efficient and effective for information retrieval. We report experiments on two polysemous Thai words, /hua4/ and /kep1/, which serve as representatives of Thai nouns and verbs, respectively. The results demonstrate the effectiveness of vector-based distributional information measures for semantic disambiguation and indicate their potential for this task.
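
As a rough illustration of the vector-based idea, and not the experimental setup used in this work, the following sketch applies a truncated singular value decomposition to a small term-by-context matrix and compares contexts by cosine similarity in the reduced space. The matrix values, the function names `lsi_context_vectors` and `cosine`, and the choice of `k=2` are all assumptions made for illustration only.

```python
import numpy as np

def lsi_context_vectors(term_context, k):
    """Project contexts (matrix columns) into a k-dimensional latent space.

    Uses a truncated SVD, the core operation of LSI. Each row of the
    returned array represents one context.
    """
    _, s, vt = np.linalg.svd(term_context, full_matrices=False)
    # Keep the k largest singular values and scale the context coordinates.
    return vt[:k, :].T * s[:k]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy counts: rows are co-occurring words, columns are
# contexts of one ambiguous target word. Contexts 0-1 share vocabulary,
# and contexts 2-3 share a different vocabulary.
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

contexts = lsi_context_vectors(X, k=2)
same_usage = cosine(contexts[0], contexts[1])   # contexts with shared vocabulary
other_usage = cosine(contexts[0], contexts[2])  # contexts with disjoint vocabulary
```

In this toy setting, contexts that share vocabulary end up close together in the latent space, while contexts with disjoint vocabulary do not. Grouping contexts by such similarities is one way an unsupervised method could separate the senses of a target word.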




