Structural Parsing of Natural Language Text in Tamil Using Phrase Structure Hybrid Language Model

Parsing is central to linguistics and natural language processing because it exposes the syntax and semantics of a natural language. Parsing natural language text is challenging because of problems such as ambiguity and inefficiency, and because the interpretation of text depends on context. A probabilistic component is essential for resolving ambiguity in both syntax and semantics, thereby increasing the accuracy and efficiency of the parser. Tamil has inherent features that make parsing more challenging still. To address them, a lexicalized and statistical approach is applied to parsing with the aid of a language model. Statistical models focus mainly on the semantics of the language and are suited to large-vocabulary tasks, whereas structural methods focus on syntax and are suited to small-vocabulary tasks. A trigram-based statistical language model for Tamil with a medium vocabulary of 5000 words has been built. Although statistical parsing performs better through trigram probabilities and a large vocabulary, it has disadvantages: it focuses on semantics rather than syntax, and it lacks support for free word order and long-term relationships between words. To overcome these disadvantages, a structural component is incorporated into the statistical language model, which leads to a hybrid language model. This paper attempts to build a phrase-structured hybrid language model that resolves the above disadvantages. In developing the hybrid language model, a new part-of-speech tag set for Tamil with more than 500 tags has been developed to give wider coverage. A phrase-structured treebank of 326 Tamil sentences covering more than 5000 words has been developed. The hybrid language model has been trained on this treebank using the immediate-head parsing technique.
A lexicalized and statistical parser that employs this hybrid language model and the immediate-head parsing technique gives better results than the pure grammar-based and trigram-based models.
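
The trigram component and its weakness with free word order can be illustrated with a minimal sketch. This is not the model built in the paper: the add-one smoothing, the function names, and the two-sentence transliterated toy corpus are all assumptions chosen only to keep the example self-contained.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (illustrative choice)

def train_trigram(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = [BOS, BOS] + list(tokens) + [EOS]
        vocab.update(padded[2:])  # real words plus the end marker
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            ctx[tuple(padded[i - 2:i])] += 1
    return tri, ctx, vocab

def trigram_prob(model, w1, w2, w3):
    """P(w3 | w1, w2) with add-one smoothing (an assumption, not the paper's)."""
    tri, ctx, vocab = model
    return (tri[(w1, w2, w3)] + 1) / (ctx[(w1, w2)] + len(vocab))

def sentence_prob(model, tokens):
    """Probability of a whole sentence as a product of trigram probabilities."""
    padded = [BOS, BOS] + list(tokens) + [EOS]
    p = 1.0
    for i in range(2, len(padded)):
        p *= trigram_prob(model, *padded[i - 2:i + 1])
    return p

# Hypothetical toy corpus of transliterated Tamil tokens, for illustration only.
corpus = [
    ["avan", "puththakam", "padiththaan"],
    ["aval", "puththakam", "padiththaal"],
]
model = train_trigram(corpus)

seen = ["avan", "puththakam", "padiththaan"]
reordered = ["puththakam", "avan", "padiththaan"]  # same words, different order

print(sentence_prob(model, seen))
print(sentence_prob(model, reordered))
```

In this sketch the reordered sentence scores lower than the seen one even though it contains the same words, because every trigram in it is unseen. A purely sequential model has no way to recognise the shared structure, which is the gap the structural component of a hybrid model is meant to fill.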



