Structural Parsing of Natural Language Text in Tamil Using Phrase Structure Hybrid Language Model
Parsing is important in linguistics and natural language processing
for understanding the syntax and semantics of a natural language.
Parsing natural language text is challenging because of problems such
as ambiguity and inefficiency, and because the interpretation of text
depends on context. A probabilistic component is essential for
resolving ambiguity in both syntax and semantics, thereby increasing
the accuracy and efficiency of the parser. Tamil has some inherent
features that make parsing more challenging. To address these
problems, a lexicalized and statistical approach is applied to parsing
with the aid of a language model. Statistical models mainly focus on
the semantics of the language and are suitable for large-vocabulary
tasks, whereas structural methods focus on syntax and are suited to
small-vocabulary tasks. A trigram-based statistical language model for
Tamil with a medium vocabulary of 5000 words has been built. Although
statistical parsing performs better through trigram probabilities and
a large vocabulary, it has disadvantages: it focuses on semantics
rather than syntax, and it lacks support for free word order and
long-term relationships between words. To overcome these
disadvantages, a structural component is incorporated into the
statistical language model, which leads to a hybrid language model.
This paper attempts to build a phrase-structured hybrid language model
that resolves the above-mentioned disadvantages. In developing the
hybrid language model, a new part-of-speech tag set for Tamil has been
developed with more than 500 tags, giving wider coverage. A
phrase-structured Treebank has been developed with 326 Tamil sentences
covering more than 5000 words. A hybrid language model has been
trained on the phrase-structured Treebank using the immediate-head
parsing technique. A lexicalized and statistical parser that employs
this hybrid language model and the immediate-head parsing technique
gives better results than pure grammar-based and trigram-based models.
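The trigram component mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (train_trigram, trigram_prob, hybrid_prob), the sentence-boundary markers and the toy corpus are all hypothetical, and the linear interpolation in hybrid_prob is only one assumed way of combining a trigram score with a structural score. The abstract itself describes training the hybrid model with immediate-head parsing, which is not reproduced here.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word histories over tokenised sentences."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        # Two start markers so the first real word also has a two-word history.
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            tri[history + (padded[i],)] += 1
            bi[history] += 1
    return tri, bi


def trigram_prob(tri, bi, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for an unseen history."""
    history_count = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / history_count if history_count else 0.0


def hybrid_prob(p_trigram, p_structural, lam=0.5):
    """Assumed combination scheme: linear interpolation of two probabilities."""
    return lam * p_trigram + (1.0 - lam) * p_structural


# Toy corpus of placeholder tokens (not Tamil data from the paper).
corpus = [["a", "b", "c"], ["a", "b", "d"]]
tri, bi = train_trigram(corpus)
p_c = trigram_prob(tri, bi, "a", "b", "c")
```

In this toy corpus the history ("a", "b") occurs twice and is followed by "c" once, so the estimate for "c" is one half. A practical model would add smoothing for unseen trigrams, and the structural probability passed to hybrid_prob would have to come from a grammar-based component such as a probabilistic context-free grammar.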
@article{"International Journal of Information, Control and Computer Sciences:49514",
  author = "Selvam M and Natarajan A M and Thangarajan R",
  title = "Structural Parsing of Natural Language Text in Tamil Using Phrase Structure Hybrid Language Model",
  keywords = "Hybrid Language Model, Immediate Head Parsing, Lexicalized and Statistical Parsing, Natural Language Processing, Parts of Speech, Probabilistic Context Free Grammar, Tamil Language, Tree Bank",
  volume = "2",
  number = "3",
  pages = "628-7",
}