Named Entity Recognition using Support Vector Machine: A Language Independent Approach

Named Entity Recognition (NER) aims to classify each word of a document into predefined target named entity classes and is now considered fundamental to many Natural Language Processing (NLP) tasks, such as information retrieval, machine translation, information extraction, and question answering. This paper reports on the development of an NER system for Bengali and Hindi using a Support Vector Machine (SVM). Although this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its application to Indian languages (ILs) is very new. The system uses the contextual information of the words, along with a variety of features that help predict the four named entity (NE) classes: Person name, Location name, Organization name, and Miscellaneous name. We have used annotated corpora of 122,467 tokens of Bengali and 502,974 tokens of Hindi, tagged with the twelve NE classes defined as part of the IJCNLP-08 NER Shared Task for South and South East Asian Languages (SSEAL). In addition, we have manually annotated 150K wordforms of a Bengali news corpus developed from the web archive of a leading Bengali newspaper. We have also developed an unsupervised algorithm to generate lexical context patterns from a part of the unlabeled Bengali news corpus. These lexical patterns are used as SVM features to improve system performance. The NER system has been tested on gold standard test sets of 35K tokens for Bengali and 60K tokens for Hindi. Evaluation yields recall, precision, and f-score values of 88.61%, 80.12%, and 84.15%, respectively, for Bengali and 80.23%, 74.34%, and 77.17%, respectively, for Hindi. The use of context patterns improves the f-score by 5.13%. A statistical analysis (ANOVA) is also performed to compare the performance of the proposed NER system with that of the existing HMM-based system for both languages.
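
The reported f-scores can be checked against the reported recall and precision. The sketch below assumes that the f-score is the balanced harmonic mean of recall and precision. The abstract does not state this, but the reported figures are consistent with it. The function name `f_score` and the `reported` table are illustrative and do not come from the paper.

```python
def f_score(recall: float, precision: float) -> float:
    """Balanced F-score: harmonic mean of recall and precision.

    Assumption: equal weighting of recall and precision (beta = 1).
    Inputs and output are percentages.
    """
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (recall, precision, reported f-score), copied from the abstract.
reported = {
    "Bengali": (88.61, 80.12, 84.15),
    "Hindi": (80.23, 74.34, 77.17),
}

for language, (recall, precision, f_reported) in reported.items():
    f_computed = f_score(recall, precision)
    print(f"{language}: computed F = {f_computed:.2f}, reported F = {f_reported:.2f}")
```

Under this assumption, the computed values agree with the reported ones to two decimal places for both languages.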




