Mining News Sites to Create Special Domain News Collections
We present a method for creating special domain
collections from news sites. The method requires only a single
sample article as a seed. No prior corpus statistics are needed, and the
method is applicable to multiple languages. We examine various
similarity measures and the creation of document collections for
English and Japanese. The main contributions are as follows. First,
the algorithm can build special domain collections from as little as
one sample document. Second, unlike other algorithms, it does not
require a second "general" corpus to compute statistics. Third, in our
testing, the algorithm outperformed others in creating collections
made up of highly relevant articles.
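The abstract does not say which similarity measures are examined or how candidate articles are chosen, so the following is only a minimal sketch of the general idea: score each candidate article against a single seed article using statistics drawn from the two documents alone, with no second "general" corpus. Cosine similarity over raw term frequencies, the regex tokenizer, the function names, and the 0.3 threshold are all assumptions made for illustration, not the authors' method.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphabetic tokens; uses no corpus-level statistics.

    Assumption: a regex tokenizer is adequate for English only. Japanese
    text has no spaces between words and would need a morphological analyzer.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    # Counter returns 0 for missing terms, so unshared terms contribute nothing.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_collection(seed, candidates, threshold=0.3):
    """Keep candidates whose similarity to the seed meets the threshold.

    The threshold value is hypothetical and would need tuning.
    """
    seed_tf = term_frequencies(seed)
    return [
        article
        for article in candidates
        if cosine_similarity(seed_tf, term_frequencies(article)) >= threshold
    ]


seed_article = "Earthquake hits coastal city, earthquake damage reported"
candidate_articles = [
    "Strong earthquake damage in coastal city",
    "Stock markets rally on tech earnings",
]
collection = build_collection(seed_article, candidate_articles)
```

In this toy run only the first candidate shares terms with the seed, so it is the only article kept. A real system would also need to fetch articles from news sites and extract their text, which this sketch leaves out.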
@article{Bracewell:51293,
  author = "David B. Bracewell and Fuji Ren and Shingo Kuroiwa",
  title = "Mining News Sites to Create Special Domain News Collections",
  journal = "International Journal of Information, Control and Computer Sciences",
  abstract = "We present a method for creating special domain
    collections from news sites. The method requires only a single
    sample article as a seed. No prior corpus statistics are needed, and the
    method is applicable to multiple languages. We examine various
    similarity measures and the creation of document collections for
    English and Japanese. The main contributions are as follows. First,
    the algorithm can build special domain collections from as little as
    one sample document. Second, unlike other algorithms, it does not
    require a second ``general'' corpus to compute statistics. Third, in our
    testing, the algorithm outperformed others in creating collections
    made up of highly relevant articles.",
  keywords = "Information Retrieval, News, Special Domain Collections",
  volume = "2",
  number = "6",
  pages = "1842-8",
}