Mining News Sites to Create Special Domain News Collections

We present a method for building special domain collections from news sites. The method requires only a single sample article as a seed, needs no prior corpus statistics, and is applicable to multiple languages. We examine several similarity measures and the creation of document collections for English and Japanese. The main contributions are as follows. First, the algorithm can build special domain collections from as little as one sample document. Second, unlike other algorithms, it does not require a second "general" corpus from which to compute statistics. Third, in our testing the algorithm outperformed alternative approaches in creating collections of highly relevant articles.
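To make the seed-driven idea concrete, the sketch below shows one way a single seed article and a term-overlap similarity measure could be used to filter candidate news articles into a collection without any external "general" corpus statistics. It is a minimal illustration, not the paper's actual algorithm: the whitespace-free tokenizer, the cosine measure over raw term frequencies, and the threshold value are all illustrative assumptions (a Japanese pipeline would additionally need morphological segmentation).

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Tokenize into lowercase word tokens and count occurrences (English only)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two raw term-frequency vectors."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_collection(seed_text, candidate_articles, threshold=0.2):
    """Keep candidate articles whose similarity to the seed exceeds a threshold.

    `candidate_articles` is any iterable of article strings (e.g. pages
    crawled from a news site); the 0.2 threshold is purely illustrative.
    """
    seed_tf = term_frequencies(seed_text)
    collection = [seed_text]
    for article in candidate_articles:
        if cosine_similarity(seed_tf, term_frequencies(article)) >= threshold:
            collection.append(article)
    return collection
```

In this sketch all statistics come from the documents being compared, so only the seed article is needed up front; in practice the similarity measure and acceptance threshold would be chosen per the evaluation described in the paper.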
