Extraction of Significant Phrases from Text

Prospective readers can quickly judge whether a document is relevant to their information need when its significant phrases (keyphrases) are provided. Useful as keyphrases are, few documents have them assigned, and assigning them manually to existing documents is costly, so automatic keyphrase extraction is needed. This paper introduces a new domain-independent keyphrase extraction algorithm. The algorithm treats keyphrase extraction as a classification task and combines statistical and computational linguistics techniques, a new set of attributes, and a new machine learning method to distinguish keyphrases from non-keyphrases. Our experiments indicate that the algorithm performs better than other keyphrase extraction tools and significantly outperforms the AutoSummarize feature of Microsoft Word 2000. The experiments also confirm the algorithm's domain independence.
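
To make the classification framing concrete, the following minimal Python sketch shows one generic way to treat keyphrase extraction as a two-class problem. It is not the algorithm introduced in this paper: the abstract does not specify the new attributes or the new learning method, so the sketch assumes two commonly used attributes (TF x IDF and the relative position of the first occurrence) and a simple nearest-centroid classifier. All names in it (STOPWORDS, candidates, features, train, is_keyphrase) are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "and", "are", "as", "for", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def candidates(tokens, max_len=3):
    """Map each candidate phrase to (count, index of its first occurrence).

    Candidates are word n-grams of up to max_len words that neither
    start nor end with a stopword.
    """
    found = {}
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            gram = tokens[i:i + n]
            if len(gram) < n or gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrase = " ".join(gram)
            count, first = found.get(phrase, (0, i))
            found[phrase] = (count + 1, first)
    return found

def features(text, doc_freq, n_docs):
    """Return {phrase: (tf_idf, relative_first_position)} for one document.

    doc_freq maps a phrase to the number of corpus documents containing it;
    the IDF term is smoothed so that unseen phrases do not divide by zero.
    """
    tokens = tokenize(text)
    out = {}
    for phrase, (count, first) in candidates(tokens).items():
        tf = count / len(tokens)
        idf = math.log((n_docs + 1) / (doc_freq.get(phrase, 0) + 1))
        out[phrase] = (tf * idf, first / len(tokens))
    return out

def train(examples):
    """Compute one centroid per class from (feature_vector, label) pairs."""
    sums = {}
    counts = Counter()
    for vec, label in examples:
        counts[label] += 1
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, value in enumerate(vec):
            acc[k] += value
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def is_keyphrase(vec, centroids):
    """Assign the label of the nearest centroid (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))
```

In use, the feature vectors of phrases from documents with author-assigned keyphrases would be labelled True or False and passed to train; is_keyphrase would then be applied to every candidate of an unseen document. The two attributes have very different scales, so a practical version would normalise them or use a learner that is insensitive to scale.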

