Abstract: Prospective readers can quickly determine whether a document is relevant to their information need if the significant phrases (or keyphrases) in the document are provided. Although keyphrases are useful, not many documents have keyphrases assigned to them, and manually assigning keyphrases to existing documents is costly. Therefore, there is a need for automatic keyphrase extraction. This paper introduces a new domain-independent keyphrase extraction algorithm. The algorithm approaches the problem of keyphrase extraction as a classification task, and uses a combination of statistical and computational linguistics techniques, a new set of attributes, and a new machine learning method to distinguish keyphrases from non-keyphrases. The experiments indicate that this algorithm performs better than other keyphrase extraction tools and that it significantly outperforms Microsoft Word 2000's AutoSummarize feature. The domain independence of this algorithm has also been confirmed in our experiments.
Abstract: Automatic keyphrase extraction is useful for efficiently
locating specific documents in online databases. While several
techniques have been introduced over the years, improvement in
accuracy has been minimal. This research examines attribute scores for
author-supplied keyphrases to better understand how the scores affect
the accuracy rate of automatic keyphrase extraction. Five attributes
are chosen for examination: Term Frequency, First Occurrence, Last
Occurrence, Phrase Position in Sentences, and Term Cohesion
Degree. The results show that First Occurrence is the most reliable
attribute. Term Frequency, Last Occurrence and Term Cohesion
Degree display a wide range of variation but are still usable with
suggested tweaks. Only Phrase Position in Sentences shows a totally
unpredictable pattern. The results imply that the commonly used
ranking approach, which directly extracts the top-ranked phrases from
the candidate keyphrase list as the keyphrases, may not be reliable.
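
To make three of the attributes named above concrete, the following is a minimal sketch of how Term Frequency, First Occurrence, and Last Occurrence might be scored for a single candidate phrase. The abstract does not give the exact definitions, so the ones used here are assumptions: Term Frequency is taken as a raw count of matches, and the two occurrence attributes as the token offset of the first and last match divided by the document length. The function names (`tokenize`, `phrase_positions`, `attribute_scores`) are hypothetical and chosen only for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_positions(tokens, phrase_tokens):
    """Return the token offsets at which the phrase starts in the document."""
    n = len(phrase_tokens)
    return [
        i
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == phrase_tokens
    ]


def attribute_scores(text, phrase):
    """Score one candidate phrase against a document.

    Assumed definitions (not taken from the paper):
      term_frequency   -- number of times the phrase occurs
      first_occurrence -- offset of the first match / document length
      last_occurrence  -- offset of the last match / document length
    Returns None when the phrase is empty or does not occur.
    """
    tokens = tokenize(text)
    phrase_tokens = tokenize(phrase)
    if not phrase_tokens:
        return None
    positions = phrase_positions(tokens, phrase_tokens)
    if not positions:
        return None
    total = len(tokens)
    return {
        "term_frequency": len(positions),
        "first_occurrence": positions[0] / total,
        "last_occurrence": positions[-1] / total,
    }
```

Calling `attribute_scores(document_text, "keyphrase extraction")` returns a dictionary with the three scores, or `None` if the phrase never appears. Computing such scores for author-supplied keyphrases across a corpus is one way to inspect how widely each attribute varies before relying on it for ranking.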