Abstract: Scripts are one of the basic text resources to understand
broadcasting contents. Topic modeling is the method to get the
summary of the broadcasting contents from its scripts. Generally,
scripts represent contents descriptively with directions and speeches,
and provide scene segments that can be seen as semantic units.
Therefore, a script can be topic modeled by treating a scene segment
as a document. Because scene segments consist of speeches mainly,
however, relatively small co-occurrences among words in the scene
segments are observed. This causes inevitably the bad quality of
topics by statistical learning method. To tackle this problem, we
propose a method to improve topic quality with additional word
co-occurrence information obtained using scene similarities. The
main idea of improving topic quality is that the information that
two or more texts are topically related can be useful to learn high
quality of topics. In addition, more accurate topical representations
lead to get information more accurate whether two texts are related
or not. In this paper, we regard two scene segments are related
if their topical similarity is high enough. We also consider that
words are co-occurred if they are in topically related scene segments
together. By iteratively inferring topics and determining semantically
neighborhood scene segments, we draw a topic space represents
broadcasting contents well. In the experiments, we showed the
proposed method generates a higher quality of topics from Korean
drama scripts than the baselines.
Abstract: Text similarity measurement is a fundamental issue in
many textual applications such as document clustering, classification,
summarization and question answering. However, prevailing approaches
based on Vector Space Model (VSM) more or less suffer
from the limitation of Bag of Words (BOW), which ignores the semantic
relationship among words. Enriching document representation
with background knowledge from Wikipedia is proven to be an effective
way to solve this problem, but most existing methods still
cannot avoid similar flaws of BOW in a new vector space. In this
paper, we propose a novel text similarity measurement which goes
beyond VSM and can find semantic affinity between documents.
Specifically, it is a unified graph model that exploits Wikipedia as
background knowledge and synthesizes both document representation
and similarity computation. The experimental results on two different
datasets show that our approach significantly improves VSM-based
methods in both text clustering and classification.