Abstract: This paper proposes a method of learning topics for
broadcasting contents. There are two kinds of texts related to
broadcasting contents. One is a broadcasting script, which is a series of
texts including directions and dialogues. The other is blogposts, which
possesses relatively abstracted contents, stories, and diverse
information of broadcasting contents. Although two texts range over
similar broadcasting contents, words in blogposts and broadcasting
script are different. When unseen words appear, it needs a method to
reflect to existing topic. In this paper, we introduce a semantic
vocabulary expansion method to reflect unseen words. We expand
topics of the broadcasting script by incorporating the words in
blogposts. Each word in blogposts is added to the most semantically
correlated topics. We use word2vec to get the semantic correlation
between words in blogposts and topics of scripts. The vocabularies of
topics are updated and then posterior inference is performed to
rearrange the topics. In experiments, we verified that the proposed
method can discover more salient topics for broadcasting contents.
Abstract: Scripts are one of the basic text resources to understand
broadcasting contents. Topic modeling is the method to get the
summary of the broadcasting contents from its scripts. Generally,
scripts represent contents descriptively with directions and speeches,
and provide scene segments that can be seen as semantic units.
Therefore, a script can be topic modeled by treating a scene segment
as a document. Because scene segments consist of speeches mainly,
however, relatively small co-occurrences among words in the scene
segments are observed. This causes inevitably the bad quality of
topics by statistical learning method. To tackle this problem, we
propose a method to improve topic quality with additional word
co-occurrence information obtained using scene similarities. The
main idea of improving topic quality is that the information that
two or more texts are topically related can be useful to learn high
quality of topics. In addition, more accurate topical representations
lead to get information more accurate whether two texts are related
or not. In this paper, we regard two scene segments are related
if their topical similarity is high enough. We also consider that
words are co-occurred if they are in topically related scene segments
together. By iteratively inferring topics and determining semantically
neighborhood scene segments, we draw a topic space represents
broadcasting contents well. In the experiments, we showed the
proposed method generates a higher quality of topics from Korean
drama scripts than the baselines.