Improving Topic Quality of Scripts Using Scene-Similarity-Based Word Co-Occurrence

Scripts are a basic text resource for understanding broadcast content, and topic modeling is a standard method for summarizing that content from its scripts. Scripts typically describe content through stage directions and dialogue, and they are divided into scene segments that can be treated as semantic units. A script can therefore be topic-modeled by treating each scene segment as a document. However, because scene segments consist mainly of dialogue, relatively few word co-occurrences are observed within them, which inevitably degrades the quality of topics learned by statistical methods. To address this problem, we propose a method that improves topic quality by adding word co-occurrence information derived from scene similarities. The key idea is that knowing two or more texts are topically related helps in learning higher-quality topics; in turn, more accurate topical representations make it easier to determine whether two texts are related. In this paper, we regard two scene segments as related if their topical similarity is sufficiently high, and we treat two words as co-occurring if they appear together in topically related scene segments. By iteratively inferring topics and identifying semantically neighboring scene segments, we obtain a topic space that represents the broadcast content well. Experiments show that the proposed method generates higher-quality topics from Korean drama scripts than the baselines.
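The core mechanism described above, augmenting within-scene word co-occurrences with cross-scene co-occurrences between topically related scene segments, can be illustrated with a minimal sketch. This is not the paper's implementation: for brevity it uses term-frequency cosine similarity as a stand-in for the topical similarity the paper infers from a learned topic model, and the `threshold` parameter is a hypothetical relatedness cutoff.

```python
from collections import Counter
import math


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def augmented_cooccurrence(scenes, threshold=0.3):
    """Count word pairs co-occurring within each scene segment, then add
    cross-scene co-occurrences for every pair of scenes whose similarity
    exceeds the threshold (a stand-in for topical relatedness)."""
    bags = [Counter(scene) for scene in scenes]
    cooc = Counter()
    # Within-scene co-occurrence: every unordered word pair in a scene.
    for bag in bags:
        words = sorted(bag)
        for i, w in enumerate(words):
            for v in words[i + 1:]:
                cooc[(w, v)] += 1
    # Cross-scene co-occurrence: word pairs spanning two related scenes.
    for i in range(len(bags)):
        for j in range(i + 1, len(bags)):
            if cosine(bags[i], bags[j]) >= threshold:
                pairs = {tuple(sorted((w, v)))
                         for w in bags[i] for v in bags[j] if w != v}
                for p in pairs:
                    cooc[p] += 1
    return cooc
```

In the full method, the similarity and the co-occurrence counts would be refreshed alternately: topics inferred from the augmented counts yield better scene representations, which in turn yield better relatedness decisions on the next iteration.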
