Online Topic Model for Broadcasting Contents Using Semantic Correlation Information

This paper proposes a method of learning topics for
broadcasting contents. There are two kinds of texts related to
broadcasting contents. One is a broadcasting script, which is a series of
texts including directions and dialogues. The other is blogposts, which
contain relatively abstract content, stories, and diverse
information about the broadcasting contents. Although the two kinds of
text cover similar broadcasting contents, the words used in blogposts and
in broadcasting scripts differ. When unseen words appear, a method is
needed to reflect them in the existing topics. In this paper, we introduce a
semantic vocabulary expansion method to reflect unseen words. We expand
topics of the broadcasting script by incorporating the words in
blogposts. Each word in blogposts is added to the most semantically
correlated topics. We use word2vec to get the semantic correlation
between words in blogposts and topics of scripts. The vocabularies of
topics are updated and then posterior inference is performed to
rearrange the topics. In experiments, we verified that the proposed
method can discover more salient topics for broadcasting contents.
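
The vocabulary expansion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the correlation between a word and a topic is computed, so the sketch assumes it is the mean cosine similarity between the word's embedding and the embeddings of the topic's current words. The function name `expand_topics`, the topic names, and the two-dimensional toy vectors are hypothetical stand-ins for trained word2vec embeddings, and the subsequent posterior inference step is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_topics(topics, new_words, vectors):
    """Add each new word to the topic it is most similar to.

    Assumption: word-topic correlation is the mean cosine similarity
    between the new word and the words originally in the topic.
    Words without an embedding are skipped.
    """
    expanded = {name: list(words) for name, words in topics.items()}
    for word in new_words:
        if word not in vectors:
            continue  # no embedding available for this word

        def score(name):
            sims = [cosine(vectors[word], vectors[w])
                    for w in topics[name] if w in vectors]
            return sum(sims) / len(sims) if sims else float("-inf")

        best = max(topics, key=score)
        if word not in expanded[best]:
            expanded[best].append(word)
    return expanded

# Toy embeddings (hypothetical); real vectors would come from word2vec.
vectors = {
    "detective": [0.9, 0.1], "crime": [0.8, 0.2],
    "kiss": [0.1, 0.9], "wedding": [0.2, 0.8],
    "suspect": [0.85, 0.15], "romance": [0.15, 0.85],
}
# Topics learned from the script, each given as a list of its words.
topics = {"topic_0": ["detective", "crime"], "topic_1": ["kiss", "wedding"]}

# Words seen only in blogposts; "unknown" has no embedding and is skipped.
expanded = expand_topics(topics, ["suspect", "romance", "unknown"], vectors)
print(expanded)
```

In this sketch scores are computed against the original script topics only, so the order in which blogpost words arrive does not affect their assignment. Whether the proposed method scores against the original or the already-expanded topics is not specified in the abstract.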




