Web Search Engine Based Naming Procedure for Independent Topic

In recent years, the volume of document data has grown rapidly with
the spread of the Internet, and many methods have been studied for
extracting topics from large document collections. We previously
proposed Independent Topic Analysis (ITA), which uses Independent
Component Analysis to extract mutually independent topics from large
document data such as newspaper articles. ITA represents each topic
as a set of words. However, such a word set can be quite different
from the topic a user would have in mind. For example, the five words
with the highest independence for one topic are as follows.
Topic 1 = {"scor", "game", "lead", "quarter", "rebound"}. Topic 1
can be read as representing "SPORTS", but this name has to be
assigned by the user, because ITA itself cannot name topics.
Therefore, in this research, we propose a method that uses a web
search engine to give easily understandable names to the topics that
ITA produces as sets of words. Specifically, we submit a topic's set
of words as a search query and take the title of a web page in the
search results as the topic name. We also apply the proposed method
to several data sets and verify its effectiveness.
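
The naming step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the names build_query, name_topic, and fake_search are hypothetical, the search engine is passed in as a plain function that returns page titles in rank order, and taking the first non-empty title is an assumed selection rule. The canned search result is invented purely so that the sketch runs offline.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interface (an assumption, not a real library API): a function
# that takes a query string and returns result page titles in rank order.
SearchFn = Callable[[str], List[str]]


def build_query(topic_words: Sequence[str]) -> str:
    """Join the topical words into one space-separated search query."""
    return " ".join(w.strip() for w in topic_words if w.strip())


def name_topic(topic_words: Sequence[str], search: SearchFn) -> Optional[str]:
    """Return a candidate topic name, or None if no usable title is found.

    Assumption: the first non-empty title in the results is used as the name.
    """
    query = build_query(topic_words)
    if not query:
        return None
    for title in search(query):
        cleaned = title.strip()
        if cleaned:
            return cleaned
    return None


def fake_search(query: str) -> List[str]:
    """Stand-in for a real web search call; the titles are made up."""
    canned = {
        "scor game lead quarter rebound": ["  ", "Basketball scores and highlights"],
    }
    return canned.get(query, [])


topic1 = ["scor", "game", "lead", "quarter", "rebound"]
print(name_topic(topic1, fake_search))
```

Passing the search function as an argument keeps the sketch independent of any particular search engine client; a real client would replace fake_search with a function that has the same shape.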



