A Methodology for Automatic Diversification of Document Categories

The rapid growth of social media and Internet use has recently produced large numbers of documents containing unstructured data and text. These documents are usually categorized for the convenience of users. Because manual categorization cannot guarantee accuracy and demands considerable time and cost, many studies on automatic categorization have been conducted to mitigate its limitations. Unfortunately, most of these methods cannot categorize complex documents that cover multiple topics, because they assume that each document belongs to exactly one category. To overcome this limitation, some studies have attempted to assign each document to multiple categories. However, the learning process in these studies requires training on a multi-categorized document set. Traditional multi-categorization algorithms therefore cannot be applied to most documents unless a multi-categorized training set is provided. To address this, we review our novel methodology for extending the category of a single-categorized document to multiple categories, and then introduce a survey-based verification scenario for estimating the accuracy of our automatic categorization methodology.



