A Methodology for Automatic Diversification of Document Categories
Recently, numerous documents including large
volumes of unstructured data and text have been created because of the
rapid increase in the use of social media and the Internet. Usually,
these documents are categorized for the convenience of users. Because
the accuracy of manual categorization is not guaranteed, and such
categorization requires a large amount of time and incurs huge costs,
many studies on automatic categorization have been conducted to help
mitigate the limitations of manual categorization. Unfortunately, most
of these methods cannot be applied to categorize complex documents
with multiple topics because they work on the assumption that
individual documents can be categorized into single categories only.
Therefore, to overcome this limitation, some studies have attempted to
categorize each document into multiple categories. However, the
learning process employed in these studies involves training using a
multi-categorized document set. These methods, which rely on traditional
multi-categorization algorithms, therefore cannot be applied to the
multi-categorization of most documents unless multi-categorized
training sets are provided. To overcome this limitation, in this study, we
review our novel methodology for extending the category of a
single-categorized document to multiple categories, and then
introduce a survey-based verification scenario for estimating the
accuracy of our automatic categorization methodology.
[1] J. Hong, N. Kim, and S. Lee, “A Methodology for Automatic
Multi-Categorization of Single-Categorized Documents,” Journal of
Intelligent Information Systems, vol. 20, no. 3, pp. 77-92, Sep. 2014.
[2] I. H. Witten, Text Mining, Practical Handbook of Internet Computing,
CRC Press, 2004.
[3] J. Hong, H. Choi, H. Han, J. Kim, E. Yu, S. Lim, and N. Kim, “A Data
Analysis-based Hybrid Methodology for Selecting Pending National
Issue Keywords,” Entrue Journal of Information Technology, vol. 13, pp.
97-111, Jun. 2014.
[4] R. J. Mooney, and R. Bunescu, “Mining Knowledge from Text Using
Information Extraction,” ACM SIGKDD Explorations, vol. 7, pp. 3-10,
Jun. 2006.
[5] S. Song, J. Yu, and E. Kim, “Offering System For Major Article Using
Text Mining and Data Mining,” Proceedings of the 32nd annual
conference on Korea Information Processing Society, pp. 733-734, 2009.
[6] E. Yu, J. Kim, C. Lee, and N. Kim, “Using Ontologies for Semantic Text
Mining,” The Journal of Information Systems, vol. 21, pp. 137-161, Sep.
2012.
[7] D. Metzler, Y. Bernstein, W. B. Croft, A. Moffat, and J. Zobel,
“Similarity Measures for Tracking Information Flow,” Proceedings of
CIKM, Bremen, Germany, 2005.
[8] C. J. van Rijsbergen, Information Retrieval, 2nd edition, Butterworth,
1979.
[9] F. Sebastiani, Classification of Text, Automatic, The Encyclopedia of
Language and Linguistics 14, 2nd edition, Elsevier Science Pub, 2006.
[10] G. Salton, A. Wong, and C. S. Yang, “A Vector Space Model for
Automatic Indexing,” Communications of the ACM, vol. 18, pp. 613-620,
Nov. 1975.
[11] R. Albright, “Taming Text with the SVD,” SAS Institute Inc., 2006.
[12] G. Salton, and M. J. McGill, Introduction to Modern Information
Retrieval, McGraw Hill, 1983.
[13] C. Apte, and F. Damerau, “Automated Learning of Decision Rules for
Text Categorization,” ACM Transactions on Information Systems, vol.
12, pp. 233-251, Jul. 1994.
[14] J. Han, and M. Kamber, Data Mining: Concepts and Techniques, 3rd ed.,
Morgan Kaufmann Publishers, 2011.
[15] H. Lim, and K. Nam, “Computer Science: Improving of KNN - Based
Korean Text Classifier by Using Heuristic Information,” The Journal of
Korean Association of Computer Education, vol. 5, pp. 37-44, Jul. 2002.
[16] Y. Yang, “Expert network: Effective and Efficient Learning from Human
Decisions in Text Categorization and Retrieval,” Proceedings of the 17th
International Conference on Research and Development in Information
Retrieval, SIGIR 94, pp. 13-22, 1994.
[17] D. D. Lewis, and M. Ringuette, “Comparison of Two Learning
Algorithms for Text Categorization,” Proceedings of the 3rd Annual
Symposium on Document Analysis and Information Retrieval, pp. 81-93,
1994.
[18] E. Wiener, J. O. Pedersen, and A. S. Weigend, “A Neural Network
Approach to Topic Spotting,” Proceedings of the 4th Annual Symposium
on Document Analysis and Information Retrieval, 1995.
[19] T. Joachims, Text Categorization with Support Vector Machines:
Learning with Many Relevant Features, Springer Berlin Heidelberg, pp.
137-142, 1998.
[20] J. In, J. Kim, and S. Chae, “Combined Feature Set and Hybrid Feature
Selection Method for Effective Document Classification,” Journal of
Internet Computing and Services, vol. 14, pp. 49-57, Oct. 2013.
[21] H. Lim, and D. Kim, “Using Mutual Information for Selecting Features in
Multi-label Classification,” Journal of KIISE: Software and Applications,
vol. 39, pp. 806-811, Oct. 2012.
[22] J. Yun, J. Lee, and D. Kim, “Feature Selection in Multi-label
Classification Using NSGA-II Algorithm,” Journal of KIISE: Software
and Applications, vol. 40, pp. 133-140, Mar. 2013.
@article{Kim:71120,
  author = "Dasom Kim and Chen Liu and Myungsu Lim and Soo-Hyeon Jeon and Byeoung Kug Jeon and Kee-Young Kwahk and Namgyu Kim",
  title = "A Methodology for Automatic Diversification of Document Categories",
  journal = "International Journal of Information, Control and Computer Sciences",
  abstract = "Recently, numerous documents including large
volumes of unstructured data and text have been created because of the
rapid increase in the use of social media and the Internet. Usually,
these documents are categorized for the convenience of users. Because
the accuracy of manual categorization is not guaranteed, and such
categorization requires a large amount of time and incurs huge costs,
many studies on automatic categorization have been conducted to help
mitigate the limitations of manual categorization. Unfortunately, most
of these methods cannot be applied to categorize complex documents
with multiple topics because they work on the assumption that
individual documents can be categorized into single categories only.
Therefore, to overcome this limitation, some studies have attempted to
categorize each document into multiple categories. However, the
learning process employed in these studies involves training using a
multi-categorized document set. These methods, which rely on traditional
multi-categorization algorithms, therefore cannot be applied to the
multi-categorization of most documents unless multi-categorized
training sets are provided. To overcome this limitation, in this study, we
review our novel methodology for extending the category of a
single-categorized document to multiple categories, and then
introduce a survey-based verification scenario for estimating the
accuracy of our automatic categorization methodology.",
  keywords = "Big Data Analysis, Document Classification, Text Mining, Topic Analysis",
  volume = "9",
  number = "10",
  pages = "2207-6"
}