Evolutionary Feature Selection for Text Documents using the SVM
Text categorization is the problem of classifying text
documents into a set of predefined classes. After a preprocessing
step, the documents are typically represented as large sparse vectors.
When training classifiers on large collections of documents, both the
time and memory requirements can be prohibitive. This justifies
the application of feature selection methods to reduce the
dimensionality of the document-representation vector. In this paper,
we present three feature selection methods: Information Gain,
Support Vector Machine feature selection (called SVM_FS), and
Genetic Algorithm with SVM (called GA_SVM). We show that the
best results were obtained with the GA_SVM method for a relatively
small dimension of the feature vector.
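
As an illustration of the first of the three methods named above, the following is a minimal sketch of Information Gain feature ranking on a toy term-count matrix. It is not the authors' implementation: the function names (entropy, information_gain, select_top_k), the binary present/absent split on each term, and the toy data are all assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts: Dict[int, int] = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(docs: Sequence[Sequence[int]],
                     labels: Sequence[int],
                     feature: int) -> float:
    """Entropy reduction obtained by splitting on presence/absence of one term."""
    present = [y for x, y in zip(docs, labels) if x[feature] > 0]
    absent = [y for x, y in zip(docs, labels) if x[feature] == 0]
    n = len(labels)
    conditional = ((len(present) / n) * entropy(present)
                   + (len(absent) / n) * entropy(absent))
    return entropy(labels) - conditional


def select_top_k(docs: Sequence[Sequence[int]],
                 labels: Sequence[int],
                 k: int) -> List[int]:
    """Indices of the k terms with the highest information gain."""
    n_features = len(docs[0])
    scores = [information_gain(docs, labels, j) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


# Hypothetical toy data: 4 documents x 3 terms, two classes.
DOCS = [
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 0],
]
LABELS = [0, 0, 1, 1]

if __name__ == "__main__":
    # Terms 0 and 1 separate the classes perfectly; term 2 carries no information.
    print(select_top_k(DOCS, LABELS, k=2))
```

Keeping only the top-ranked terms shrinks the document vectors before a classifier is trained, which is the dimensionality reduction the abstract motivates.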
@article{"International Journal of Information, Control and Computer Sciences:64869",
  author = "Daniel I. Morariu and Lucian N. Vintan and Volker Tresp",
  title = "Evolutionary Feature Selection for Text Documents using the SVM",
  abstract = "Text categorization is the problem of classifying text documents into a set of predefined classes. After a preprocessing step, the documents are typically represented as large sparse vectors. When training classifiers on large collections of documents, both the time and memory requirements can be prohibitive. This justifies the application of feature selection methods to reduce the dimensionality of the document-representation vector. In this paper, we present three feature selection methods: Information Gain, Support Vector Machine feature selection (called SVM_FS), and Genetic Algorithm with SVM (called GA_SVM). We show that the best results were obtained with the GA_SVM method for a relatively small dimension of the feature vector.",
  keywords = "Feature Selection, Learning with Kernels, Support Vector Machine, Genetic Algorithm, and Classification.",
  volume = "2",
  number = "9",
  pages = "3279-7",
}