The Influence of Preprocessing Parameters on Text Categorization

Text categorization (the assignment of natural-language texts to predefined categories) is an important and extensively studied problem in machine learning. Popular techniques for this task combine a variety of preprocessing and learning algorithms, many of which require tuning nontrivial internal parameters. Although partial studies are available, many authors do not report the parameter values used in their experiments, or explain why those values were chosen over others. The goal of this work is therefore to provide a more thorough comparison of preprocessing parameters and their mutual influence, and to report interesting observations and results.




