A Content Vector Model for Text Classification

Latent Semantic Indexing (LSI), a popular rank-reduced vector space approach, has been used in information retrieval and other applications. This paper presents an LSI-based content vector model for text classification that constructs multiple augmented category LSI spaces and classifies documents by their content. The model integrates class-discriminative information from the training data and is equipped with several pertinent feature selection and text classification algorithms. The proposed classifier has been applied to email classification, and experiments on a benchmark spam testing corpus (PU1) show that the approach is a competitive alternative to email classifiers based on the well-known SVM and naïve Bayes algorithms.
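The general idea of building one LSI space per category can be illustrated with a minimal sketch. This is not the paper's algorithm: the augmentation of the category spaces, the feature selection step, and the actual classification rule are not described in the abstract and are omitted here. The sketch assumes a plain truncated SVD per category and scores a document by how much of its length the category's rank-k subspace captures. The function names (`category_basis`, `subspace_score`, `classify`) and the toy term-by-document matrices are hypothetical.

```python
import numpy as np


def category_basis(term_doc: np.ndarray, k: int) -> np.ndarray:
    """Return the top-k left singular vectors of a terms x documents matrix.

    These vectors span the rank-k LSI space of one category.
    """
    U, _s, _Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k]


def subspace_score(basis: np.ndarray, doc: np.ndarray) -> float:
    """Fraction of the document vector's length captured by the LSI space.

    This scoring rule is an assumption made for illustration only.
    """
    norm = np.linalg.norm(doc)
    if norm == 0.0:
        return 0.0
    return float(np.linalg.norm(basis.T @ doc) / norm)


def classify(doc: np.ndarray, bases: dict) -> str:
    """Assign the category whose LSI space best captures the document."""
    return max(bases, key=lambda label: subspace_score(bases[label], doc))


# Toy data (hypothetical): 6 terms, 3 training documents per category.
# Rows are terms, columns are documents.
spam_docs = np.array([
    [2, 1, 3],
    [1, 2, 1],
    [1, 0, 2],
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
], dtype=float)

legit_docs = np.array([
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
    [2, 1, 1],
    [1, 2, 0],
    [1, 1, 3],
], dtype=float)

bases = {
    "spam": category_basis(spam_docs, k=2),
    "legitimate": category_basis(legit_docs, k=2),
}

spam_like = np.array([2, 1, 1, 0, 0, 0], dtype=float)
legit_like = np.array([0, 0, 0, 2, 1, 1], dtype=float)

print(classify(spam_like, bases))
print(classify(legit_like, bases))
```

Keeping a separate basis per category means each class is reduced independently, so the dominant directions of one class are not averaged away by the other. A single shared LSI space would be the more conventional alternative.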
