As a popular rank-reduced vector space approach,
Latent Semantic Indexing (LSI) has been used in information
retrieval and other applications. In this paper, an LSI-based content
vector model for text classification is presented, which constructs
multiple augmented category LSI spaces and classifies documents by their
content. The model integrates the class discriminative information
from the training data and is equipped with several pertinent feature
selection and text classification algorithms. The proposed classifier
has been applied to email classification, and experiments on a
benchmark spam testing corpus (PU1) show that the approach is a
competitive alternative to email classifiers based on the well-known
SVM and naïve Bayes algorithms.
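
The idea of one LSI space per category can be illustrated with a minimal sketch. This is not the paper's model: the abstract does not specify how the category spaces are augmented or which feature selection and classification algorithms are used, so the sketch simply builds a truncated SVD per category from raw term counts and assigns a document to the category whose subspace retains the largest share of its norm. All function names, the toy data, and the choice k=2 are assumptions made for illustration.

```python
import numpy as np

def build_vocab(docs):
    """Map each distinct token to a row index (hypothetical helper)."""
    vocab = sorted({tok for d in docs for tok in d.split()})
    return {tok: i for i, tok in enumerate(vocab)}

def term_doc_matrix(docs, vocab):
    """Raw term-frequency matrix: rows are terms, columns are documents."""
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for tok in d.split():
            if tok in vocab:
                A[vocab[tok], j] += 1.0
    return A

def lsi_basis(A, k):
    """First k left singular vectors of A, i.e. a rank-k LSI term space."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :k]

def classify(doc, vocab, bases):
    """Return the category whose LSI subspace captures most of the document.

    Assumption: the score is the norm of the projection onto each category
    basis, divided by the norm of the document vector.
    """
    q = term_doc_matrix([doc], vocab)[:, 0]
    norm = np.linalg.norm(q)
    if norm == 0.0:
        return None  # the document shares no terms with the training data
    scores = {c: np.linalg.norm(U.T @ q) / norm for c, U in bases.items()}
    return max(scores, key=scores.get)

# Toy training data, invented for this sketch (not drawn from PU1).
train = {
    "spam": ["win cash prize now", "cheap cash offer now", "win free prize offer"],
    "legit": ["meeting agenda for monday", "project report for meeting",
              "monday project agenda"],
}
vocab = build_vocab([d for docs in train.values() for d in docs])
bases = {c: lsi_basis(term_doc_matrix(docs, vocab), k=2)
         for c, docs in train.items()}

print(classify("free cash prize", vocab, bases))
print(classify("agenda for project", vocab, bases))
```

Because the two toy categories share no vocabulary, each test message projects almost entirely onto one subspace. On real email, the choice of term weighting and of k would matter considerably.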
@article{Jiang:61009,
  author = "Eric Jiang",
  title = "A Content Vector Model for Text Classification",
  journal = "International Journal of Information, Control and Computer Sciences",
  abstract = "As a popular rank-reduced vector space approach,
Latent Semantic Indexing (LSI) has been used in information
retrieval and other applications. In this paper, an LSI-based content
vector model for text classification is presented, which constructs
multiple augmented category LSI spaces and classifies documents by their
content. The model integrates the class discriminative information
from the training data and is equipped with several pertinent feature
selection and text classification algorithms. The proposed classifier
has been applied to email classification, and experiments on a
benchmark spam testing corpus (PU1) show that the approach is a
competitive alternative to email classifiers based on the well-known
SVM and naïve Bayes algorithms.",
  keywords = "Feature Selection, Latent Semantic Indexing, Text Classification, Vector Space Model",
  volume = "2",
  number = "1",
  pages = "161--165",
}