Abstract: Text categorization is the problem of classifying text
documents into a set of predefined classes. After a preprocessing
step, the documents are typically represented as large sparse vectors.
When training classifiers on large collections of documents, both
time and memory requirements can be prohibitive. This justifies
the application of feature selection methods to reduce the
dimensionality of the document-representation vector. In this paper,
we present three feature selection methods: Information Gain,
Support Vector Machine feature selection (called SVM_FS) and
Genetic Algorithm with SVM (called GA_SVM). We show that the
best results were obtained with the GA_SVM method for a relatively
small dimension of the feature vector.
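As an illustration of the first of these methods, the following is a minimal sketch of Information Gain feature selection, assuming each document is reduced to a set of tokens and each term is scored by how much knowing its presence or absence reduces class entropy. The function names (`entropy`, `information_gain`, `select_features`) and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting documents on presence of `term`.

    `docs` is a list of token sets (an assumption made for this sketch).
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


def select_features(docs, labels, k):
    """Keep the k terms with the highest information gain."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(
        vocabulary,
        key=lambda t: information_gain(docs, labels, t),
        reverse=True,
    )
    return ranked[:k]


# Toy corpus: two classes, four documents.
docs = [
    {"oil", "price"},
    {"oil", "export"},
    {"wheat", "price"},
    {"wheat", "harvest"},
]
labels = ["energy", "energy", "grain", "grain"]
print(select_features(docs, labels, 2))  # ['oil', 'wheat']
```

In this toy corpus the term "price" occurs equally often in both classes, so its gain is zero and it is dropped first, which is the behaviour a dimensionality-reducing filter of this kind is meant to show.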
Abstract: In text categorization, the most widely used method
for document representation is based on word frequency vectors and
is called the Vector Space Model (VSM). This representation relies
only on the words in the documents and therefore loses any “word
context” information found in the document. In this article we
compare the classical method of document representation with a
method called the Suffix Tree Document Model (STDM), which is
based on representing documents in the suffix tree format. For the
STDM model we propose a new approach to document
representation and a new formula for computing the similarity
between two documents. Specifically, we propose building the suffix
tree for only two documents at a time. This approach is faster, has
lower memory consumption, and uses the entire document
representation without requiring methods for discarding nodes. For
this method we also propose a formula for computing the similarity
between documents, which substantially improves clustering quality.
This representation method was validated using Hierarchical
Agglomerative Clustering (HAC). In this context we also examine
the influence of stemming in the document preprocessing step and
highlight the difference between using similarity and dissimilarity
measures to find “closer” documents.
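The loss of word context in the VSM can be made concrete with a small sketch. It assumes whitespace tokenization and raw term frequencies, and it uses shared word bigrams only as a rough stand-in for the common phrases a suffix tree would expose. It is not the STDM similarity formula proposed in the paper, and all function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """VSM-style bag of words: term -> count (whitespace tokens, lowercased)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shared_bigrams(text_a, text_b):
    """Word bigrams common to both texts (an order-sensitive stand-in only)."""
    def bigrams(text):
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return bigrams(text_a) & bigrams(text_b)


doc_a = "dog bites man"
doc_b = "man bites dog"

# The two bags of words are identical, so the VSM treats the documents
# as maximally similar even though they say opposite things.
print(cosine_similarity(term_frequencies(doc_a), term_frequencies(doc_b)))

# An order-sensitive comparison finds no shared two-word phrase.
print(len(shared_bigrams(doc_a, doc_b)))  # 0
```

A dissimilarity measure such as a distance can be derived from the same vectors (for example, one minus the cosine), in which case “closer” documents are those with the smallest value rather than the largest.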
Abstract: Text categorization is the problem of classifying text
documents into a set of predefined classes. In this paper, we
investigate three approaches to building a meta-classifier in order to
increase classification accuracy. The basic idea is to learn a
meta-classifier that optimally selects the best component classifier for each
data point. The experimental results show that combining classifiers
can significantly improve the accuracy of classification and that our
meta-classification strategy gives better results than each individual
classifier. For 7083 Reuters text documents we obtained
classification accuracies of up to 92.04%.
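The idea of selecting one component classifier per data point can be sketched as follows. This is only an illustrative selection rule, assuming each component classifier returns a label together with a confidence score and that the most confident one is chosen. The abstract does not describe the three approaches actually investigated, so the helper names (`keyword_classifier`, `select_and_classify`) and the selection-by-confidence rule are hypothetical.

```python
def keyword_classifier(keyword, label, fallback="unknown"):
    """Build a toy component classifier (hypothetical, for illustration).

    It predicts `label` when `keyword` occurs in the token list and reports
    the keyword's relative frequency as its confidence.
    """
    def classify(tokens):
        hits = tokens.count(keyword)
        confidence = hits / len(tokens) if tokens else 0.0
        return (label if hits else fallback, confidence)
    return classify


def select_and_classify(classifiers, tokens):
    """Pick, for this one data point, the most confident component classifier.

    Returns the chosen label and the index of the classifier that produced it.
    Ties go to the classifier listed first.
    """
    best_index, (label, _confidence) = max(
        enumerate(clf(tokens) for clf in classifiers),
        key=lambda item: item[1][1],
    )
    return label, best_index


classifiers = [
    keyword_classifier("oil", "energy"),
    keyword_classifier("wheat", "grain"),
]

print(select_and_classify(classifiers, ["wheat", "harvest", "rises"]))
# ('grain', 1)
```

A learned meta-classifier would replace the fixed confidence rule with a model trained to predict which component classifier is most likely to be correct for a given document.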