The Usefulness of Logical Structure in Flexible Document Categorization

This paper presents a new approach to automatic document categorization. By exploiting the logical structure of the document, our approach assigns an HTML document to one or more categories (thesis, paper, call for papers, email, ...). From a set of training documents, it generates a set of rules that are then used to categorize new documents. The flexibility of the approach comes from associating each rule with a weight that represents its importance in discriminating between the possible categories. This weight is dynamically updated each time a new document is categorized. Experiments with the proposed approach give satisfactory results.
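The idea of weighted rules over logical structure can be illustrated with a minimal sketch. The abstract does not specify the form of the rules, how they are learned, or how the weights are updated, so everything below is an assumption made for illustration: the `Rule` fields, the tag-count features, the `threshold`, and the additive `step` update are all hypothetical and are not the authors' method.

```python
from collections import Counter
from dataclasses import dataclass
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags as a crude stand-in for logical structure."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


@dataclass
class Rule:
    # Hypothetical rule form: "if `tag` occurs at least `min_count` times,
    # vote for `category` with strength `weight`".
    tag: str
    min_count: int
    category: str
    weight: float = 1.0


def categorize(html, rules, threshold=1.0, step=0.1):
    """Assign every category whose fired rules sum to at least `threshold`,
    then adjust the weights of the fired rules (assumed update scheme)."""
    counter = TagCounter()
    counter.feed(html)

    # A rule fires when its structural condition holds for this document.
    fired = [r for r in rules if counter.tags[r.tag] >= r.min_count]

    scores = Counter()
    for rule in fired:
        scores[rule.category] += rule.weight
    assigned = sorted(c for c, s in scores.items() if s >= threshold)

    # Assumed dynamic update: reinforce fired rules that agree with the
    # decision, weaken fired rules that voted for a rejected category.
    for rule in fired:
        if rule.category in assigned:
            rule.weight += step
        else:
            rule.weight = max(0.0, rule.weight - step)
    return assigned


rules = [
    Rule("h1", 1, "paper", 0.6),
    Rule("table", 2, "paper", 0.5),
    Rule("a", 3, "call for papers", 0.4),
]
page = "<h1>Title</h1><table></table><table></table><a href='x'>link</a>"
result = categorize(page, rules)
```

In this sketch a document may receive several categories because each category is scored independently against the threshold. In the approach described above the rules are generated from training documents; here they are written by hand only to keep the example short.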



