The Usefulness of Logical Structure in Flexible Document Categorization
This paper presents a new approach to automatic
document categorization. Exploiting the logical structure of the
document, our approach assigns an HTML document to one or more
categories (thesis, paper, call for papers, email, ...). Using a set of
training documents, it generates a set of rules that are then used to
categorize new documents. The flexibility of the approach comes from
associating a weight with each rule, representing its importance in
discriminating between the possible categories. This weight is
dynamically modified each time a new document is categorized.
Experiments with the proposed approach give satisfactory
results.
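The idea of weighted rules over logical structure can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how rules are generated or how weights are updated, so the rule format (a set of required HTML tags), the `threshold` and `step` parameters, and the additive update are all assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Collects the names of the tags that occur in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def structure_features(html):
    """Return the set of tag names found in the document (a crude
    stand-in for its logical structure)."""
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    return parser.tags


class Rule:
    """Hypothetical rule: if all required tags are present, vote for a category."""

    def __init__(self, required, category, weight=1.0):
        self.required = set(required)
        self.category = category
        self.weight = weight


def categorize(html, rules, threshold=1.0):
    """Sum the weights of the rules that fire, per category, and return
    every category whose score reaches the threshold (so a document may
    receive several categories), together with the rules that fired."""
    features = structure_features(html)
    scores = defaultdict(float)
    fired = []
    for rule in rules:
        if rule.required <= features:
            scores[rule.category] += rule.weight
            fired.append(rule)
    categories = sorted(c for c, s in scores.items() if s >= threshold)
    return categories, fired


def update_weights(fired, true_categories, step=0.1):
    """Assumed update scheme: reward rules that voted for a correct
    category, penalize the others, never going below zero."""
    for rule in fired:
        if rule.category in true_categories:
            rule.weight += step
        else:
            rule.weight = max(0.0, rule.weight - step)


# Illustrative rules and document, invented for this example.
rules = [
    Rule({"h1", "table"}, "call for papers"),
    Rule({"h1", "h2", "cite"}, "paper"),
]
doc = "<html><body><h1>T</h1><h2>Intro</h2><cite>x</cite><table></table></body></html>"

categories, fired = categorize(doc, rules)
update_weights(fired, {"paper"})  # feedback: the document is a paper
```

In this sketch both rules fire on the first pass, so the document receives both categories. After the feedback step the "call for papers" rule falls below the threshold, and a second call to `categorize` on the same document returns only "paper".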
@article{"International Journal of Information, Control and Computer Sciences:53341",
  author = "Jebari Chaker and Ounalli Habib",
  title = "The Usefulness of Logical Structure in Flexible Document Categorization",
  abstract = "This paper presents a new approach to automatic document categorization. Exploiting the logical structure of the document, our approach assigns an HTML document to one or more categories (thesis, paper, call for papers, email, ...). Using a set of training documents, it generates a set of rules that are then used to categorize new documents. The flexibility of the approach comes from associating a weight with each rule, representing its importance in discriminating between the possible categories. This weight is dynamically modified each time a new document is categorized. Experiments with the proposed approach give satisfactory results.",
  keywords = "categorization rule, document categorization, flexible categorization, logical structure.",
  volume = "1",
  number = "12",
  pages = "3830-4",
}