Bottom Up Text Mining through Hierarchical Document Representation

Most existing text mining approaches are designed with the transaction database model in mind. The mined dataset is thus structured using a single concept, the "transaction", while the whole dataset is modeled using the "set" abstract type. In such cases, the structure of the whole dataset and the relationships among the transactions themselves are not modeled and, consequently, are not considered in the mining process. We believe that taking into account the structural properties of hierarchically structured information (e.g., textual documents) in the mining process can lead to better results. For this purpose, a hierarchical association rule mining approach for textual documents is proposed in this paper, and the classical set-oriented mining approach is reconsidered in favor of a Directed Acyclic Graph (DAG) oriented approach. Natural language processing techniques are used to obtain the DAG structure. Based on this graph model, a hierarchical bottom-up algorithm is proposed. The main idea is that each node is mined together with its parent node.
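As an illustration of the main idea only, the following minimal Python sketch visits a small document hierarchy bottom-up and mines each node together with its parent. It is not the algorithm of the paper: the `Node` class, the function names, the rule that a node's transaction is the union of its own terms and its parent's terms, and the restriction to frequent term pairs are all assumptions made for this example. A tree is used as the simplest special case of a DAG.

```python
from collections import Counter
from itertools import combinations


class Node:
    """One unit of a document hierarchy (e.g. section, paragraph, sentence)."""

    def __init__(self, name, terms, children=()):
        self.name = name
        self.terms = frozenset(terms)
        self.children = list(children)


def bottom_up_transactions(root):
    """Yield (node name, transaction), visiting children before their parent.

    Assumption: a node's transaction is its own terms plus its parent's terms.
    """
    def visit(node, parent):
        for child in node.children:
            yield from visit(child, node)
        parent_terms = parent.terms if parent is not None else frozenset()
        yield node.name, node.terms | parent_terms

    yield from visit(root, None)


def frequent_pairs(root, min_support):
    """Count term pairs over all transactions; keep those seen >= min_support times."""
    counts = Counter()
    for _, transaction in bottom_up_transactions(root):
        counts.update(combinations(sorted(transaction), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical toy document: one root, two sections, two paragraphs.
doc = Node("doc", {"mining"}, [
    Node("sec1", {"text", "graph"}, [
        Node("p1", {"rule"}),
        Node("p2", {"rule", "node"}),
    ]),
    Node("sec2", {"text"}),
])

order = [name for name, _ in bottom_up_transactions(doc)]
pairs = frequent_pairs(doc, min_support=2)
print(order)
print(pairs)
```

In this toy example the pair ("graph", "rule") reaches the support threshold only because each paragraph is mined with the terms of its enclosing section; a flat, set-oriented view of the paragraphs alone would never see the two terms together.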



