Mining Association Rules from Unstructured Documents

This paper presents EART (Extract Association Rules from Text), a system for discovering association rules from collections of unstructured documents. EART processes text only, not images or figures. It discovers association rules among the keywords that label the documents in a collection. The main characteristic of EART is that it integrates XML technology (to transform unstructured documents into structured documents) with an Information Retrieval weighting scheme (TF-IDF) and a Data Mining technique for association rule extraction. EART relies on word-level features to extract association rules. It consists of four phases: a structure phase, an index phase, a text mining phase, and a visualization phase. Our work analyzes the keywords in the extracted association rules in two ways: through the co-occurrence of the keywords in one sentence of the original text, and through the existence of the keywords in one sentence without co-occurrence. Experiments were performed on a collection of scientific documents selected from MEDLINE that relate to the outbreak of the H5N1 avian influenza virus.
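
As an illustration only, the index and text mining phases described above might be sketched as follows. The abstract does not give EART's actual algorithms or thresholds, so the function names `tfidf_keywords` and `pair_rules`, the `top_k` cutoff, the support and confidence thresholds, and the restriction to rules between keyword pairs are all assumptions made for this sketch.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest TF-IDF terms of each tokenized document.

    Hypothetical stand-in for an index phase: docs is a list of token lists.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(set(ranked[:top_k]))
    return keywords


def pair_rules(keyword_sets, min_support=0.5, min_confidence=0.8):
    """Return rules (x, y, support, confidence) meaning 'x => y'.

    Hypothetical stand-in for a text mining phase, limited to keyword pairs.
    """
    n = len(keyword_sets)
    item_count = Counter(k for ks in keyword_sets for k in ks)
    pair_count = Counter(
        pair for ks in keyword_sets for pair in combinations(sorted(ks), 2)
    )
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # fraction of documents containing both keywords
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_count[x]  # estimate of P(y | x)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)
```

Calling `pair_rules(tfidf_keywords(docs))` on a list of tokenized documents chains the two steps. A full association rule miner would also consider keyword sets larger than pairs; the pairwise restriction here only keeps the sketch short.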
