A Study on Finding Similar Document with Multiple Categories

Searching for similar documents and document
management are important topics in text mining. One of the
most important parts of similar-document research is the
classification or clustering of documents. This study presents a
similar document search approach that also addresses the
case of documents belonging to multiple categories (the multiple
categories problem). The proposed method, which is based on Fuzzy
Similarity Classification (FSC), is compared with the Rocchio
algorithm and the naive Bayes method, both of which are widely used
in text mining. Empirical results show that the proposed method is
quite successful and can be applied effectively. In the second stage,
a multiple categories vector method is used, based on how frequently
categories appear together. Empirical results show that performance
improves almost twofold when the proposed method is compared with the
classical approach.
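
The abstract does not specify how the multiple categories vector is built, so the following is only a minimal illustrative sketch of one way to derive a per-category vector from how often categories are seen together. The function names (`cooccurrence_counts`, `category_vector`), the normalization by the category's own document count, and the toy labels are all assumptions made for illustration, not the method used in the study.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_labels):
    """Count how often each unordered pair of categories labels the same document."""
    counts = Counter()
    for labels in doc_labels:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(labels)), 2):
            counts[(a, b)] += 1
    return counts


def category_vector(category, categories, counts, doc_labels):
    """Co-occurrence frequency of `category` with every category.

    Each entry is normalized by the number of documents carrying `category`
    (an assumed normalization, chosen only for this sketch).
    """
    n = sum(1 for labels in doc_labels if category in labels)
    vec = []
    for other in categories:
        if other == category:
            vec.append(1.0 if n else 0.0)
        else:
            pair = tuple(sorted((category, other)))
            vec.append(counts[pair] / n if n else 0.0)
    return vec


# Hypothetical toy data: each set holds the categories assigned to one document.
doc_labels = [
    {"sports", "health"},
    {"sports"},
    {"health", "science"},
    {"sports", "health"},
]
categories = ["health", "science", "sports"]

counts = cooccurrence_counts(doc_labels)
sports_vec = category_vector("sports", categories, counts, doc_labels)
print(sports_vec)
```

In this toy data, "sports" shares two of its three documents with "health" and none with "science", so a vector of this kind could be used to weight related categories when ranking candidate documents.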




