Auto Classification for Search Intelligence

This paper proposes an auto-classification algorithm for Web pages based on data mining techniques. We consider the problem of discovering association rules between terms in a set of Web pages that belong to the same category in a search engine database, and we present an auto-classification algorithm for this problem that is fundamentally based on the Apriori algorithm. The proposed technique has two phases. The first is a training phase, in which human experts determine the categories of different Web pages and a supervised data mining algorithm associates each category with appropriately weighted index terms, chosen according to the most highly supported rules among the most frequent words. The second is the categorization phase, in which a Web crawler traverses the World Wide Web to build a database categorized according to the results of the data mining phase. This database contains URLs and their categories.
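The two phases can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the toy training pages, the `example.com` URLs, the `min_support` threshold, and the use of raw support as the term weight are all assumptions made for illustration. The sketch mines frequent termsets per category with an Apriori-style join-and-prune search, then assigns a page to the category whose termsets it matches with the greatest total support. Crawling itself is omitted; pages are assumed to arrive as sets of already extracted terms.

```python
from collections import defaultdict
from itertools import combinations


def frequent_termsets(pages, min_support, max_size=2):
    """Apriori-style search over one category's pages.

    pages is a list of term sets. Returns {termset: support}, where support
    is the fraction of pages containing every term of the termset.
    """
    n = len(pages)
    # Level 1 candidates: every single term seen in the category.
    candidates = {frozenset([t]) for page in pages for t in page}
    frequent = {}
    size = 1
    while candidates and size <= max_size:
        level = {}
        for cand in candidates:
            support = sum(1 for page in pages if cand <= page) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join frequent k-sets into (k+1)-set candidates and prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        size += 1
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == size and all(
                frozenset(sub) in level
                for sub in combinations(union, size - 1)
            ):
                candidates.add(union)
    return frequent


def train(labeled_pages, min_support=0.6):
    """Training phase: expert-labeled (terms, category) pairs -> model."""
    by_category = defaultdict(list)
    for terms, category in labeled_pages:
        by_category[category].append(set(terms))
    return {
        category: frequent_termsets(pages, min_support)
        for category, pages in by_category.items()
    }


def classify(terms, model):
    """Score each category by the summed support of its matching termsets.

    Using support as the weight is an assumption of this sketch.
    """
    terms = set(terms)
    scores = {
        category: sum(s for termset, s in rules.items() if termset <= terms)
        for category, rules in model.items()
    }
    return max(scores, key=scores.get)


def categorize(crawled, model):
    """Categorization phase: {url: terms} -> {url: category} database."""
    return {url: classify(terms, model) for url, terms in crawled.items()}


# Hypothetical expert-labeled training pages.
training = [
    ({"football", "league", "goal"}, "sports"),
    ({"football", "goal", "coach"}, "sports"),
    ({"stock", "market", "shares"}, "finance"),
    ({"market", "shares", "bank"}, "finance"),
]
model = train(training, min_support=0.6)

# Hypothetical crawl output: URL -> terms extracted from the page.
crawled = {
    "http://example.com/a": {"goal", "football", "news"},
    "http://example.com/b": {"shares", "market", "report"},
}
database = categorize(crawled, model)
```

In this sketch a page that matches no termset in any category receives the first category in the model by default; a practical system would more likely leave such pages uncategorized or lower the support threshold.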



