Information Retrieval in Domain Specific Search Engine with Machine Learning Approaches

As the web continues to grow exponentially, crawling the entire web on a regular basis becomes less and less feasible, so domain-specific search engines, which cover only the information on a specific domain, were proposed. As more information becomes available on the World Wide Web, it also becomes more difficult to provide effective search tools for information access. Today, people access web information through two main kinds of search interfaces: browsers (clicking and following hyperlinks) and query engines (queries in the form of a set of keywords indicating the topic of interest) [2]. Web search tools need better support for expressing one's information need and for returning high-quality search results. There appears to be a need for systems that reason under uncertainty and are flexible enough to recover from the contradictions, inconsistencies, and irregularities that such reasoning involves. In a multi-view problem, the features of the domain can be partitioned into disjoint subsets (views), each of which is sufficient to learn the target concept. Semi-supervised, multi-view algorithms, which reduce the amount of labeled data required for learning, rely on the assumptions that the views are compatible and uncorrelated. This paper describes the use of a semi-structured machine learning approach with active learning for domain-specific search engines. A domain-specific search engine is "an information access system that allows access to all the information on the web that is relevant to a particular domain." The proposed work shows that, with this approach, relevant data can be extracted with a minimum number of queries issued by the user. It requires only a small amount of labeled data and a pool of unlabeled data, on which the learning algorithm is applied to extract the required data.
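
The idea of learning from a small labeled set plus a pool of unlabeled data can be illustrated with a minimal sketch of pool-based active learning. This is not the method of the paper, whose details the abstract does not give: the naive Bayes relevance classifier, the uncertainty-sampling query rule, and all names (`train_nb`, `prob_relevant`, `active_learn`, `oracle`) are assumptions chosen for illustration. The `oracle` callable stands in for the user who labels a queried document as relevant (1) or not relevant (0).

```python
import math
from collections import Counter


def tokenize(text):
    """Split a document into lowercase whitespace-separated tokens."""
    return text.lower().split()


def train_nb(labeled):
    """Fit a multinomial naive Bayes model on (text, label) pairs, label in {0, 1}."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in labeled:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab


def prob_relevant(model, text):
    """Return the model's estimate of P(relevant | text)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in (0, 1):
        # Laplace-smoothed class prior.
        log_p = math.log((doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                log_p += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = log_p
    # Normalise in log space to avoid underflow on long documents.
    top = max(log_scores.values())
    e0 = math.exp(log_scores[0] - top)
    e1 = math.exp(log_scores[1] - top)
    return e1 / (e0 + e1)


def active_learn(labeled, pool, oracle, n_queries):
    """Pool-based active learning with uncertainty sampling.

    In each round the model is retrained, the pool document whose predicted
    relevance is closest to 0.5 is sent to the oracle for a label, and the
    newly labeled document is added to the training set.
    """
    labeled = list(labeled)
    pool = list(pool)
    for _ in range(min(n_queries, len(pool))):
        model = train_nb(labeled)
        query = min(pool, key=lambda doc: abs(prob_relevant(model, doc) - 0.5))
        pool.remove(query)
        labeled.append((query, oracle(query)))
    return train_nb(labeled), labeled, pool


# Hypothetical toy data: a two-document seed set and a four-document pool.
seed = [("python machine learning tutorial", 1), ("cheap flights to paris", 0)]
pool = [
    "learning python classifiers",
    "paris hotel deals",
    "machine learning flights dataset",
    "weather today",
]
# Simulated user: a document is relevant if it mentions "learning" or "python".
oracle = lambda doc: 1 if ("learning" in doc or "python" in doc) else 0

model, labeled, remaining = active_learn(seed, pool, oracle, n_queries=2)
print(len(labeled), "labeled documents,", len(remaining), "left in the pool")
print(round(prob_relevant(model, "python machine learning"), 3))
```

Querying the document the classifier is least sure about is only one possible selection rule; it is used here because it needs nothing beyond the classifier's own probability estimate.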

References:
[1] LookOff E-book, Engine Basics,
http://www.lookoff.com/tactics/engines.php3 , Oct 24 2000.
[2] M. Jaczynski, B. Trousse, Broadway: A Case-Based System for
Cooperative Information Browsing on the World-Wide-web,
Collaboration between Human and Artificial Societies, pp. 264-283,
1999.
[3] Internet Fact and State, http://optistreams.com/factsandstats15.htm
[4] The Censorware Project, http://www.censorware.org/web_size, Jan. 26,
1999
[5] S. Lawrence and C.L. Giles, Searching the World Wide Web, Science
280:98-100, 1998.
[6] S. Lawrence and C.L. Giles, Accessibility of Information on the Web,
Nature 400:107-109, 1999.
[7] S. Chakrabarti, Data mining for hypertext: A tutorial survey, SIGKDD:
SIGKDD Explorations: Newsletter of the Special Interest Group (SIG)
on Knowledge Discovery & Data Mining, ACM 1(2): 1-11, 2000.
[8] L. Page, S. Brin, The anatomy of a large-scale hypertext web search
engine, Proceedings of the Seventh International World Wide Web
Conference, 1998.
[9] S. Mizzaro, Relevance: The whole history, Journal of the American
Society for Information Science, 48(9): 810-832, 1997.
[10] S. Lawrence, Context in web Search, IEEE Data Engineering Bulletin,
[11] Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data
with co-training. Proc. of the Conference on Computational Learning
Theory (pp. 92-100).
[12] Collins, M., & Singer, Y. (1999). Unsupervised models for named entity
classification. Proc. of the Empirical NLP and Very Large Corpora
Conference (pp. 100-110). de Sa, V., & Ballard, D. (1998).
[13] T. M. Mitchell, Machine Learning, New York: McGraw-Hill, 1997.
[14] S. Chakrabarti, M. van den Berg, and B. Dom, Focused crawling: a new
approach to topic-specific web resource discovery, Proceedings of the
8th International World Wide Web Conference (WWW8), 1999.