Abstract: Large companies such as Google and Microsoft, which
have ample proxy servers, can readily crawl a single website in
parallel. Researchers, however, usually cannot afford commercial
proxy servers, so crawling large amounts of information from a
single website in parallel remains difficult for them. In this
situation, free public proxy servers harvested from the Internet
are a good alternative. In
order to improve the efficiency of a web crawler, two issues must
be addressed first: (1) tasks may fail owing to the
instability of free proxy servers; (2) a proxy server will be blocked
if it visits a single website too frequently. In this paper, we propose
Proxisch, an optimized approach to scheduling large numbers of
unstable proxy servers, which allows anyone to run a web crawler
efficiently at extremely low cost. Proxisch is designed to work
efficiently by making maximum use of reliable proxy servers. To solve
the second problem, it establishes a frequency control mechanism that
keeps the visiting frequency of any chosen proxy server below the
website’s limit. The results show that our approach performs better
than other scheduling algorithms.
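
The two ideas named in this abstract, preferring reliable proxies and capping each proxy's visit frequency, can be illustrated with a minimal sketch. This is not the Proxisch algorithm itself, which the abstract does not specify. The class `ProxyScheduler`, its methods, the smoothed success-rate score, and the fixed `min_interval` are all hypothetical choices made for illustration.

```python
class ProxyScheduler:
    """Hypothetical scheduler: among proxies that are allowed to visit the
    site again, pick the one with the best observed success rate."""

    def __init__(self, proxies, min_interval):
        # min_interval: assumed minimum number of seconds between two visits
        # by the same proxy, chosen to stay below the website's limit.
        self.min_interval = min_interval
        self.stats = {p: {"ok": 0, "fail": 0, "next_ok": 0.0} for p in proxies}

    def reliability(self, proxy):
        s = self.stats[proxy]
        # Smoothed success rate, so an untried proxy starts at 0.5.
        return (s["ok"] + 1) / (s["ok"] + s["fail"] + 2)

    def acquire(self, now):
        """Return the most reliable proxy that may visit at time `now`,
        or None if every proxy is still cooling down."""
        ready = [p for p, s in self.stats.items() if s["next_ok"] <= now]
        if not ready:
            return None
        best = max(ready, key=self.reliability)
        # Frequency control: this proxy is unavailable until the interval passes.
        self.stats[best]["next_ok"] = now + self.min_interval
        return best

    def report(self, proxy, success):
        """Record the outcome of a task so later choices favor stable proxies."""
        self.stats[proxy]["ok" if success else "fail"] += 1
```

A crawler loop would call `acquire(time.monotonic())` before each request, retry the task on another proxy when `report(proxy, False)` is recorded, and sleep briefly when `acquire` returns `None`.
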
Abstract: Weblogs are a resource with social structure that can be used to discover and track the various types of information written by bloggers. In this paper, we propose using weblog mining techniques to identify trends in influenza, an unusual disease about which bloggers have shared their opinions. To identify the trends, a web crawler performs a search and generates a list of visited links based on a set of influenza keywords. This information is used to implement an analytics report system for monitoring and analyzing the patterns and trends of influenza (H1N1). Statistical and graphical analysis reports are generated. Both types of report give satisfactory results that reflect Malaysians' awareness of the influenza outbreak as expressed through blogs.
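
As a rough illustration of the keyword-based trend counting described in this abstract, the sketch below counts crawled posts per month that mention any influenza keyword. The keyword list, the `(date, text)` input shape, and the function name `monthly_trend` are assumptions for the example. The abstract does not describe the actual report system at this level of detail.

```python
from collections import Counter
from datetime import date

# Assumed keyword set; the abstract does not list the keywords actually used.
KEYWORDS = ("influenza", "h1n1", "swine flu")


def monthly_trend(posts, keywords=KEYWORDS):
    """Count posts per (year, month) that mention at least one keyword.

    `posts` is an iterable of (posted_on, text) pairs, where posted_on is a
    datetime.date and text is the crawled blog text.
    """
    counts = Counter()
    for posted_on, text in posts:
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            counts[(posted_on.year, posted_on.month)] += 1
    # Sort by (year, month) so the result reads as a time series.
    return dict(sorted(counts.items()))
```

The resulting mapping can be fed directly into a table or a bar chart, which corresponds to the statistical and graphical reports mentioned in the abstract.
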
Abstract: This paper proposes an auto-classification algorithm
for Web pages using data mining techniques. We consider the
problem of discovering association rules between terms in a set of
Web pages belonging to a category in a search engine database, and
present an auto-classification algorithm for solving this problem that
is fundamentally based on the Apriori algorithm. The proposed
technique has two phases. The first phase is a training phase in which
human experts determine the categories of different Web pages, and
the supervised data mining algorithm combines these categories
with appropriately weighted index terms according to the most highly
supported rules among the most frequent words. The second phase is
the categorization phase, in which a web crawler crawls the
World Wide Web to build a database categorized according to the
result of the data mining approach. This database contains URLs and
their categories.
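
The two phases described in this abstract can be sketched as follows, under stated assumptions: the training phase mines frequent term sets per category in the Apriori style, and the categorization phase assigns a page to the category whose term sets it matches best. Using the support value as the weight, limiting itemsets to two terms, and the names `frequent_itemsets` and `classify` are illustrative choices, not the paper's actual weighting scheme.

```python
from itertools import combinations


def frequent_itemsets(pages, min_support, max_size=2):
    """Apriori-style search over `pages`, a list of term sets.

    Returns {frozenset_of_terms: support} for every itemset whose support
    (fraction of pages containing all its terms) is at least min_support.
    """
    n = len(pages)
    current = {frozenset([term]) for page in pages for term in page}
    result = {}
    size = 1
    while current and size <= max_size:
        kept = {}
        for candidate in current:
            support = sum(1 for page in pages if candidate <= page) / n
            if support >= min_support:
                kept[candidate] = support
        result.update(kept)
        # Join step: build larger candidates only from surviving terms, and
        # prune any candidate that has an infrequent subset.
        terms = sorted({term for itemset in kept for term in itemset})
        size += 1
        current = {
            frozenset(combo)
            for combo in combinations(terms, size)
            if all(frozenset(sub) in kept for sub in combinations(combo, size - 1))
        }
    return result


def classify(page_terms, rules_by_category):
    """Pick the category whose frequent itemsets best cover page_terms,
    weighting each matched itemset by its support (an assumed weighting)."""
    def score(rules):
        return sum(sup for itemset, sup in rules.items() if itemset <= page_terms)
    return max(rules_by_category, key=lambda cat: score(rules_by_category[cat]))
```

In use, `frequent_itemsets` would be run once per expert-labelled category, and a crawler would then call `classify` on the term set of each fetched page and store the URL with the returned category.
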
Abstract: The explosive growth of the World Wide Web has made
extracting relevant data a challenging problem. Traditional web
crawlers focus only on the surface web, while the deep web keeps
expanding behind the scenes. Deep web pages are created
dynamically as a result of queries posed to specific web databases.
The structure of deep web pages makes it impossible for
traditional web crawlers to access deep web content. This paper
presents Deep iCrawl, a novel vision-based approach for extracting
data from the deep web. Deep iCrawl splits the process into two
phases. The first phase includes query analysis and query translation,
and the second covers vision-based extraction of data from the
dynamically created deep web pages. Several established approaches
exist for extracting deep web pages, but the proposed
method aims to overcome their inherent limitations.
This paper also aims to compare the data items and present them
in the required order.