Deep iCrawl: An Intelligent Vision-Based Deep Web Crawler
The explosive growth of the World Wide Web has posed
a challenging problem in extracting relevant data. Traditional web
crawlers focus only on the surface web, while the deep web keeps
expanding behind the scenes. Deep web pages are created
dynamically as a result of queries posed to specific web databases,
and their structure makes it impossible for
traditional web crawlers to access deep web content. This paper
presents Deep iCrawl, a novel vision-based approach for extracting
data from the deep web. Deep iCrawl splits the process into two
phases: the first covers query analysis and query translation,
and the second covers vision-based extraction of data from the
dynamically created deep web pages. Several established
approaches exist for the extraction of deep web pages, but the proposed
method aims at overcoming their inherent limitations.
This paper also compares the extracted data items and presents them
in the required order.
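The two-phase pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names (`translate_query`, `extract_by_position`, `DataItem`) and the coordinate-based ordering heuristic are assumptions standing in for the paper's query-translation and vision-based extraction steps.

```python
from dataclasses import dataclass

@dataclass
class DataItem:
    label: str  # field name inferred for the rendered block (illustrative)
    value: str  # extracted text content
    x: int      # left pixel coordinate of the block on the result page
    y: int      # top pixel coordinate of the block on the result page

def translate_query(user_query: dict, form_fields: list) -> dict:
    """Phase 1 (sketch): map a user's query onto the fields a web
    database's search form actually exposes."""
    return {f: user_query[f] for f in form_fields if f in user_query}

def extract_by_position(items: list) -> list:
    """Phase 2 (sketch): order extracted blocks top-to-bottom, then
    left-to-right, mimicking a visual-layout pass over the rendered
    deep web result page."""
    return sorted(items, key=lambda it: (it.y, it.x))

# Usage: translate a query, then order two blocks found at y=40.
form = translate_query({"title": "databases", "author": "Smith"},
                       ["title", "author", "year"])
blocks = [DataItem("price", "$12", x=300, y=40),
          DataItem("title", "DB Systems", x=10, y=40)]
ordered = extract_by_position(blocks)
```

Under this sketch, `form` keeps only the fields the form supports, and `ordered` places the leftmost block first within the same row.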
[1] L. Liu, C. Pu and W. Han, "XWRAP: An XML-Enabled Wrapper
Construction System for Web Information Sources"
[2] A. Sahuguet and F. Azavant, "Building Intelligent Web Applications
Using Lightweight Wrappers"
[3] V. Crescenzi, G. Mecca and P. Merialdo, "RoadRunner: Towards
Automatic Data Extraction from Large Web Sites", Proceedings of the
27th VLDB Conference, 2001
[4] B. Liu, R. L. Grossman and Y. Zhai, "Mining Data Records in Web Pages",
SIGKDD '03, August 24-27, 2003, Washington, DC, USA
[5] D. Cai, S. Yu, J. Wen and W. Ma, "Extracting Content Structure for Web
Pages Based on Visual Representation"
[6] W. Liu and X. Meng, "ViDE: A Vision-Based Approach for Deep Web
Data Extraction", IEEE Transactions on Knowledge and Data
Engineering, Vol. 22, No. 3, March 2010
[7] J. Madhavan, D. Ko, Ł. Kot, V. Ganapathy, A. Rasmussen and
A. Halevy, "Google's Deep-Web Crawl", PVLDB
'08, August 23-28, 2008, Auckland, New Zealand
@article{Anita:51500,
  author   = "R. Anita and V. Ganga Bharani and N. Nityanandam and Pradeep Kumar Sahoo",
  title    = "Deep iCrawl: An Intelligent Vision-Based Deep Web Crawler",
  journal  = "International Journal of Information, Control and Computer Sciences",
  abstract = "The explosive growth of the World Wide Web has posed
a challenging problem in extracting relevant data. Traditional web
crawlers focus only on the surface web, while the deep web keeps
expanding behind the scenes. Deep web pages are created
dynamically as a result of queries posed to specific web databases,
and their structure makes it impossible for
traditional web crawlers to access deep web content. This paper
presents Deep iCrawl, a novel vision-based approach for extracting
data from the deep web. Deep iCrawl splits the process into two
phases: the first covers query analysis and query translation,
and the second covers vision-based extraction of data from the
dynamically created deep web pages. Several established
approaches exist for the extraction of deep web pages, but the proposed
method aims at overcoming their inherent limitations.
This paper also compares the extracted data items and presents them
in the required order.",
  keywords = "Crawler, Deep web, Web Database",
  volume   = "5",
  number   = "2",
  pages    = "124-6",
}