Extraction of Data from Web Pages: A Vision Based Approach

With the explosive growth of information sources on the World Wide Web, it has become increasingly difficult to identify relevant information, because web pages are often cluttered with irrelevant content such as advertisements, navigation panels, and copyright notices surrounding the main content. Tools for mining data regions, data records, and data items are therefore needed in order to provide value-added services. Currently available automatic techniques for mining data regions from web pages remain unsatisfactory because of their poor performance and their dependence on specific tags. This paper proposes a novel method for automatically extracting data items from web pages. It comprises two steps: (1) identification and extraction of data regions based on visual clues, and (2) identification of data records and extraction of data items from a data region. For step 1, a novel and more effective method is proposed that uses visual clues to find data regions formed by all types of tags. For step 2, a more effective method, namely Extraction of Data Items from web Pages (EDIP), is adopted to mine data items. EDIP is a list-based approach in which the list is a linear data structure. The proposed technique can mine non-contiguous data records and can correctly identify data regions irrespective of the type of tag in which they are bound. Our experimental results show that the proposed technique performs better than the existing techniques.
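The two-step pipeline can be illustrated with a minimal sketch. The abstract does not specify the visual clues or the internals of EDIP, so everything below is an assumption made for illustration: the `Block` model, the "largest child covering more than half of its parent" heuristic for step 1, and the flat-list traversal for step 2 are hypothetical stand-ins, not the method evaluated in the paper. The sketch also handles only contiguous records.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """A rendered page block with its bounding box (hypothetical model)."""
    tag: str
    width: int
    height: int
    text: str = ""
    children: List["Block"] = field(default_factory=list)

    @property
    def area(self) -> int:
        return self.width * self.height


def find_data_region(root: Block) -> Block:
    """Step 1 (assumed heuristic): follow the visually dominant child.

    Descend while one child covers more than half of the current block.
    Stop when no child dominates, which is what a run of similarly
    sized records looks like. Tag names are never inspected.
    """
    current = root
    while current.children:
        largest = max(current.children, key=lambda b: b.area)
        if largest.area * 2 <= current.area:
            break
        current = largest
    return current


def extract_items(region: Block) -> List[List[str]]:
    """Step 2 (assumed): treat each child as a record and collect its
    leaf texts, in document order, into a plain linear list."""
    records = []
    for record in region.children:
        items: List[str] = []
        stack = [record]
        while stack:
            node = stack.pop()
            if node.children:
                # Reverse so the leftmost child is popped first.
                stack.extend(reversed(node.children))
            elif node.text:
                items.append(node.text)
        records.append(items)
    return records


def make_record(tag: str, title: str, price: str) -> Block:
    return Block(tag, 1000, 200, children=[
        Block("b", 700, 200, text=title),
        Block("i", 300, 200, text=price),
    ])


# A toy page: navigation, a main region whose records use mixed tags, a footer.
page = Block("body", 1000, 1000, children=[
    Block("ul", 1000, 100, children=[
        Block("li", 500, 100, text="Home"),
        Block("li", 500, 100, text="About"),
    ]),
    Block("div", 1000, 800, children=[
        make_record("p", "Book A", "$10"),
        make_record("tr", "Book B", "$12"),
        make_record("li", "Book C", "$8"),
    ]),
    Block("p", 1000, 100, text="Copyright"),
])

region = find_data_region(page)
records = extract_items(region)
```

In this toy page the records are wrapped in three different tags, yet the region is found from layout alone, which is the property the abstract attributes to the visual approach. A real implementation would need rendered coordinates from a browser engine and a stronger similarity test between candidate records.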




