Information Extraction from Unstructured and Ungrammatical Data Sources for Semantic Annotation

The internet has become an attractive avenue for global e-business, e-learning, and knowledge sharing. Because the volume of web content keeps growing, it is no longer practical for a user to extract information by browsing and manually integrating data from the large number of web sources returned by existing search engines. Semantic web technology advances information extraction by providing a suite of tools for integrating data from different sources. To take full advantage of the semantic web, existing web pages must first be annotated so that they become semantic web pages. This research develops a tool, named OWIE (Ontology-based Web Information Extraction), for semantic web annotation using domain-specific ontologies. The tool automatically extracts information from HTML pages with the help of pre-defined ontologies and gives the extracted information a semantic representation. Two case studies have been conducted to analyze the accuracy of OWIE.
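The abstract does not describe how OWIE works internally, so the following is only a minimal sketch of the general idea of ontology-guided extraction, not the tool's actual method. It assumes a toy "ontology" that maps concept names to regular expressions and emits (subject, concept, value) triples as a stand-in for a semantic representation. The names ONTOLOGY, TextCollector, and extract_triples, as well as the sample page, are hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical toy "ontology": each concept name maps to a regular
# expression that recognizes values of that concept in plain text.
ONTOLOGY = {
    "price": re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d{2})?)"),
    "year": re.compile(r"\b((?:19|20)\d{2})\b"),
}


class TextCollector(HTMLParser):
    """Collects the non-empty text fragments of an HTML page, dropping tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def extract_triples(html, subject):
    """Return (subject, concept, value) triples for every ontology match."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)

    triples = []
    for concept, pattern in ONTOLOGY.items():
        for match in pattern.finditer(text):
            triples.append((subject, concept, match.group(1)))
    return triples


# Illustrative input only: a short, ungrammatical classified-ad style page.
page = "<html><body><h1>Honda Civic</h1><p>2004 model, <b>$4,500</b> obo</p></body></html>"
triples = extract_triples(page, "ad1")
for triple in triples:
    print(triple)
```

A real system would presumably replace the regular expressions with richer ontology definitions (for example in OWL) and serialize the result as RDF; those choices are outside what the abstract states.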



