Experiments on Element and Document Statistics for XML Retrieval

This paper presents an information retrieval model for XML documents based on tree matching. Queries and documents are represented by extended trees. An extended tree is built from the original tree by adding weighted virtual links between each node and its indirect descendants, so that every descendant can be reached directly. Only one level therefore separates each node from its indirect descendants. This makes it possible to compare the user query with the document flexibly while still respecting the structural constraints of the query. The content of each node is essential for deciding whether a document element is relevant, so it must be taken into account in the retrieval process. We separate the structure-based and the content-based retrieval processes. The content-based score of each node is commonly based on the well-known Tf × Idf criterion. In this paper, we compare this criterion with another one that we call Tf × Ief. The comparison is based on experiments on a dataset provided by INEX, to show the effectiveness of our approach on the one hand and of the two weighting functions on the other.
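The two ideas above, virtual links to indirect descendants and the choice between document-level and element-level statistics, can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: Ief is read as an inverse element frequency (suggested by the title), virtual link weights are assumed to decay by a constant factor per extra level, and the names `Node`, `extended_links`, `decay`, `idf` and `ief` are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def extended_links(node, decay=0.5):
    """Return (descendant, weight) pairs for every descendant of node.

    Direct children keep weight 1.0; each extra level multiplies the
    weight by `decay`. This weighting scheme is an assumption.
    """
    links = []
    frontier = [(child, 1.0) for child in node.children]
    while frontier:
        current, weight = frontier.pop()
        links.append((current, weight))
        frontier.extend((c, weight * decay) for c in current.children)
    return links


def iter_nodes(root):
    """Yield root and all of its descendants."""
    yield root
    for child in root.children:
        yield from iter_nodes(child)


def tf(term, node):
    """Raw count of term in the node's own text."""
    return Counter(node.text.lower().split())[term]


def idf(term, documents):
    """Document-level statistic: log(N_docs / docs containing term)."""
    df = sum(any(tf(term, n) > 0 for n in iter_nodes(d)) for d in documents)
    return math.log(len(documents) / df) if df else 0.0


def ief(term, documents):
    """Element-level statistic (assumed): log(N_elems / elems containing term)."""
    elements = [n for d in documents for n in iter_nodes(d)]
    ef = sum(tf(term, n) > 0 for n in elements)
    return math.log(len(elements) / ef) if ef else 0.0


doc1 = Node("article", children=[
    Node("sec", children=[Node("p", "xml retrieval"), Node("p", "tree matching")]),
])
doc2 = Node("article", children=[Node("sec", children=[Node("p", "xml query")])])
docs = [doc1, doc2]

link_weights = sorted((n.tag, w) for n, w in extended_links(doc1))
idf_xml, ief_xml = idf("xml", docs), ief("xml", docs)
```

In this toy collection the term "xml" occurs in both documents, so the document-level factor is zero, while the element-level factor stays positive because only two of the seven elements contain the term. This is one way the two statistics can diverge; it says nothing about which performs better on the INEX data.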



