Application of a Similarity Measure for Graphs to Web-based Document Structures
Due to the tremendous amount of information provided
by the World Wide Web (WWW) developing methods for mining
the structure of web-based documents is of considerable interest. In
this paper we present a similarity measure for graphs representing
web-based hypertext structures. Our similarity measure is mainly
based on a novel representation of a graph as linear integer strings,
whose components represent structural properties of the graph. The
similarity of two graphs is then defined as the optimal alignment of
the underlying property strings. In this paper we apply the well known
technique of sequence alignments for solving a novel and challenging
problem: Measuring the structural similarity of generalized trees.
In other words: We first transform our graphs considered as high
dimensional objects in linear structures. Then we derive similarity
values from the alignments of the property strings in order to
measure the structural similarity of generalized trees. Hence, we
transform a graph similarity problem to a string similarity problem for
developing a efficient graph similarity measure. We demonstrate that
our similarity measure captures important structural information by
applying it to two different test sets consisting of graphs representing
web-based document structures.
[1] R. Bellman, Dynamic Programming. Princeton University Press, 1957
[2] R. A. Botafogo, B. Shneiderman: Structural analysis of hypertexts:
Identifying hierarchies and useful metrics, ACM Trans. Inf. Syst. 10
(2), 1992, 142-180
[3] S. Chakrabarti: Mining the Web. Discovering Knowledge from Hypertext
Data, Morgen and Kaufmann Publishers, 2003
[4] S. Chakrabarti: Integrating the document object model with hyperlinks
for enhanced topic distillation and information extraction, Proc. of the
10th International World Wide Web Conference, Hong Kong, 2001, 211-
220
[5] I. F. Cruz, S. Borisov, M. A. Marks, T. R. Webb: Measuring Structural
Similarity Among Web Documents: Preliminary Results , Lecture Notes
In Computer Science, Vol. 1375, 1998
[6] M. Dehmer, Strukturelle Analyse web-basierter Dokumente, Ph.D Thesis,
Department of Computer Science, Technische Universit¨at Darmstadt,
2005, unpublished
[7] R. Gleim: HyGraph - Ein Framework zur Extraktion, Repr¨asentation
und Analyse webbasierter Hypertextstrukturen, Beitrage zur GLDVTagung
2005, Bonn/Germany, 2005
[8] D. Gusfield: Algorithms on Strings, Trees, and Sequences: Computer
Science and Computational Biology, Cambridge University Press, 1997
[9] T. Jiang, L. Wang, K. Zhang: Alignment of trees - An alternative to tree
edit, Theoretical Computer Science, Elsevier, Vol. 143, 1995, 137-148
[10] S. Joshi, N. Agrawal, R. Krishnapuram, S. Negi,: Bag of Paths Model
for Measuring Structural Similarity in Web Documents, Proceedings of
the ACM International Conference on Knowledge Discovery and Data
Mining (SIGKDD), 2003, 577-582.
[11] A. Mehler, M. Dehmer, R. Gleim: Towards logical hypertext structure.
A graph-theoretic perspective, Proc. of I2CS-04, Guadalajara/Mexico,
Lecture Notes in Computer Science, Berlin-New York: Springer, 2004
[12] A. Mehler, R. Gleim, M. Dehmer: Towards structure-sensitive hypertext
categorization, to appear in: Proceedings of the 29-th Annual Conference
of the German Classification Society, 2005
[13] S. M. Selkow: The tree-to-tree editing problem, Information Processing
Letters, Vol. 6 (6), 1977, 184-186
[14] T. F. Smith, M. S. Waterman: Identification of common molecular
subsequences, Journal of Molecular Biology, Vol. 147 (1), 1981, 195-
197
[15] F. Sobik, Graphmetriken und Klassifikation strukturierter Objekte, ZKIInformationen,
Akad. Wiss. DDR, Vol. 2 (82), 1982, 63-122
[16] J. R. Ullman, An algorithm for subgraph isomorphism, J. ACM, Vol. 23
(1), 1976, 31-42
[17] P. H. Winne., L. Gupta, J. C. Nesbit: Exploring individual differences in
studying strategies using graph theoretic statistics, The Alberta Journal
of Educational Research, Vol. 40, 177-193, 1994
[18] A. Winter: Exchanching Graphs with GXL, http://www.gupro.
de/GXL
[19] K. Zhang, D. Shasha: Simple fast algorithms for the editing distance
between trees and related problems, SIAM Journal of Computing, Vol.
18 (6), 1989, 1245-1262
[20] B. Zelinka, On a certain distance between isomorphism classes of
graphs, ˇCasopis pro ˇpest. Mathematiky, Vol. 100, 1975, 371-373
[1] R. Bellman, Dynamic Programming. Princeton University Press, 1957
[2] R. A. Botafogo, B. Shneiderman: Structural analysis of hypertexts:
Identifying hierarchies and useful metrics, ACM Trans. Inf. Syst. 10
(2), 1992, 142-180
[3] S. Chakrabarti: Mining the Web. Discovering Knowledge from Hypertext
Data, Morgen and Kaufmann Publishers, 2003
[4] S. Chakrabarti: Integrating the document object model with hyperlinks
for enhanced topic distillation and information extraction, Proc. of the
10th International World Wide Web Conference, Hong Kong, 2001, 211-
220
[5] I. F. Cruz, S. Borisov, M. A. Marks, T. R. Webb: Measuring Structural
Similarity Among Web Documents: Preliminary Results , Lecture Notes
In Computer Science, Vol. 1375, 1998
[6] M. Dehmer, Strukturelle Analyse web-basierter Dokumente, Ph.D Thesis,
Department of Computer Science, Technische Universit¨at Darmstadt,
2005, unpublished
[7] R. Gleim: HyGraph - Ein Framework zur Extraktion, Repr¨asentation
und Analyse webbasierter Hypertextstrukturen, Beitrage zur GLDVTagung
2005, Bonn/Germany, 2005
[8] D. Gusfield: Algorithms on Strings, Trees, and Sequences: Computer
Science and Computational Biology, Cambridge University Press, 1997
[9] T. Jiang, L. Wang, K. Zhang: Alignment of trees - An alternative to tree
edit, Theoretical Computer Science, Elsevier, Vol. 143, 1995, 137-148
[10] S. Joshi, N. Agrawal, R. Krishnapuram, S. Negi,: Bag of Paths Model
for Measuring Structural Similarity in Web Documents, Proceedings of
the ACM International Conference on Knowledge Discovery and Data
Mining (SIGKDD), 2003, 577-582.
[11] A. Mehler, M. Dehmer, R. Gleim: Towards logical hypertext structure.
A graph-theoretic perspective, Proc. of I2CS-04, Guadalajara/Mexico,
Lecture Notes in Computer Science, Berlin-New York: Springer, 2004
[12] A. Mehler, R. Gleim, M. Dehmer: Towards structure-sensitive hypertext
categorization, to appear in: Proceedings of the 29-th Annual Conference
of the German Classification Society, 2005
[13] S. M. Selkow: The tree-to-tree editing problem, Information Processing
Letters, Vol. 6 (6), 1977, 184-186
[14] T. F. Smith, M. S. Waterman: Identification of common molecular
subsequences, Journal of Molecular Biology, Vol. 147 (1), 1981, 195-
197
[15] F. Sobik, Graphmetriken und Klassifikation strukturierter Objekte, ZKIInformationen,
Akad. Wiss. DDR, Vol. 2 (82), 1982, 63-122
[16] J. R. Ullman, An algorithm for subgraph isomorphism, J. ACM, Vol. 23
(1), 1976, 31-42
[17] P. H. Winne., L. Gupta, J. C. Nesbit: Exploring individual differences in
studying strategies using graph theoretic statistics, The Alberta Journal
of Educational Research, Vol. 40, 177-193, 1994
[18] A. Winter: Exchanching Graphs with GXL, http://www.gupro.
de/GXL
[19] K. Zhang, D. Shasha: Simple fast algorithms for the editing distance
between trees and related problems, SIAM Journal of Computing, Vol.
18 (6), 1989, 1245-1262
[20] B. Zelinka, On a certain distance between isomorphism classes of
graphs, ˇCasopis pro ˇpest. Mathematiky, Vol. 100, 1975, 371-373
@article{"International Journal of Engineering, Mathematical and Physical Sciences:64297", author = "Matthias Dehmer and Frank Emmert Streib and Alexander Mehler and Jürgen Kilian and Max Mühlhauser", title = "Application of a Similarity Measure for Graphs to Web-based Document Structures", abstract = "Due to the tremendous amount of information provided
by the World Wide Web (WWW) developing methods for mining
the structure of web-based documents is of considerable interest. In
this paper we present a similarity measure for graphs representing
web-based hypertext structures. Our similarity measure is mainly
based on a novel representation of a graph as linear integer strings,
whose components represent structural properties of the graph. The
similarity of two graphs is then defined as the optimal alignment of
the underlying property strings. In this paper we apply the well known
technique of sequence alignments for solving a novel and challenging
problem: Measuring the structural similarity of generalized trees.
In other words: We first transform our graphs considered as high
dimensional objects in linear structures. Then we derive similarity
values from the alignments of the property strings in order to
measure the structural similarity of generalized trees. Hence, we
transform a graph similarity problem to a string similarity problem for
developing a efficient graph similarity measure. We demonstrate that
our similarity measure captures important structural information by
applying it to two different test sets consisting of graphs representing
web-based document structures.", keywords = "Graph similarity, hierarchical and directed graphs,hypertext, generalized trees, web structure mining.", volume = "1", number = "8", pages = "390-5", }