Towards Clustering of Web-based Document Structures
Methods for organizing web data into groups, in order
to analyze web-based hypertext data and facilitate data availability,
are very important given the large number of documents available
online. The task of clustering web-based document structures
has many applications, e.g., improving information retrieval on the
web, better understanding user navigation behavior, improving the
servicing of web users' requests, and increasing web information accessibility.
In this paper we investigate a new approach for clustering web-based
hypertexts on the basis of their graph structures. The hypertexts are
represented as so-called generalized trees, which are more general
than the usual directed rooted trees, e.g., DOM trees. As an important
preprocessing step, we measure the structural similarity between the
generalized trees on the basis of a similarity measure d. Then
we apply agglomerative clustering to the resulting similarity matrix
in order to create clusters of hypertext graph patterns representing
navigation structures. We run our approach
on a data set of hypertext structures and obtain good results in
Web Structure Mining. Furthermore, we outline the application of
our approach to Web Usage Mining as future work.
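
The clustering step described above can be illustrated with a minimal sketch. It assumes that the pairwise similarities are already available as a symmetric matrix with values in [0, 1]; the measure d itself is not reproduced here. The average-linkage criterion, the stopping threshold, the function names, and the example values are all assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations


def average_similarity(sim, cluster_a, cluster_b):
    """Average linkage: mean pairwise similarity between two clusters."""
    total = sum(sim[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_clustering(sim, threshold):
    """Merge the most similar pair of clusters until no pair reaches `threshold`.

    `sim` is a symmetric matrix of pairwise similarities (hypothetical input).
    Returns a sorted list of clusters, each a sorted list of item indices.
    """
    # Start with every item in its own cluster.
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        (a, b), best = max(
            (((a, b), average_similarity(sim, clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break  # the remaining clusters are too dissimilar to merge
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Made-up similarities for five hypertext graphs (illustration only).
similarity = [
    [1.0, 0.9, 0.8, 0.2, 0.1],
    [0.9, 1.0, 0.7, 0.3, 0.2],
    [0.8, 0.7, 1.0, 0.1, 0.2],
    [0.2, 0.3, 0.1, 1.0, 0.85],
    [0.1, 0.2, 0.2, 0.85, 1.0],
]

clusters = agglomerative_clustering(similarity, threshold=0.5)
print(clusters)  # → [[0, 1, 2], [3, 4]]
```

With the threshold set to 0.5, graphs 0, 1 and 2 end up in one cluster and graphs 3 and 4 in another. A lower threshold merges further, and a threshold above every similarity value leaves each graph in its own cluster.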
@article{"International Journal of Information, Control and Computer Sciences:59711",
  author = "Matthias Dehmer and Frank Emmert-Streib and Jürgen Kilian and Andreas Zulauf",
  title = "Towards Clustering of Web-based Document Structures",
  abstract = "Methods for organizing web data into groups, in order
to analyze web-based hypertext data and facilitate data availability,
are very important given the large number of documents available
online. The task of clustering web-based document structures
has many applications, e.g., improving information retrieval on the
web, better understanding user navigation behavior, improving the
servicing of web users' requests, and increasing web information accessibility.
In this paper we investigate a new approach for clustering web-based
hypertexts on the basis of their graph structures. The hypertexts are
represented as so-called generalized trees, which are more general
than the usual directed rooted trees, e.g., DOM trees. As an important
preprocessing step, we measure the structural similarity between the
generalized trees on the basis of a similarity measure d. Then
we apply agglomerative clustering to the resulting similarity matrix
in order to create clusters of hypertext graph patterns representing
navigation structures. We run our approach
on a data set of hypertext structures and obtain good results in
Web Structure Mining. Furthermore, we outline the application of
our approach to Web Usage Mining as future work.",
  keywords = "Clustering methods, graph-based patterns, graph similarity, hypertext structures, web structure mining",
  volume = "1",
  number = "10",
  pages = "3208-6",
}