Web usage mining has become a popular research
area, as a huge amount of data is available online. These data can be
used for several purposes, such as web personalization, web structure
enhancement, and web navigation prediction. However, the raw log
files are not directly usable; they have to be preprocessed in order to
transform them into a suitable format for different data mining tasks.
One of the key issues in the preprocessing phase is to identify web
users. Identifying users based on web log files is not a
straightforward problem, so various methods have been developed.
Several difficulties have to be overcome, such as client-side
caching and changing or shared IP addresses. This paper
presents three different methods for identifying web users. Two of
them are the most commonly used methods in web log mining
systems, whereas the third one is our novel approach that uses a
complex cookie-based method to identify web users. Furthermore, we
also take steps towards identifying the individuals behind the
impersonal web users. To demonstrate the efficiency of the new
method, we developed an implementation called the Web Activity
Tracking (WAT) system, which aims at a more precise distinction of
web users based on log data. We present statistical analyses
produced by WAT on real data about the behavior of Hungarian
web users, along with a comprehensive analysis and comparison of the
three methods.
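
To make the identification problem concrete, the following is a minimal sketch of two heuristics that are common in the web log mining literature: treating each IP address as one user, and treating each (IP address, user agent) pair as one user. The abstract does not state which two conventional methods the paper evaluates, so this choice is an assumption, and the cookie-based WAT method is not reproduced here. The log lines, the Combined Log Format layout, and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed input layout (Combined Log Format):
# host ident authuser [date] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def group_by_ip(records):
    """Heuristic 1: every distinct IP address is treated as one user."""
    users = defaultdict(list)
    for rec in records:
        users[rec["ip"]].append(rec["request"])
    return dict(users)


def group_by_ip_and_agent(records):
    """Heuristic 2: every distinct (IP, user agent) pair is treated as one user."""
    users = defaultdict(list)
    for rec in records:
        users[(rec["ip"], rec["agent"])].append(rec["request"])
    return dict(users)


# Hypothetical log lines: two browsers share one IP address (e.g. behind a proxy).
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2006:13:55:36 +0200] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows)"',
    '192.0.2.10 - - [10/Oct/2006:13:56:01 +0200] "GET /news.html HTTP/1.1" 200 1045 "/index.html" "Mozilla/5.0 (Windows)"',
    '192.0.2.10 - - [10/Oct/2006:13:56:20 +0200] "GET /index.html HTTP/1.1" 200 2326 "-" "Opera/9.0 (X11; Linux)"',
    '198.51.100.7 - - [10/Oct/2006:14:01:12 +0200] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows)"',
]

records = [r for r in map(parse_line, SAMPLE_LOG) if r is not None]
by_ip = group_by_ip(records)
by_ip_agent = group_by_ip_and_agent(records)
print(len(by_ip), "users by IP;", len(by_ip_agent), "users by IP + user agent")
```

On this sample the IP-only heuristic merges the two browsers behind 192.0.2.10 into a single user, while the (IP, user agent) heuristic separates them. Neither can tell apart two people with identical browsers behind the same address, which is the kind of shared-address difficulty the abstract mentions.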
@article{"International Journal of Information, Control and Computer Sciences:49525", author = "Renáta Iváncsy and Sándor Juhász", title = "Analysis of Web User Identification Methods", abstract = "Web usage mining has become a popular research
area, as a huge amount of data is available online. These data can be
used for several purposes, such as web personalization, web structure
enhancement, and web navigation prediction. However, the raw log
files are not directly usable; they have to be preprocessed in order to
transform them into a suitable format for different data mining tasks.
One of the key issues in the preprocessing phase is to identify web
users. Identifying users based on web log files is not a
straightforward problem, so various methods have been developed.
Several difficulties have to be overcome, such as client-side
caching and changing or shared IP addresses. This paper
presents three different methods for identifying web users. Two of
them are the most commonly used methods in web log mining
systems, whereas the third one is our novel approach that uses a
complex cookie-based method to identify web users. Furthermore, we
also take steps towards identifying the individuals behind the
impersonal web users. To demonstrate the efficiency of the new
method, we developed an implementation called the Web Activity
Tracking (WAT) system, which aims at a more precise distinction of
web users based on log data. We present statistical analyses
produced by WAT on real data about the behavior of Hungarian
web users, along with a comprehensive analysis and comparison of the
three methods.", keywords = "Data preparation, Tracking individuals, Web user identification, Web usage mining", volume = "1", number = "10", pages = "2912-8", }