Analysis of Web User Identification Methods

Web usage mining has become a popular research area because a huge amount of data is available online. These data can be used for several purposes, such as web personalization, web structure enhancement, and web navigation prediction. However, raw log files are not directly usable; they must be preprocessed into a format suitable for different data mining tasks. One of the key issues in the preprocessing phase is identifying web users. Identifying users from web log files is not straightforward, so various methods have been developed. Several difficulties have to be overcome, such as client-side caching and changing or shared IP addresses. This paper presents three methods for identifying web users. Two of them are the methods most commonly used in web log mining systems, whereas the third one is our novel approach, which uses a complex cookie-based method. Furthermore, we take steps towards identifying the individuals behind the impersonal web users. To demonstrate the efficiency of the new method, we developed an implementation called the Web Activity Tracking (WAT) system, which aims at a more precise distinction of web users based on log data. We present statistical analysis produced by WAT on real data about the behavior of Hungarian web users, together with a comprehensive analysis and comparison of the three methods.



