Feature Selection for Web Page Classification Using Swarm Optimization

The web’s growing popularity has produced a huge
amount of information, so automated web page
classification systems are essential for improving search engines’
performance. Web pages have many features, such as HTML or XML
tags, hyperlinks, URLs and text content, that can be considered
during automated classification. Hyperlinks are known to enhance
web page classification because they reflect the linkage between
web pages. The aim of this study is to reduce the number of features
used, in order to improve the accuracy of web page classification. In
this paper, a novel feature selection method is proposed, based on an
improved Particle Swarm Optimization (PSO) that uses the principle of
evolution. The extracted features were tested on the WebKB dataset
using a parallel neural network to reduce the computational cost.
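
To make the idea of PSO-based feature selection concrete, the following is a minimal sketch of a standard binary PSO with an inertia weight. It is not the improved, evolution-based variant proposed in this paper, whose details are not given in the abstract. Each particle is a 0/1 mask over the features, and a sigmoid of the velocity gives the probability that a bit is set. The function names, the parameter values, and the toy fitness function are all assumptions made for illustration; in practice the fitness would more plausibly be a classifier's validation accuracy on the selected features.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=20, n_iters=50,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Search for a 0/1 feature mask that maximizes fitness(mask).

    Illustrative sketch only: parameter defaults are assumed, not
    taken from the paper.
    """
    rng = random.Random(seed)
    # Position = feature mask: 1 keeps the feature, 0 drops it.
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_score = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_score[i])
    gbest, gbest_score = pbest[g][:], pbest_score[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Pull toward the particle's own best and the swarm's best.
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                v = max(-v_max, min(v_max, v))  # clamp velocity
                vel[i][d] = v
                # Sigmoid(v) is the probability that this bit becomes 1.
                prob_one = 1.0 / (1.0 + math.exp(-v))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            score = fitness(pos[i])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
                if score > gbest_score:
                    gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score


# Hypothetical stand-in for classifier accuracy: features 0, 3 and 5 are
# "relevant", and every selected feature carries a small cost.
RELEVANT = {0, 3, 5}


def toy_fitness(mask):
    hits = sum(1 for d, bit in enumerate(mask) if bit and d in RELEVANT)
    return hits - 0.1 * sum(mask)


mask, score = binary_pso(toy_fitness, n_features=8)
print(mask, score)
```

The per-feature cost in the toy fitness mirrors the stated aim of the study: a mask is rewarded for keeping useful features and penalized for keeping many, so the search is pushed toward smaller feature sets.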




