Image Spam Detection Using Color Features and K-Nearest Neighbor Classification

Image spam is a kind of email spam in which the spam
text is embedded in an image. It is a relatively new technique
that spammers use to send their messages to large numbers of internet
users. Spam email has become a major problem for internet
users, costing them time and causing economic losses. The main
objective of this paper is to detect image spam using the histogram
properties of an image. Although many techniques exist to
automatically detect and filter image spam, spammers keep employing
new tricks to bypass them, and as a result those techniques become
ineffective at detecting spam mail. In this paper we propose a
new method to detect image spam. Here the image features are
extracted using an RGB histogram, an HSV histogram, and a combination
of both RGB and HSV histograms. Based on the optimized image
feature set, classification is performed using the k-Nearest Neighbor
(k-NN) algorithm. Experimental results show that our method achieves
better accuracy. The results indicate that the combination of RGB
and HSV histograms with the k-NN algorithm gives the best accuracy in
spam detection.
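
The pipeline described above (color histogram features followed by k-NN
classification) can be illustrated with a minimal sketch. It is not the
paper's implementation: the bin count, the value of k, the Euclidean
distance, and the synthetic toy "images" are all assumptions made for
illustration, and the feature optimization step mentioned in the abstract
is not reproduced. The names channel_histogram, color_features,
knn_predict, and toy_pixels are hypothetical.

```python
import colorsys
import math
import random
from collections import Counter

BINS = 8  # assumed bins per channel; the abstract does not state a value


def channel_histogram(values, bins=BINS):
    """Normalized histogram of values that lie in [0, 1]."""
    counts = [0] * bins
    for v in values:
        # Clamp so that v == 1.0 falls into the last bin.
        counts[min(int(v * bins), bins - 1)] += 1
    total = len(values) or 1
    return [c / total for c in counts]


def color_features(pixels, bins=BINS):
    """Concatenate RGB and HSV histograms for (r, g, b) pixels in 0..255."""
    rgb = [(r / 255.0, g / 255.0, b / 255.0) for r, g, b in pixels]
    hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in rgb]
    features = []
    for space in (rgb, hsv):
        for ch in range(3):
            features.extend(channel_histogram([p[ch] for p in space], bins))
    return features


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors nearest to query.

    train is a list of (feature_vector, label) pairs; distance is
    Euclidean, which is an assumption rather than the paper's choice.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def toy_pixels(rng, flat, n=200):
    """Synthetic stand-in for an image (illustration only, not real data).

    flat=True draws from a tiny palette, mimicking rendered text on a
    plain background; flat=False draws uniformly random colors.
    """
    palette = [(255, 255, 255), (255, 0, 0), (0, 0, 255)]
    if flat:
        return [rng.choice(palette) for _ in range(n)]
    return [tuple(rng.randrange(256) for _ in range(3)) for _ in range(n)]


rng = random.Random(0)
train = [(color_features(toy_pixels(rng, True)), "spam") for _ in range(5)]
train += [(color_features(toy_pixels(rng, False)), "ham") for _ in range(5)]

query = color_features(toy_pixels(rng, True))
print(knn_predict(train, query, k=3))
```

Using only the RGB half or only the HSV half of the feature vector gives
the two single-histogram variants the abstract compares against the
combined feature set.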




