Layout Based Spam Filtering

As the volume of information available to applications grows steadily in fields ranging from medical diagnosis to web search engines, accurate support for similarity comparison becomes an important task. Spam filtering is one such case: the similarity between known and incoming messages is the basis for the spam/not-spam decision. We present a novel approach to filtering based solely on message layout, whose goal is not only to identify spam correctly but also to warn about major emerging threats. We propose a mathematical formulation of the email message layout and, based on it, develop an algorithm that separates different types of emails and finds new, numerically relevant spam types.

