Detecting Email Forgery using Random Forests and Naïve Bayes Classifiers

Because email communication has no consistent authentication procedure to guarantee authenticity, we present an investigative analysis approach for detecting forged emails based on Random Forests and Naïve Bayes classifiers. Instead of examining the email headers, we use the body content to extract a distinctive writing style for each possible suspect. Our approach consists of four main steps: (1) the cybercrime investigator extracts a set of effective features, including structural, lexical, linguistic, and syntactic evidence, from previous emails of all possible suspects; (2) the extracted feature vectors are normalized to increase the accuracy rate; (3) the normalized features are used to train the learning engine; and (4) upon receiving the anonymous email (M), we apply the same feature extraction process to produce its feature vector. Finally, using the machine learning classifiers, the email is assigned to the suspect whose writing style most closely matches M. Experimental results on real data sets show the improved performance of the proposed method and its ability to identify authors with a very limited number of features.
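The four steps can be illustrated with a minimal sketch, given here only as an assumption-laden example and not as the implementation evaluated in the paper. The six features, the min-max normalization, the variance floor, the function names, and the toy emails are all hypothetical choices made for illustration. Only a Gaussian Naïve Bayes classifier is shown, written with the Python standard library; the Random Forests classifier is omitted for brevity.

```python
import math
import re
from collections import defaultdict

def extract_features(body):
    """Step 1: map an email body to a small stylometric vector (illustrative features only)."""
    words = re.findall(r"[A-Za-z']+", body)
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    n_words = max(len(words), 1)
    n_chars = max(len(body), 1)
    return [
        sum(len(w) for w in words) / n_words,          # lexical: mean word length
        len({w.lower() for w in words}) / n_words,     # lexical: vocabulary richness
        n_words / max(len(sentences), 1),              # syntactic: words per sentence
        sum(body.count(p) for p in ",;:") / n_chars,   # syntactic: punctuation rate
        sum(c.isupper() for c in body) / n_chars,      # structural: upper-case rate
        body.count("\n") / n_chars,                    # structural: line-break rate
    ]

def fit_minmax(vectors):
    """Step 2a: learn per-feature minimum and maximum from the training vectors."""
    columns = list(zip(*vectors))
    return [min(c) for c in columns], [max(c) for c in columns]

def normalize(vector, lows, highs):
    """Step 2b: rescale each feature; constant features map to 0.0."""
    return [(x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(vector, lows, highs)]

def train_gaussian_nb(vectors, labels, var_floor=1e-3):
    """Step 3: per-author log prior, feature means, and floored feature variances."""
    by_author = defaultdict(list)
    for vector, author in zip(vectors, labels):
        by_author[author].append(vector)
    model = {}
    for author, rows in by_author.items():
        columns = list(zip(*rows))
        means = [sum(c) / len(c) for c in columns]
        variances = [sum((x - m) ** 2 for x in c) / len(c) + var_floor
                     for c, m in zip(columns, means)]
        model[author] = (math.log(len(rows) / len(vectors)), means, variances)
    return model

def predict(model, vector):
    """Step 4: return the author with the highest Gaussian log posterior."""
    def log_posterior(author):
        log_prior, means, variances = model[author]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * s) - (x - m) ** 2 / (2 * s)
            for x, m, s in zip(vector, means, variances))
    return max(model, key=log_posterior)

def identify_author(training, anonymous_body):
    """Run all four steps; `training` is a list of (author, email body) pairs."""
    raw = [extract_features(body) for _, body in training]
    lows, highs = fit_minmax(raw)
    model = train_gaussian_nb([normalize(v, lows, highs) for v in raw],
                              [author for author, _ in training])
    return predict(model, normalize(extract_features(anonymous_body), lows, highs))

# Toy corpus (invented for this sketch, not taken from any real data set).
TRAINING = [
    ("alice", "Dear colleagues, I would appreciate receiving the quarterly "
              "documentation, including appendices, before tomorrow."),
    ("alice", "Regarding yesterday's discussion, the committee recommends "
              "postponing the evaluation, pending additional information."),
    ("bob", "OK. Got it. Will do. Call me."),
    ("bob", "NO WAY. Not now. Ask Tim. Bye."),
]

if __name__ == "__main__":
    print(identify_author(TRAINING, "Fine. See you. Send it. Go now."))
```

In this sketch the normalization bounds are learned from the suspects' previous emails and reused unchanged for the anonymous email M, so that M is scaled consistently with the training vectors. A realistic investigation would need far more emails per suspect and a richer feature set than the six used here.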



