Detecting Email Forgery using Random Forests
and Naïve Bayes Classifiers
Because email communication has no consistent authentication
procedure to ensure authenticity, we present an investigative
analysis approach for detecting forged emails based on Random Forests
and Naïve Bayes classifiers. Instead of investigating the email
headers, we use the body content to extract a unique writing style for
each possible suspect. Our approach consists of four main steps:
(1) the cybercrime investigator extracts effective features,
including structural, lexical, linguistic, and syntactic evidence,
from previous emails of all the possible suspects; (2) the extracted
feature vectors are normalized to increase the accuracy rate; (3) the
normalized features are used to train the learning engine; and
(4) upon receiving the anonymous email (M), we apply the feature
extraction process to produce a feature vector. Finally, using the
machine learning classifiers, the email is assigned to the suspect
whose writing style most closely matches M. Experimental results on
real data sets show the improved performance of the proposed method
and its ability to identify the authors with a very limited number
of features.
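The four steps can be illustrated with a minimal sketch, given here under stated assumptions rather than as the authors' implementation. The features below are a small illustrative set, not the feature set used in the paper, and every function name is hypothetical. Only a Gaussian Naïve Bayes classifier is shown; a Random Forests classifier would be trained on the same normalized vectors and is omitted for brevity.

```python
import math
import re
from collections import defaultdict

# Illustrative subset of function words; an assumption, not the paper's list.
FUNCTION_WORDS = ("the", "of", "and", "to", "i")


def extract_features(body):
    """Step 1: map an email body to a small stylometric feature vector."""
    words = re.findall(r"[A-Za-z']+", body)
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    n_words = max(len(words), 1)
    n_chars = max(len(body), 1)
    lower = [w.lower() for w in words]
    features = [
        sum(len(w) for w in words) / n_words,      # lexical: mean word length
        len(set(lower)) / n_words,                 # lexical: vocabulary richness
        n_words / max(len(sentences), 1),          # syntactic: words per sentence
        sum(c in ",;:" for c in body) / n_chars,   # syntactic: punctuation rate
        sum(c.isupper() for c in body) / n_chars,  # structural: uppercase rate
        body.count("\n") / n_chars,                # structural: line-break rate
    ]
    # Linguistic: relative frequency of each function word.
    features += [lower.count(fw) / n_words for fw in FUNCTION_WORDS]
    return features


def fit_min_max(vectors):
    """Step 2a: record the per-feature minimum and maximum of the training set."""
    columns = list(zip(*vectors))
    return [min(c) for c in columns], [max(c) for c in columns]


def normalize(vector, lows, highs):
    """Step 2b: min-max scale a vector; constant features map to 0.0."""
    return [(x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(vector, lows, highs)]


def train_gaussian_nb(vectors, labels, var_floor=1e-3):
    """Step 3: per-author log prior, feature means, and floored variances."""
    by_author = defaultdict(list)
    for vector, author in zip(vectors, labels):
        by_author[author].append(vector)
    model = {}
    for author, rows in by_author.items():
        columns = list(zip(*rows))
        means = [sum(c) / len(c) for c in columns]
        variances = [sum((x - m) ** 2 for x in c) / len(c) + var_floor
                     for c, m in zip(columns, means)]
        model[author] = (math.log(len(rows) / len(vectors)), means, variances)
    return model


def predict(model, vector):
    """Step 4: return the author with the highest log posterior."""
    def log_posterior(author):
        prior, means, variances = model[author]
        return prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(vector, means, variances))
    return max(model, key=log_posterior)


def attribute(known_emails, anonymous_body):
    """Run all four steps; known_emails is a list of (author, body) pairs."""
    labels = [author for author, _ in known_emails]
    raw = [extract_features(body) for _, body in known_emails]
    lows, highs = fit_min_max(raw)
    model = train_gaussian_nb([normalize(v, lows, highs) for v in raw], labels)
    return predict(model, normalize(extract_features(anonymous_body), lows, highs))
```

Calling `attribute(known_emails, body_of_M)` returns the suspect whose training emails best explain the feature vector of M. The scaling bounds are fitted on the suspects' previous emails only and then reused for M, so the anonymous email is measured on the same scale as the training data.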
@article{"International Journal of Information, Control and Computer Sciences:52455",
  author = "Emad E Abdallah and A.F. Otoom and Arwa Saqer and Ola Abu-Aisheh and Diana Omari and Ghadeer Salem",
  title = "Detecting Email Forgery using Random Forests and Naïve Bayes Classifiers",
  abstract = "Because email communication has no consistent authentication
    procedure to ensure authenticity, we present an investigative
    analysis approach for detecting forged emails based on Random Forests
    and Naïve Bayes classifiers. Instead of investigating the email
    headers, we use the body content to extract a unique writing style for
    each possible suspect. Our approach consists of four main steps:
    (1) the cybercrime investigator extracts effective features,
    including structural, lexical, linguistic, and syntactic evidence,
    from previous emails of all the possible suspects; (2) the extracted
    feature vectors are normalized to increase the accuracy rate; (3) the
    normalized features are used to train the learning engine; and
    (4) upon receiving the anonymous email (M), we apply the feature
    extraction process to produce a feature vector. Finally, using the
    machine learning classifiers, the email is assigned to the suspect
    whose writing style most closely matches M. Experimental results on
    real data sets show the improved performance of the proposed method
    and its ability to identify the authors with a very limited number
    of features.",
  keywords = "Digital investigation, cybercrimes, email forensics, anonymous emails, writing style, authorship analysis",
  volume = "6",
  number = "3",
  pages = "298-5",
}