A Comparative Study of Malware Detection Techniques Using Machine Learning Methods

In the past few years, the amount of malicious software has increased exponentially; as a result, machine learning algorithms have become instrumental in distinguishing clean files from malware through (semi-)automated classification. When working with very large datasets, the major challenge is to reach both a very high malware detection rate and a very low false positive rate. Another challenge is to minimize the time the machine learning algorithm needs to do so. This paper presents a comparative study of different machine learning techniques, such as linear classifiers, ensembles, decision trees, and various hybrids thereof. The training dataset consists of approximately 2 million clean files and 200,000 infected files, which is a realistic quantitative mixture. The paper investigates the above-mentioned methods with respect to both their performance (detection rate and false positive rate) and their practicability.
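
As an illustration of the two performance measures named above, the following minimal sketch computes them from true and predicted labels. It assumes the standard definitions (detection rate = TP / (TP + FN), false positive rate = FP / (FP + TN)) and a hypothetical label encoding of 1 for malware and 0 for clean; neither the function names nor the encoding come from the paper.

```python
from typing import Sequence, Tuple


def detection_and_fp_rate(y_true: Sequence[int],
                          y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (detection rate, false positive rate).

    Assumed encoding (not from the paper): 1 = malware, 0 = clean.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # clean file flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # clean file passed
    return tp / (tp + fn), fp / (fp + tn)


def expected_false_alarms(n_clean: int, fp_rate: float) -> int:
    """Number of clean files wrongly flagged at a given false positive rate."""
    return round(n_clean * fp_rate)


if __name__ == "__main__":
    # Toy labels: 4 malware samples followed by 10 clean samples.
    y_true = [1, 1, 1, 1] + [0] * 10
    y_pred = [1, 1, 1, 0] + [1] + [0] * 9
    dr, fpr = detection_and_fp_rate(y_true, y_pred)
    print(f"detection rate = {dr:.2f}, false positive rate = {fpr:.2f}")
    # With roughly 2 million clean files, even a 0.1% rate is many alarms.
    print(expected_false_alarms(2_000_000, 0.001))
```

The second helper shows why a very low false positive rate matters at this scale: with about 2 million clean files, a rate of only 0.1% would still flag about 2,000 clean files as malware.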



