An Efficient Framework to Build Up Malware Dataset

This paper presents a framework for building a malware dataset.
Many researchers spend considerable time cleaning noise from a
dataset or transforming it into a format that can be used directly
for testing. This research therefore proposes a framework that helps
researchers speed up the malware dataset cleaning processes, so that
the resulting dataset can be used for testing. Efficient dataset
cleaning processes are believed to improve the quality of the data
and, in turn, the accuracy and efficiency of the subsequent analysis.
In addition, an in-depth understanding of malware taxonomy is
important before and during the dataset cleaning processes. A new
Trojan classification has been proposed to complement this
framework. The experiment was conducted in a controlled lab
environment using the VxHeavens dataset. The framework integrates
static and dynamic analyses, an incident response method and the
knowledge discovery in databases (KDD) processes. It can be used as
a baseline guideline for malware researchers building malware
datasets.




