An Analysis of the Classification of Imbalanced Datasets Using the Synthetic Minority Over-Sampling Technique

Analysing imbalanced datasets is one of the challenges that practitioners in the machine learning field face, and many studies have been carried out to determine the effectiveness of the synthetic minority over-sampling technique (SMOTE) in addressing this issue. The aim of this study was therefore to compare the effectiveness of SMOTE across different models on imbalanced datasets. Three classification models (Logistic Regression, Support Vector Machine, and Nearest Neighbour) were tested on multiple datasets; the same datasets were then oversampled with SMOTE and applied to the three models again to compare the differences in performance. The experimental results show that a higher number of nearest neighbours yields lower error rates.
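
To make the experimental setup concrete, the following is a minimal sketch of the pipeline described above, assuming the scikit-learn and imbalanced-learn libraries are available; the synthetic dataset, train/test split, and hyperparameters are illustrative assumptions, not the datasets or settings used in this study.

    # Sketch: compare classifier error rates before and after SMOTE.
    # Dataset and parameters are illustrative, not those of the study.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from imblearn.over_sampling import SMOTE

    # Illustrative imbalanced dataset: roughly 10% minority class.
    X, y = make_classification(n_samples=2000, n_features=10,
                               weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Support Vector Machine": SVC(),
        "Nearest Neighbour": KNeighborsClassifier(n_neighbors=5),
    }

    # Oversample only the training split; the test set stays untouched.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

    for name, model in models.items():
        err_raw = 1 - model.fit(X_train, y_train).score(X_test, y_test)
        err_smote = 1 - model.fit(X_res, y_res).score(X_test, y_test)
        print(f"{name}: error {err_raw:.3f} -> {err_smote:.3f} with SMOTE")

Note that SMOTE is applied only to the training split, so the synthetic minority samples it generates (each interpolated between a minority instance and one of its nearest minority-class neighbours) never leak into the evaluation data.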
