Improving Classification Accuracy with Discretization on Datasets Including Continuous Valued Features

This study analyzes the effect of discretization on the classification of datasets containing continuous-valued features. Six UCI datasets with continuous-valued features are discretized using an entropy-based discretization method. Classification performance on the original features and on the discretized features is compared using the k-nearest neighbors, Naive Bayes, C4.5, and CN2 classification algorithms. As a result, the average classification accuracy of the six datasets improves by between 1.71% and 12.31%.
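To illustrate the kind of entropy-based discretization described above, the sketch below recursively picks the cut point that minimizes weighted class entropy. This is a simplified illustration, not the paper's implementation: it uses a fixed recursion-depth limit as a stand-in for the MDL stopping criterion of the Fayyad-Irani method, and all function names are our own.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the cut point minimizing the weighted entropy of the two halves."""
    pairs = sorted(zip(values, labels))
    best_cut, best_ent = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut between identical feature values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= cut]
        right = [l for v, l in pairs if v > cut]
        ent = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if ent < best_ent:
            best_cut, best_ent = cut, ent
    return best_cut

def discretize(values, labels, max_depth=3):
    """Greedy recursive binary splitting; max_depth is an assumed
    stand-in for the MDL stopping rule of Fayyad & Irani (1993)."""
    if max_depth == 0 or entropy(labels) == 0:
        return []  # stop on purity or depth limit
    cut = best_split(values, labels)
    if cut is None:
        return []
    left = [(v, l) for v, l in zip(values, labels) if v <= cut]
    right = [(v, l) for v, l in zip(values, labels) if v > cut]
    cuts = [cut]
    cuts += discretize([v for v, _ in left], [l for _, l in left], max_depth - 1)
    cuts += discretize([v for v, _ in right], [l for _, l in right], max_depth - 1)
    return sorted(cuts)

# Two well-separated classes yield a single cut between them.
print(discretize([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"]))
```

The returned cut points define the interval boundaries used to map each continuous value to a discrete bin before training the classifiers.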



