Classification Influence Index and Its Application to the k-Nearest Neighbor Classifier

Classification is an important topic in machine learning and bioinformatics, and many datasets have been introduced for classification tasks. A dataset contains multiple features, and the quality of those features affects the classification accuracy that can be achieved on it; individual features differ in their discriminative power. In this study, we propose the Classification Influence Index (CII) as an indicator of the classification power of each feature. The CII makes it possible to evaluate the features of a dataset and to improve classification accuracy by transforming the dataset accordingly. In experiments on real datasets using the CII together with the k-nearest neighbor classifier, we confirmed that the proposed index yields a meaningful improvement in classification accuracy.
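To make the general idea concrete, the sketch below shows one way a per-feature score could be used to transform a dataset before k-nearest neighbor classification. The scoring heuristic (a simple between-class separation ratio), the function names, and the toy data are all assumptions for illustration only; they are not the paper's actual CII definition, which is not given in this abstract.

```python
# Hypothetical sketch: per-feature "influence" scores used to re-weight a
# dataset before k-NN classification. The separation-ratio heuristic below is
# an assumed stand-in for the CII, not the index defined in the paper.
import numpy as np


def feature_scores(X, y):
    """Score each feature by how well it separates two classes (assumed heuristic)."""
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        col = X[:, j]
        m0, m1 = col[y == 0].mean(), col[y == 1].mean()
        s0, s1 = col[y == 0].std(), col[y == 1].std()
        # Larger gap between class means relative to their spread -> higher score.
        scores[j] = abs(m0 - m1) / (s0 + s1 + 1e-12)
    return scores


def knn_predict(X_train, y_train, X_test, weights, k=3):
    """Plain k-NN on feature-weighted data: each feature is scaled by its score."""
    Xw_train = X_train * weights
    Xw_test = X_test * weights
    preds = []
    for x in Xw_test:
        dists = np.linalg.norm(Xw_train - x, axis=1)
        nearest = y_train[np.argsort(dists)[:k]]
        preds.append(np.bincount(nearest).argmax())  # majority vote
    return np.array(preds)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: feature 0 is informative, feature 1 is pure noise.
    y = rng.integers(0, 2, size=200)
    X = np.column_stack([y + rng.normal(0, 0.5, 200), rng.normal(0, 1, 200)])
    w = feature_scores(X[:150], y[:150])
    print("feature scores:", w)
    acc = (knn_predict(X[:150], y[:150], X[150:], w) == y[150:]).mean()
    print("weighted k-NN accuracy:", acc)
```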

Authors:


