Feature Selection Approaches with Missing Values Handling for Data Mining - A Case Study of Heart Failure Dataset

In this paper, we investigate how the characteristics of a clinical dataset affect feature selection and classification measurements in the presence of missing values. We also propose appropriate techniques for the aim of this research, which is to find the features that most strongly affect mortality and the mortality time frame. We quantify the complexity of the clinical dataset and, based on that complexity, propose a data mining process that copes with missing values, high dimensionality, and the prediction problem by using missing value replacement, feature selection, and classification. The experimental results will be extended to develop a prediction model for cardiology.
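The three stages named above (missing value replacement, feature selection, classification) can be illustrated with a minimal sketch. The abstract does not state which concrete methods are used, so everything here is an assumption chosen for brevity: mean imputation, ranking features by a two-class Welch t-statistic, and a nearest-centroid classifier. The toy records, the binary mortality labels, and all function names are hypothetical.

```python
import math

def impute_mean(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(sum(observed) / len(observed))
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

def t_score(values, labels):
    """Absolute Welch t-statistic of one feature between classes 0 and 1."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    denom = math.sqrt(var_a / len(a) + var_b / len(b))
    return abs(mean_a - mean_b) / denom if denom > 0 else 0.0

def select_top_k(rows, labels, k):
    """Return the indices of the k features with the largest t-scores."""
    n_cols = len(rows[0])
    scores = [t_score([r[j] for r in rows], labels) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: scores[j], reverse=True)[:k]

def fit_centroids(rows, labels, features):
    """Compute the per-class mean vector over the selected features."""
    centroids = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [sum(r[j] for r in members) / len(members)
                        for j in features]
    return centroids

def predict(row, centroids, features):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    def dist(c):
        return sum((row[j] - m) ** 2 for j, m in zip(features, centroids[c]))
    return min(centroids, key=dist)

# Hypothetical toy records: three numeric features, None marks a missing value.
rows = [
    [1.0, 5.0, None],
    [1.2, 3.0, 0.9],
    [0.8, 4.0, 1.1],
    [3.0, 4.5, 2.0],
    [3.2, None, 2.2],
    [2.8, 3.5, None],
]
labels = [0, 0, 0, 1, 1, 1]  # assumed encoding: 1 = died, 0 = survived

filled = impute_mean(rows)                     # stage 1: missing value replacement
selected = select_top_k(filled, labels, k=2)   # stage 2: feature selection
centroids = fit_centroids(filled, labels, selected)  # stage 3: classification

print("selected feature indices:", selected)
print("prediction:", predict([0.9, 4.0, 1.0], centroids, selected))
```

Each stage is a separate function so that any one of them can be swapped for a different method without touching the others. A real study would also fit the imputation and selection steps on training folds only, to avoid leaking information from the test data.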



