Comparison of Imputation Techniques for Efficient Prediction of Software Fault Proneness in Classes

Missing data is a persistent problem in almost all areas of empirical research. Because data plays a fundamental role in every analysis, missing values must be treated carefully: improper treatment can distort the analysis or produce biased results. In this paper, we compare various imputation techniques on data sets with missing values and empirically evaluate them with the aim of constructing quality software models. Our empirical study is based on two public NASA data sets, KC4 and KC1. The original data sets, of 125 and 2107 cases respectively, contain no missing values. Missing at Random (MAR) data were created from these data sets. Listwise Deletion (LD), Mean Substitution (MS), Interpolation, Regression with an error term, and Expectation-Maximization (EM) approaches were then applied to compare the effects of the various techniques.
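
To make the setup concrete, the following is a minimal sketch of how MAR missingness can be induced in a complete data set and how three of the listed treatments (LD, MS, and regression with an error term) might then be applied. It is illustrative only and is not the procedure used in the study: the data are synthetic rather than KC1 or KC4, and the variable names (`loc`, `complexity`), the missingness rates, and the RMSE comparison are all assumptions made for the example. Interpolation and EM are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for two software metrics (NOT the KC1/KC4 data).
n = 200
loc = rng.uniform(10, 500, size=n)             # hypothetical size metric
complexity = 0.05 * loc + rng.normal(0, 2, n)  # hypothetical correlated metric

# MAR: the chance that `complexity` is missing depends only on the
# fully observed `loc`, not on the value of `complexity` itself.
p_missing = np.where(loc > np.median(loc), 0.4, 0.1)  # assumed rates
missing = rng.random(n) < p_missing
observed = ~missing
y = complexity.copy()
y[missing] = np.nan

# Listwise deletion (LD): drop every case that has a missing value.
ld_loc, ld_y = loc[observed], y[observed]

# Mean substitution (MS): replace each missing value with the observed mean.
ms_y = np.where(missing, np.nanmean(y), y)

# Regression with an error term: fit y ~ loc on the complete cases, then
# add noise drawn from the residual distribution to each prediction.
slope, intercept = np.polyfit(ld_loc, ld_y, 1)
residual_sd = np.std(ld_y - (slope * ld_loc + intercept), ddof=2)
reg_y = y.copy()
reg_y[missing] = (slope * loc[missing] + intercept
                  + rng.normal(0, residual_sd, missing.sum()))

def rmse(imputed):
    """Error of imputed values against the known true values."""
    diff = imputed[missing] - complexity[missing]
    return float(np.sqrt(np.mean(diff ** 2)))

print("cases kept by LD:", ld_y.size, "of", n)
print("RMSE, mean substitution:", rmse(ms_y))
print("RMSE, regression with error term:", rmse(reg_y))
```

Because the true values are known before deletion, each treatment can be scored directly against them, which is the same reason for starting from complete data sets.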



