A Distance Function for Data with Missing Values and Its Application

Missing values are common in real-world data. Since the performance of many data mining algorithms depends critically on the quality of the metric over the input space, we define in this paper a distance function for unlabeled datasets with missing values. Our distance function is based on the Bhattacharyya distance, which measures the similarity of two probability distributions. Under this measure, the distance between two points with no missing attribute values reduces to the Mahalanobis distance; when one of the coordinates is missing, the distance is computed according to the distribution of the missing coordinate. The proposed distance is general and can be used in any algorithm that computes distances between data points. Because its performance depends strongly on the chosen distance measure, we use the k-nearest-neighbor (kNN) classifier to evaluate how accurately the proposed distance reflects object similarity. We experiment on standard numerical datasets from several domains in the UCI repository, simulate missing values on them, and compare the performance of the kNN classifier under our distance to three other basic methods. Our experiments show that kNN using our distance function outperforms kNN using the other methods, while its runtime is only slightly higher.
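To make the idea concrete, the sketch below illustrates one way such a missing-aware distance can be implemented; it is not the paper's exact derivation. It treats each data point as a diagonal Gaussian: an observed coordinate is given its value as mean and a small fixed variance EPS, while a missing coordinate receives the feature's marginal mean and variance estimated from the training data, and the closed-form Bhattacharyya distance between the two Gaussians serves as the point-to-point distance. The names (missing_aware_distance, fit_marginals) and the diagonal-covariance simplification are illustrative assumptions, whereas the paper works with the full data covariance so that the complete-data case reduces to the Mahalanobis distance.

```python
# Minimal sketch of a Bhattacharyya-based distance for data with missing values
# (illustrative only; the paper's exact construction may differ).
import numpy as np

EPS = 1e-6  # variance assigned to observed (exactly known) coordinates


def fit_marginals(X):
    """Per-feature mean and variance, ignoring NaNs (hypothetical helper)."""
    mean = np.nanmean(X, axis=0)
    var = np.maximum(np.nanvar(X, axis=0), EPS)  # guard against zero variance
    return mean, var


def as_gaussian(x, feat_mean, feat_var):
    """Represent a possibly-incomplete point as a diagonal Gaussian (mu, var)."""
    missing = np.isnan(x)
    mu = np.where(missing, feat_mean, x)
    var = np.where(missing, feat_var, EPS)
    return mu, var


def bhattacharyya_distance(mu1, var1, mu2, var2):
    """Closed-form Bhattacharyya distance between two diagonal Gaussians."""
    var = 0.5 * (var1 + var2)
    term_mean = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    term_cov = 0.5 * np.sum(np.log(var / np.sqrt(var1 * var2)))
    return term_mean + term_cov


def missing_aware_distance(x, y, feat_mean, feat_var):
    """Distance between two points, either of which may contain NaNs."""
    mu1, var1 = as_gaussian(x, feat_mean, feat_var)
    mu2, var2 = as_gaussian(y, feat_mean, feat_var)
    return bhattacharyya_distance(mu1, var1, mu2, var2)


def knn_predict(X_train, y_train, x_query, k=3):
    """Majority-vote kNN using the missing-aware distance above."""
    feat_mean, feat_var = fit_marginals(X_train)
    dists = np.array([missing_aware_distance(x_query, x, feat_mean, feat_var)
                      for x in X_train])
    nearest = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]


if __name__ == "__main__":
    X = np.array([[1.0, 2.0], [1.1, np.nan], [5.0, 6.0], [5.2, 6.1]])
    y = np.array([0, 0, 1, 1])
    print(knn_predict(X, y, np.array([np.nan, 2.1]), k=3))
```

As in the evaluation described above, such a distance can be dropped into any distance-based algorithm; the kNN classifier here is just the simplest way to test whether it ranks similar objects close together.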




