Distances over Incomplete Diabetes and Breast Cancer Data Based on Bhattacharyya Distance

Missing values in real-world datasets are a common
problem. Many algorithms have been developed to deal with this
problem; most of them replace the missing values with a fixed
value computed from the observed values. In
our work, we used a distance function based on the Bhattacharyya
distance to measure the distance between objects with missing
values. The Bhattacharyya distance measures the similarity of
two probability distributions. The proposed distance distinguishes
between known and unknown values: the distance between
two known values is the Mahalanobis distance, whereas when
one of them is missing, the distance is computed from the
distribution of the known values for the coordinate that contains
the missing value. This method was integrated with Wikaya, a
digital health company developing a platform that helps improve the
prevention of chronic diseases such as diabetes and cancer. For
Wikaya's recommendation system to work, the distance between users
needs to be measured. Since the collected data contain missing values,
a distance function is needed that measures distances between
incomplete user profiles. To evaluate the accuracy of the proposed
distance function in reflecting the actual similarity between
objects, some of which contain missing values, we integrated it
into a k-nearest neighbors (kNN) classifier, since
its computation is based only on the similarity between objects. To
validate this, we ran the algorithm on the diabetes and breast cancer
datasets, standard benchmark datasets from the UCI repository. Our
experiments show that the kNN classifier using our proposed distance
function outperforms kNN using other existing methods.
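
To illustrate how a missing-value-aware distance can be plugged into a kNN classifier, the following is a minimal sketch and not the distance function proposed in this work. As a stand-in, it assumes that two known values are compared by a variance-normalized squared difference (a diagonal Mahalanobis distance), and that a missing value is handled by the expected squared difference under the empirical mean and variance of the known values in that coordinate. The names fit_stats, missing_aware_distance and knn_predict are hypothetical, and the toy data are invented for illustration only.

```python
from collections import Counter

import numpy as np


def fit_stats(X):
    """Per-coordinate mean and variance of the known (non-NaN) values."""
    mean = np.nanmean(X, axis=0)
    var = np.nanvar(X, axis=0)
    # Guard against zero variance so the normalization below is defined.
    var = np.where(var > 0, var, 1.0)
    return mean, var


def missing_aware_distance(x, y, mean, var):
    """Stand-in distance (assumption, not the paper's formula).

    Known vs. known:   (x - y)^2 / var
    Known vs. missing: ((x - mean)^2 + var) / var
    Both missing:      2 * var / var
    """
    x_miss, y_miss = np.isnan(x), np.isnan(y)
    # Replace missing entries by the coordinate mean ...
    xf = np.where(x_miss, mean, x)
    yf = np.where(y_miss, mean, y)
    sq = (xf - yf) ** 2
    # ... and add one variance term per missing entry.
    sq = sq + var * (x_miss.astype(float) + y_miss.astype(float))
    return float(np.sqrt(np.sum(sq / var)))


def knn_predict(X_train, y_train, query, k=3):
    """Majority vote among the k training rows closest to the query."""
    mean, var = fit_stats(X_train)
    dists = [missing_aware_distance(row, query, mean, var) for row in X_train]
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]


# Toy example with invented values; NaN marks a missing entry.
X_train = np.array([[1.0, 2.0], [1.2, np.nan], [5.0, 6.0], [np.nan, 6.2]])
y_train = [0, 0, 1, 1]
query = np.array([1.1, np.nan])
prediction = knn_predict(X_train, y_train, query, k=3)
```

Because kNN depends only on pairwise distances, swapping in a different distance, such as one derived from the Bhattacharyya distance, would require changing only missing_aware_distance in this sketch.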



