Abstract: This study demonstrates an alternative stochastic imputation approach for large datasets when preferred commercial packages struggle to iterate due to numerical problems. A large country conflict dataset motivates the search to impute missing values well over a common threshold of 20% missingness. The methodology capitalizes on correlation while using model residuals to provide the uncertainty in estimating unknown values. Examination of the methodology provides insight toward choosing linear or nonlinear modeling terms. Static tolerances common in most packages are replaced with tailorable tolerances that exploit residuals to fit each data element. The methodology evaluation includes observing computation time, model fit, and the comparison of known values to replaced values created through imputation. Overall, the country conflict dataset illustrates promise with modeling first-order interactions, while presenting a need for further refinement that mimics predictive mean matching.
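The core idea of imputing from a fitted model and drawing the uncertainty from that model's residuals can be sketched as follows. This is a minimal illustration, not the study's implementation: it assumes a single predictor, a straight-line fit, and residuals resampled at random. The function names and the toy data are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def stochastic_impute(xs, ys, rng):
    """Fill None entries of ys with a model prediction plus a resampled residual."""
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    # Residuals of the observed cases supply the uncertainty for imputed values.
    residuals = [y - (a + b * x) for x, y in observed]
    return [y if y is not None else a + b * x + rng.choice(residuals)
            for x, y in zip(xs, ys)]

# Toy data (made up): two of six responses are missing.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, None, 8.2, None, 11.8]
completed = stochastic_impute(xs, ys, random.Random(0))
```

Adding a resampled residual instead of the bare prediction keeps the imputed values from sitting exactly on the fitted line, which would otherwise understate the variability of the completed data.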
Abstract: A major challenge in medical studies, especially longitudinal ones, is the problem of missing measurements, which hinders the effective application of many machine learning algorithms. Furthermore, recent Alzheimer's Disease studies have focused on the delineation of Early Mild Cognitive Impairment (EMCI) and Late Mild Cognitive Impairment (LMCI) from cognitively normal controls (CN), which is essential for developing effective and early treatment methods. To address these challenges, this paper explores the potential of the eXtreme Gradient Boosting (XGBoost) algorithm for handling missing values in multiclass classification. We seek a generalized classification scheme where all prodromal stages of the disease are considered simultaneously in the classification and decision-making processes. Given the large number of subjects (1631) included in this study and the presence of almost 28% missing values, we investigated the performance of XGBoost on the classification of the four classes AD, CN, EMCI, and LMCI. Using 10-fold cross-validation, XGBoost is shown to outperform other state-of-the-art classification algorithms by 3% in terms of accuracy and F-score. Our model achieved an accuracy of 80.52%, a precision of 80.62%, and a recall of 80.51%, supporting the more natural and promising multiclass classification.
Abstract: Missing values in real-world datasets are a common
problem. Many algorithms have been developed to deal with this
problem; most of them replace the missing values with a fixed
value computed from the observed values. In
our work, we used a distance function based on the Bhattacharyya
distance, which measures the similarity of two probability
distributions, to measure the distance between objects with missing
values. The proposed distance distinguishes between known and
unknown values: the distance between two known values is the
Mahalanobis distance, whereas when one of them is missing, the
distance is computed from the distribution of the known values for
the coordinate that contains the missing value.
This method was integrated with Wikaya, a
digital health company developing a platform that helps to improve
prevention of chronic diseases such as diabetes and cancer. For
Wikaya’s recommendation system to work, the distance between users
needs to be measured. Since the collected data contain missing
values, a distance function is needed that measures distances between
incomplete user profiles. To evaluate the accuracy of the proposed
distance function in reflecting the actual similarity between different
objects, when some of them contain missing values, we integrated it
within the framework of the k nearest neighbors (kNN) classifier, since
its computation is based only on the similarity between objects. To
validate this, we ran the algorithm over diabetes and breast cancer
datasets, standard benchmark datasets from the UCI repository. Our
experiments show that the kNN classifier using our proposed distance
function outperforms kNN using other existing methods.
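To make the known/missing distinction concrete, the sketch below plugs a missing-aware distance into a kNN vote. It is an illustration under stated assumptions, not the formula from the paper: coordinates are treated independently (a diagonal covariance), and a missing coordinate is handled by the expected squared difference under that coordinate's observed mean and variance instead of a Bhattacharyya-based term. All names and the toy data are hypothetical.

```python
from collections import Counter

def column_stats(rows):
    """Mean and variance of each coordinate, using observed values only."""
    stats = []
    for j in range(len(rows[0])):
        vals = [r[j] for r in rows if r[j] is not None]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats.append((mean, var))
    return stats

def missing_aware_sq_distance(a, b, stats):
    """Variance-scaled squared distance; None marks a missing coordinate."""
    total = 0.0
    for x, y, (mean, var) in zip(a, b, stats):
        if x is not None and y is not None:
            total += (x - y) ** 2 / var              # both known
        elif x is None and y is None:
            total += 2.0                             # E[(X-Y)^2]/var, X and Y independent
        else:
            known = x if x is not None else y
            total += ((known - mean) ** 2 + var) / var   # E[(known-Y)^2]/var
    return total

def knn_predict(train_rows, labels, query, k=3):
    """Majority vote among the k training rows closest to the query."""
    stats = column_stats(train_rows)
    ranked = sorted(zip(train_rows, labels),
                    key=lambda pair: missing_aware_sq_distance(pair[0], query, stats))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data (made up): None marks a missing value.
train = [[1.0, 2.0], [1.2, None], [0.9, 1.8], [5.0, 6.0], [None, 6.2], [5.2, 5.9]]
labels = ["a", "a", "a", "b", "b", "b"]
predicted = knn_predict(train, labels, [1.1, None], k=3)
```

Because kNN depends on nothing but the distance, swapping in a different distance function changes only `missing_aware_sq_distance`, which is why the classifier is a convenient test bed for such functions.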
Abstract: In this paper, we propose a method to model the
relationship between failure time and degradation for a simple
step-stress test where the underlying degradation path is linear and different
causes of failure are possible. It is assumed that the intensity function
depends only on the degradation value. No assumptions are made
about the distribution of the failure times. A simple step-stress test
is used to shorten the failure time of products, and a tampered failure
rate (TFR) model is proposed to describe the effect of the changing
stress on the intensities. We assume that some of the products that
fail during the test have a cause of failure that is only known to
belong to a certain subset of all possible failures. This case is known
as masking. In the presence of masking, the maximum likelihood
estimates (MLEs) of the model parameters are obtained through an
expectation-maximization (EM) algorithm by treating the causes of
failure as missing values. The effect of incomplete information on the
estimation of parameters is studied through a Monte-Carlo simulation.
Finally, a real example is analyzed to illustrate the application of the
proposed methods.
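The way an EM algorithm treats masked causes of failure as missing values can be illustrated with a much simpler model than the one in the paper. The sketch below assumes constant cause-specific hazard rates (exponential competing risks), no degradation path, no step-stress or TFR component, and masking that does not depend on the true cause. The function name and the toy data are hypothetical.

```python
def em_masked_causes(times, candidate_sets, n_causes, iterations=200):
    """EM estimates of constant cause-specific hazard rates.

    times[i] is the failure time of unit i; candidate_sets[i] is the set of
    causes that may have produced that failure (a singleton when the cause
    is identified, a larger set when it is masked).
    """
    total_time = sum(times)
    rates = [1.0 / total_time] * n_causes  # arbitrary positive starting values
    for _ in range(iterations):
        expected = [0.0] * n_causes
        for causes in candidate_sets:
            denom = sum(rates[c] for c in causes)
            for c in causes:
                # E-step: probability that cause c produced this failure
                expected[c] += rates[c] / denom
        # M-step: expected number of failures per cause over total exposure
        rates = [e / total_time for e in expected]
    return rates

# Toy data (made up): the fourth failure is masked between causes 0 and 1.
times = [1.0, 2.0, 3.0, 4.0]
candidate_sets = [{0}, {1}, {0}, {0, 1}]
rates = em_masked_causes(times, candidate_sets, n_causes=2)
```

In this toy case the masked failure is split between the two causes in proportion to the current rate estimates, and the iteration settles at rates of 4/15 and 2/15 per unit time.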
Abstract: Pulmonary Function Tests are important non-invasive
diagnostic tests that assess respiratory impairments and provide
quantifiable measures of lung function. Spirometry is the most
frequently used measure of lung function and plays an essential role
in the diagnosis and management of pulmonary diseases. However,
the test requires considerable patient effort and cooperation, which
are markedly related to the age of the patient, and this results in incomplete data
sets. This paper presents nonlinear models built using multivariate
adaptive regression splines (MARS) and random forest (RF) regression to
predict the missing spirometric features. Random forest based feature
selection is used to enhance both the generalization capability and the
model interpretability. In the present study, flow-volume data are
recorded for N= 198 subjects. The ranked order of feature importance
index calculated by the random forests model shows that the
spirometric features FVC, FEF25, PEF, FEF25-75, FEF50 and the
demographic parameter height are the important descriptors. A
comparison of the performance of both models shows that the
prediction ability of MARS with the top two ranked features, namely
FVC and FEF25, is higher, yielding a model fit of R² = 0.96 and
R² = 0.99 for normal and abnormal subjects, respectively. The Root Mean Square
Error analysis of the RF model and the MARS model also shows that
the latter is capable of predicting the missing values of FEV1 with a
notably lower error value of 0.0191 (normal subjects) and 0.0106
(abnormal subjects) with the aforementioned input features. It is
concluded that combining feature selection with a prediction model
provides a minimum subset of predominant features to train the
model, as well as yielding better prediction performance. This
analysis can assist clinicians by providing an intelligent support system for
medical diagnosis and the improvement of clinical care.
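For reference, the two fit measures quoted above can be computed as in the sketch below. It uses the usual definitions of the coefficient of determination and the root mean square error. The FEV1 values are made up for illustration and are not taken from the study.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between measured and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured and predicted FEV1 values (litres).
measured = [2.0, 3.0, 4.0, 5.0]
predicted = [2.1, 2.9, 4.2, 4.8]
fit = r_squared(measured, predicted)
error = rmse(measured, predicted)
```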
Abstract: Missing values in data are common in real-world applications. Since the performance of many data mining algorithms depends critically on being given a good metric over the input space, we define in this paper a distance function for unlabeled datasets with missing values. We use the Bhattacharyya distance, which measures the similarity of two probability distributions, to define our new distance function. According to this distance, the distance between two points without missing attribute values is simply the Mahalanobis distance. When, on the other hand, one of the coordinates has a missing value, the distance is computed according to the distribution of the missing coordinate. Our distance is general and can be used as part of any algorithm that computes the distance between data points. Because its performance depends strongly on the chosen distance measure, we opted for the k nearest neighbor (kNN) classifier to evaluate the ability of our distance to accurately reflect object similarity. We experimented on standard numerical datasets from different fields in the UCI repository. On these datasets we simulated missing values and compared the performance of the kNN classifier using our distance to that of three other basic methods. Our experiments show that kNN using our distance function outperforms kNN using the other methods. Moreover, the runtime of our method is only slightly higher than that of the other methods.
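As background, the Bhattacharyya distance has a closed form when both distributions are univariate normal. The sketch below implements that standard formula. Whether the paper models coordinates as normal is not stated in the abstract, so the Gaussian choice here is an assumption, and the function name is hypothetical.

```python
import math

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    """Bhattacharyya distance between N(mu1, var1) and N(mu2, var2)."""
    # Term driven by the separation of the means.
    mean_term = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    # Term driven by the mismatch of the variances (zero when they are equal).
    spread_term = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + spread_term

same = bhattacharyya_gaussian(0.0, 1.0, 0.0, 1.0)
shifted = bhattacharyya_gaussian(0.0, 1.0, 2.0, 1.0)
```

Identical distributions are at distance zero, and the distance grows both when the means move apart and when the variances differ.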
Abstract: In this paper, we investigate how the characteristics of a
clinical dataset affect feature selection and classification
in the presence of missing values, and we
propose appropriate techniques for this task.
The research aims to find the features that have a strong effect on mortality
and the mortality time frame. We quantify the complexity of a clinical
dataset. According to that complexity, we propose a
data mining process to cope with it, namely missing values, high
dimensionality, and the prediction problem, by using methods for
missing value replacement, feature selection, and classification. The
experimental results will be extended to develop a prediction model for
cardiology.
Abstract: The occurrence of missing values in databases is a serious problem for data mining tasks, degrading data quality and the accuracy of analyses. The area lacks standardized experiments for treating missing values, which makes it difficult to evaluate and compare different studies because they do not use common parameters. This paper proposes a testbed intended to facilitate the implementation of experiments and to provide unbiased parameters, using available datasets and suitable performance metrics, in order to optimize the evaluation and comparison of state-of-the-art missing value treatments.
Abstract: Missing data is a persistent problem in almost all
areas of empirical research. The missing data must be treated very
carefully, as data plays a fundamental role in every analysis.
Improper treatment can distort the analysis or generate biased results.
In this paper, we compare and contrast various imputation techniques
on missing data sets and make an empirical evaluation of these
methods so as to construct quality software models. Our empirical
study is based on two of NASA's public datasets, KC4 and KC1. The
actual data sets of 125 cases and 2107 cases, respectively, without
any missing values were considered. The data sets were used to create
Missing at Random (MAR) data. Listwise Deletion (LD), Mean
Substitution (MS), Interpolation, Regression with an error term, and
Expectation-Maximization (EM) approaches were used to compare
the effects of the various techniques.
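Two of the simpler techniques named above, listwise deletion and mean substitution, can be sketched in a few lines. The example is illustrative only: the rows are made up and are not drawn from KC1 or KC4, and the function names are hypothetical.

```python
from statistics import mean, pvariance

def listwise_deletion(rows):
    """Drop every row that contains at least one missing value (None)."""
    return [r for r in rows if None not in r]

def mean_substitution(rows):
    """Replace each missing value with the mean of its column's observed values."""
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

# Toy data (made up): the second column has two missing entries.
rows = [[10, 1.0], [20, None], [30, 3.0], [40, None], [50, 5.0]]
kept = listwise_deletion(rows)
filled = mean_substitution(rows)

# Variance of the second column, observed cases only vs. after mean substitution.
var_observed = pvariance([r[1] for r in kept])
var_filled = pvariance([r[1] for r in filled])
```

Listwise deletion discards cases, while mean substitution keeps them but shrinks the variance of the filled column, which is one reason the techniques can lead to different conclusions.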
Abstract: The world economic crises and budget constraints
have caused authorities, especially those in developing countries, to
rationalize water quality monitoring activities. Rationalization
consists of reducing the number of monitoring sites, the number of
samples, and/or the number of water quality variables measured. The
reduction in water quality variables is usually based on correlation. If
two variables exhibit high correlation, it is an indication that some of
the information produced may be redundant. Consequently, one
variable can be discontinued, and the other continues to be measured.
Later, the ordinary least squares (OLS) regression technique is
employed to reconstitute information about the discontinued variable by
using the continuously measured one as an explanatory variable. In
this paper, two record extension techniques are employed to
reconstitute information about discontinued water quality variables,
the OLS and the Line of Organic Correlation (LOC). An empirical
experiment is conducted using water quality records from the Nile
Delta water quality monitoring network in Egypt. The record
extension techniques are compared for their ability to predict
different statistical parameters of the discontinued variables. Results
show that the OLS is better at estimating individual water quality
records. However, results indicate an underestimation of the variance
in the extended records. The LOC technique is superior in preserving
characteristics of the entire distribution and avoids underestimation
of the variance. It is concluded from this study that the OLS can be
used for the substitution of missing values, while LOC is preferable
for inferring statements about the probability distribution.
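The variance behaviour described above follows from the two slopes. With sample standard deviations s_x and s_y and correlation r, the OLS slope is r·s_y/s_x, whereas the LOC slope is sign(r)·s_y/s_x, so values extended with LOC keep the variance of y while OLS shrinks it by a factor of r². A minimal sketch with made-up data and hypothetical function names:

```python
import math

def moments(xs, ys):
    """Means, sample standard deviations, and correlation of two series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    return mx, my, sx, sy, r

def ols_line(xs, ys):
    """Intercept and slope of the ordinary least squares line."""
    mx, my, sx, sy, r = moments(xs, ys)
    slope = r * sy / sx
    return my - slope * mx, slope

def loc_line(xs, ys):
    """Intercept and slope of the Line of Organic Correlation."""
    mx, my, sx, sy, r = moments(xs, ys)
    slope = math.copysign(sy / sx, r)
    return my - slope * mx, slope

# Toy concurrent records (made up) of a retained and a discontinued variable.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.5, 5.5, 8.5, 9.5]
ols_intercept, ols_slope = ols_line(xs, ys)
loc_intercept, loc_slope = loc_line(xs, ys)
```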
Abstract: The MATCH project [1] entails the development of an
automatic diagnosis system that aims to support the treatment of colon
cancer by discovering mutations that occur in tumour
suppressor genes (TSGs) and contribute to the development of
cancerous tumours. The system is based on a)
colon cancer clinical data and b) biological information that will be
derived by data mining techniques from genomic and proteomic
sources. The core mining module will consist of popular, well
tested hybrid feature extraction methods and new combined
algorithms designed especially for the project. Elements of rough
sets, evolutionary computing, cluster analysis, self-organizing maps,
and association rules will be used to discover the annotations
between genes and their influence on tumours [2]-[11].
The methods used to process the data have to address their high
complexity, potential inconsistency and problems of dealing with the
missing values. They must integrate all the useful information
necessary to solve the expert's question. For this purpose, the system
has to learn from data, or allow a domain specialist to specify
interactively, the part of the knowledge structure it needs to answer a
given query. The program should also take into account the
importance/rank of the particular parts of the data it analyses and adjust
the algorithms used accordingly.
Abstract: This paper applies Bayesian Networks to support
information extraction from unstructured, ungrammatical, and
incoherent data sources for semantic annotation. A tool has been
developed that combines ontologies, machine learning,
information extraction, and probabilistic reasoning techniques to
support the extraction process. Data acquisition is performed with the
aid of knowledge specified in the form of an ontology. Due to the
variable size of information available on different data sources, it is
often the case that the extracted data contains missing values for
certain variables of interest. It is desirable in such situations to
predict the missing values. The methodology, presented in this paper,
first learns a Bayesian network from the training data and then uses it
to predict missing data and to resolve conflicts. Experiments have
been conducted to analyze the performance of the presented
methodology. The results look promising, as the methodology
achieves a high degree of precision and recall for information
extraction and reasonably good accuracy for predicting missing
values.
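The learn-then-predict step can be illustrated with the smallest possible network, a single edge from one variable to another. The sketch below is not the tool described in the paper: it assumes two discrete variables, estimates the conditional probability table from complete records by counting, and fills a missing value with the most probable state. The field names, function names, and records are hypothetical.

```python
from collections import Counter, defaultdict

def learn_cpt(records, parent, child):
    """Estimate P(child | parent) from records where both fields are present."""
    counts = defaultdict(Counter)
    for rec in records:
        if rec.get(parent) is not None and rec.get(child) is not None:
            counts[rec[parent]][rec[child]] += 1
    return {p: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for p, cnt in counts.items()}

def predict_missing(record, cpt, parent, child):
    """Return the child's value, or its most probable value given the parent."""
    if record.get(child) is not None:
        return record[child]
    distribution = cpt[record[parent]]
    return max(distribution, key=distribution.get)

# Toy training records (made up) for a network with one edge: country -> currency.
training = [
    {"country": "US", "currency": "USD"},
    {"country": "US", "currency": "USD"},
    {"country": "US", "currency": "EUR"},
    {"country": "FR", "currency": "EUR"},
    {"country": "FR", "currency": "EUR"},
]
cpt = learn_cpt(training, "country", "currency")
guess = predict_missing({"country": "US", "currency": None}, cpt, "country", "currency")
```

The same table can be used to flag conflicts, for instance when an extracted value has a low probability given its parent.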