Abstract: The purpose of this paper is to develop models that predict student success. Such models could improve the allocation of students among colleges and optimize the newly introduced model of government subsidies for higher education. To collect data, an anonymous survey was carried out among final-year undergraduate students using random sampling. Several decision trees were created, of which the two most successful in predicting student success were chosen, based on two criteria: Grade Point Average (GPA) and the time a student needs to finish the undergraduate program (time-to-degree). Decision trees proved to be a good method for classifying student success, and they could be improved further by increasing the survey sample and developing specialized decision trees for each type of college. Methods of this kind have considerable potential for use in decision support systems.
Abstract: The main aim of this study is to identify the most
influential variables that cause defects on the items produced by a
casting company located in Turkey. To this end, one of the items
produced by the company with high defective percentage rates is
selected. Two approaches, regression analysis and decision trees, are
used to model the relationship between process parameters and
defect types. Although the logistic regression models failed, the decision tree
model gave meaningful results. Based on these results, it can be
claimed that the decision tree approach is a promising technique for
determining the most important process variables.
Abstract: A software metric is a measure of some property of a
piece of software or its specification. The aim of this paper is to
present an application of evolutionary decision trees in software
engineering to classify software modules according to whether they
have one or more reported defects. To this end, several metrics are used
to detect the class of modules with defects or without defects.
Abstract: Although backpropagation ANNs generally predict
better than decision trees do for pattern classification problems, they
are often regarded as black boxes, i.e., their predictions cannot be
explained as readily as those of decision trees. In many applications, it is
desirable to extract knowledge from trained ANNs for the users to
gain a better understanding of how the networks solve the problems.
A new rule extraction algorithm, called rule extraction from artificial
neural networks (REANN) is proposed and implemented to extract
symbolic rules from ANNs. A standard three-layer feedforward ANN
is the basis of the algorithm. A four-phase training algorithm is
proposed for backpropagation learning. Explicitness of the extracted
rules is supported by comparing them to the symbolic rules generated
by other methods. Extracted rules are comparable with other methods
in terms of number of rules, average number of conditions for a rule,
and predictive accuracy. Extensive experimental studies on several
benchmark classification problems, such as breast cancer, iris,
diabetes, and season classification problems, demonstrate the
effectiveness of the proposed approach with good generalization
ability.
Abstract: Ensemble learning algorithms such as AdaBoost and
Bagging have been in active research and shown improvements in
classification results for several benchmarking data sets with mainly
decision trees as their base classifiers. In this paper we apply
these meta-learning techniques to classifiers such as random
forests, neural networks and support vector machines. The data sets
are from MAGIC, a Cherenkov telescope experiment. The task is to
separate gamma signals from the overwhelmingly more numerous hadron and muon
signals, which makes this a rare-class classification problem. We compare
the individual classifiers with their ensemble counterparts and
discuss the results. WEKA, a machine learning toolkit, was
used to run the experiments.
Abstract: Chronic hepatitis B can evolve to cirrhosis and liver
cancer. Interferon is the only effective treatment, for carefully selected
patients, but it is very expensive. Some of the selection criteria are
based on liver biopsy, an invasive, costly and painful medical procedure.
Therefore, developing efficient non-invasive selection systems
could benefit patients and also save money. We investigated
the possibility of creating intelligent systems to assist the Interferon
therapeutic decision, mainly by predicting with acceptable accuracy
the results of the biopsy. We applied knowledge discovery to integrated
medical data: imaging, clinical, and laboratory data. The resulting
intelligent systems, tested on 500 patients with chronic hepatitis
B, based on C5.0 decision trees and boosting, predict with 100%
accuracy the results of the liver biopsy. Also, by integrating the other
patient selection criteria, they offer non-invasive support for the
correct Interferon therapeutic decision. To the best of our knowledge, these
decision systems outperformed all similar systems published in the
literature, and offer a realistic opportunity to replace liver biopsy in
this medical context.
Abstract: In this paper a new method is suggested for
distributed data mining using probability patterns. These patterns
use decision trees and decision graphs. The patterns are required to be
valid, novel, useful, and understandable. By considering a set of
functions, the system arrives at a good pattern or better objectives.
Using the suggested method, we are able to extract useful
information from massive, multi-relational databases.
Abstract: General requirements for knowledge representation in
the form of logic rules, applicable to design and control of industrial
processes, are formulated. Characteristic behavior of decision trees
(DTs) and rough sets theory (RST) in rules extraction from recorded
data is discussed and illustrated with simple examples. The
significance of the models' drawbacks was evaluated using
simulated and industrial data sets. It is concluded that performance of
DTs may be considerably poorer in several important aspects,
compared to RST, particularly when not only a characterization of a
problem is required, but also detailed and precise rules are needed,
according to actual, specific problems to be solved.
Abstract: Many supervised induction algorithms require discrete
data, even though real data often comes in both discrete
and continuous formats. Quality discretization of continuous
attributes is an important problem that has effects on speed,
accuracy and understandability of the induction models. Usually,
discretization and other types of statistical processes are applied
to subsets of the population as the entire population is practically
inaccessible. For this reason we argue that the discretization
performed on a sample of the population is only an estimate of
the entire population. Most existing discretization methods
partition the attribute range into two or more intervals using
a single cut point or a set of cut points. In this paper, we introduce a
technique that uses resampling (such as the bootstrap) to generate
a set of candidate discretization points, thus improving the
discretization quality by providing a better estimate for
the entire population. The goal of this paper is therefore to observe
whether the resampling technique can lead to better discretization
points, which opens up a new paradigm for the construction of
soft decision trees.
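The resampling idea above can be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: it assumes a single continuous attribute, an entropy-minimizing binary split as the per-sample discretizer, and hypothetical names (`best_cut_point`, `bootstrap_cut_points`) and toy data chosen only for the example.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_cut_point(values, labels):
    """Return the binary cut point minimizing weighted class entropy, or None."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut


def bootstrap_cut_points(values, labels, n_resamples=200, seed=0):
    """Collect one candidate cut point per bootstrap resample (assumed scheme)."""
    rng = random.Random(seed)
    n = len(values)
    cuts = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        cut = best_cut_point([values[i] for i in idx], [labels[i] for i in idx])
        if cut is not None:
            cuts.append(cut)
    return cuts


# Toy data, invented for illustration only.
values = [1.0, 1.5, 2.1, 2.4, 3.0, 3.3, 4.2, 4.8, 5.1, 6.0]
labels = ["a", "a", "a", "a", "b", "a", "b", "b", "b", "b"]
cuts = bootstrap_cut_points(values, labels)
print(len(cuts), min(cuts), max(cuts))
```

The spread of the collected cut points gives a rough picture of how uncertain a single-sample cut point is, which is one way such candidates could feed a soft (non-crisp) split.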
Abstract: Recommender systems are usually regarded as an
important marketing tool in e-commerce. They use important
information about users to facilitate accurate recommendation. The
information includes user context such as location, time and interest
for personalization of mobile users. We can easily collect information
about location and time because mobile devices communicate with the
base station of the service provider. However, information about user
interest can't be easily collected because user interest cannot be
captured automatically without the user's approval. User interest is
usually represented as a need. In this study, we classify needs into two
types according to prior research. This study investigates the
usefulness of data mining techniques for classifying user need type for
recommendation systems. We employ several data mining techniques
including artificial neural networks, decision trees, case-based
reasoning, and multivariate discriminant analysis. Experimental
results show that the CHAID algorithm outperforms the other models for
classifying user need type. This study performs the McNemar test to
examine the statistical significance of the differences in classification
results. The results of the McNemar test also show that CHAID performs
better than the other models with statistical significance.
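For readers unfamiliar with the McNemar test, the sketch below shows how it compares two classifiers on the same test cases. It is an illustration under assumptions, not the study's code: it uses the chi-square form with a continuity correction (clamped at zero), and the function name and toy predictions are hypothetical.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar chi-square test with continuity correction for two classifiers.

    Returns (statistic, p_value). Only the discordant cases matter:
    b = cases classifier A gets right and B gets wrong,
    c = cases classifier A gets wrong and B gets right.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree
    # Continuity-corrected statistic, clamped so that b == c gives 0.
    statistic = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Toy predictions, invented for illustration: A alone is right 10 times,
# B alone is right 2 times, and both are right on the remaining 8 cases.
y_true = [1] * 20
pred_a = [1] * 10 + [0] * 2 + [1] * 8
pred_b = [0] * 10 + [1] * 2 + [1] * 8
statistic, p_value = mcnemar_test(y_true, pred_a, pred_b)
print(statistic, p_value)
```

With small numbers of discordant cases, an exact binomial version of the test is usually preferred over the chi-square approximation.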
Abstract: This paper investigates the issue of building decision
trees from data with imprecise class values where imprecision is
encoded in the form of possibility distributions. The Information
Affinity similarity measure is introduced into the well-known gain
ratio criterion in order to assess the homogeneity of a set of
possibility distributions representing instances' classes belonging to
a given training partition. For the experimental study, we proposed an
information affinity based performance criterion which we have used
in order to show the performance of the approach on well-known
benchmarks.
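As background, the classical gain ratio criterion that the abstract builds on can be sketched as below. This is the standard crisp-label version only; it does not include the Information Affinity measure or possibility distributions, and the function names and toy data are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(attribute_values, labels):
    """Classical gain ratio of one categorical attribute with crisp class labels."""
    groups = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    total = len(labels)
    # Expected entropy of the class labels after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(attribute_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data, invented for illustration.
labels = ["a", "a", "b", "b"]
print(gain_ratio(["x", "x", "y", "y"], labels))  # attribute separates the classes
print(gain_ratio(["x", "y", "x", "y"], labels))  # attribute carries no class information
```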
Abstract: Data Mining aims at discovering knowledge out of
data and presenting it in a form that is easily comprehensible to
humans. One of the useful applications in Egypt is cancer
management, especially the management of Acute Lymphoblastic
Leukemia or ALL, which is the most common type of cancer in
children.
This paper discusses the process of designing a prototype that can
help in the management of childhood ALL, which has a great
significance in the health care field. Besides, it has a social impact
on decreasing the rate of infection in children in Egypt. It also
provides valuable information about the distribution and
segmentation of ALL in Egypt, which may be linked to the possible
risk factors.
Undirected Knowledge Discovery is used since, in the case of this
research project, there is no target field as the data provided is
mainly subjective. This is done in order to quantify the subjective
variables. Therefore, the computer will be asked to identify
significant patterns in the provided medical data about ALL. This
may be achieved through collecting the data necessary for the
system, determining the data mining technique to be used for the
system, and choosing the most suitable implementation tool for the
domain.
The research makes use of a data mining tool, Clementine, to
apply the Decision Trees technique. We feed it with data extracted from
real-life cases taken from specialized Cancer Institutes. Relevant
medical case details such as patient medical history and diagnosis
are analyzed, classified, and clustered in order to improve the disease
management.
Abstract: Leo Breiman's Random Forests (RF) is a recent
development in tree-based classifiers and has quickly proven to be one of
the most important algorithms in the machine learning literature. It
has shown robust and improved results of classifications on standard
data sets. Ensemble learning algorithms such as AdaBoost and
Bagging have been in active research and shown improvements in
classification results for several benchmarking data sets with mainly
decision trees as their base classifiers. In this paper we apply
these meta-learning techniques to random forests. We
examine how ensembles of random forests perform on the
standard data sets available in the UCI repository. We compare the
original random forest algorithm with its ensemble counterparts
and discuss the results.
Abstract: The belief decision tree (BDT) approach is a decision
tree in an uncertain environment where the uncertainty is represented
through the Transferable Belief Model (TBM), one interpretation
of the belief function theory. The uncertainty can appear either in
the actual class of training objects or attribute values of objects to
classify. In this paper, we develop a post-pruning method of belief
decision trees in order to reduce size and improve classification
accuracy on unseen cases. Decision tree pruning has received
considerable attention in machine learning.
Abstract: Random Forests are a powerful classification technique, consisting of a collection of decision trees. One useful feature of Random Forests is the ability to determine the importance of each variable in predicting the outcome. This is done by permuting each variable and computing the change in prediction accuracy before and after the permutation. This variable importance calculation is similar to a one-factor-at-a-time experiment and is therefore inefficient. In this paper, we use a regular fractional factorial design to determine which variables to permute. Based on the results of the trials in the experiment, we calculate the individual importance of the variables, with improved precision over the standard method. The method is illustrated with a study of student attrition at Monash University.
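The standard one-variable-at-a-time permutation importance described above can be sketched as follows. This is only the baseline method, not the fractional factorial design the paper proposes. The model is a hypothetical stand-in for a trained forest, the data are synthetic, and a real Random Forest would typically compute the accuracy change on out-of-bag samples.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each column is permuted on its own."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)  # break the link between this column and the labels
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical stand-in for a trained classifier: uses only feature 0."""
    return int(row[0] > 0.5)


# Synthetic data: two uniform features, labels depend on feature 0 only.
data_rng = random.Random(1)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
labels = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

Each variable costs its own set of permutation runs here, which is the inefficiency the abstract attributes to the one-factor-at-a-time approach.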
Abstract: The aim of this paper is to identify the most suitable
model for churn prediction based on three different techniques. The
paper identifies the variables that affect churn with reference to
customer complaints data and provides a comparative analysis of
neural networks, regression trees and regression in their ability
to predict customer churn.
Abstract: Recently, the issue of machine condition monitoring
and fault diagnosis as a part of maintenance system became global
due to the potential advantages to be gained from reduced
maintenance costs, improved productivity and increased machine
availability. The aim of this work is to investigate the effectiveness
of a new fault diagnosis method based on power spectral density
(PSD) of vibration signals in combination with decision trees and
fuzzy inference system (FIS). To this end, a series of studies was
conducted on an external gear hydraulic pump. After a test under
normal condition, a number of different machine defect conditions
were introduced for three working levels of pump speed (1000, 1500,
and 2000 rpm), corresponding to (i) Journal-bearing with inner face
wear (BIFW), (ii) Gear with tooth face wear (GTFW), and (iii)
Journal-bearing with inner face wear plus Gear with tooth face wear
(B&GW). Features of the PSD values of the vibration signals were
extracted using descriptive statistical parameters. The J48 algorithm was
used as a feature selection procedure to select pertinent features from
the data set. The output of the J48 algorithm was employed to produce the
crisp if-then rules and membership function sets. The structure of the FIS
classifier was then defined based on the crisp sets. In order to
evaluate the proposed PSD-J48-FIS model, the data sets obtained
from vibration signals of the pump were used. Results showed that
the total classification accuracies for the 1000, 1500, and 2000 rpm
conditions were 96.42%, 100%, and 96.42%, respectively. The results
indicate that the combined PSD-J48-FIS model has the potential for
fault diagnosis of hydraulic pumps.
Abstract: It is well known that Logistic Regression is the gold
standard method for predicting clinical outcome, especially
predicting risk of mortality. In this paper, the Decision Tree method
has been proposed to solve specific problems that commonly use
Logistic Regression as a solution. The Biochemistry and
Haematology Outcome Model (BHOM) dataset obtained from
Portsmouth NHS Hospital from 1 January to 31 December 2001 was
divided into four subsets. One subset of training data was used to
generate a model, and the model obtained was then applied to three
testing datasets. The performance of each model from both methods
was then compared using calibration (the χ2 test or chi-test) and
discrimination (area under ROC curve or c-index). The experiments
showed that both methods give reasonable results in terms of the
c-index. However, in some cases the calibration value (χ2) was
quite high. After conducting experiments and investigating
the advantages and disadvantages of each method, we can conclude
that Decision Trees can be seen as a worthy alternative to Logistic
Regression in the area of Data Mining.
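The two evaluation measures mentioned above can be illustrated with a short sketch. The abstract does not specify the exact χ2 calibration test, so the Hosmer-Lemeshow-style grouping below is an assumption, as are the function names and the toy inputs. The c-index is computed directly from its definition as the probability that a positive case is scored above a negative one.

```python
def c_index(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_chi2(y_true, scores, n_groups=10):
    """Hosmer-Lemeshow-style statistic over risk-sorted groups (assumed form).

    Scores are predicted probabilities of the positive outcome.
    """
    pairs = sorted(zip(scores, y_true))
    n = len(pairs)
    chi2 = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        expected = sum(s for s, _ in group)  # predicted number of events
        observed = sum(y for _, y in group)  # actual number of events
        variance = expected * (1 - expected / len(group))
        if variance > 0:
            chi2 += (observed - expected) ** 2 / variance
    return chi2


# Toy inputs, invented for illustration.
print(c_index([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(calibration_chi2([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5], n_groups=1))
```

A model can discriminate well (high c-index) and still be poorly calibrated (large χ2), which is why the two measures are reported separately.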