Abstract: Stock selection is an important decision-making problem. Many machine learning and data mining technologies are employed to build automatic stock-selection systems. A profitable stock-selection system should consider both the stock’s investment value and the market timing. In this paper, we present a hybrid stock-selection system that addresses both. The system uses a case-based reasoning (CBR) model to perform stock classification and a decision-tree model to help with market timing and stock selection. Experiments show that this hybrid system outperforms other techniques with regard to classification accuracy, average return, and the Sharpe ratio.
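As a rough illustration of two of the evaluation measures named above, the sketch below computes the average return and a per-period Sharpe ratio using the textbook definition (mean excess return divided by its sample standard deviation). The function names, the zero default risk-free rate, and the sample returns are assumptions made for illustration; the abstract does not say how the authors computed these figures.

```python
from statistics import mean, stdev


def average_return(returns):
    """Arithmetic mean of a sequence of per-period returns."""
    return mean(returns)


def sharpe_ratio(returns, risk_free_rate=0.0):
    """Per-period Sharpe ratio: mean excess return over its sample std. dev.

    risk_free_rate is a per-period rate; the default of 0.0 is an assumption.
    """
    excess = [r - risk_free_rate for r in returns]
    spread = stdev(excess)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("Sharpe ratio is undefined for zero-variance returns")
    return mean(excess) / spread


# Hypothetical monthly returns of a selected portfolio.
monthly_returns = [0.01, 0.03, 0.02]
print(average_return(monthly_returns), sharpe_ratio(monthly_returns))
```

A higher Sharpe ratio means more return per unit of volatility, which is why it complements raw average return when comparing selection systems.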
Abstract: In this study, clients who applied to a bank branch for a loan were analyzed through data mining. The data cover personal and SME clients working with the bank branch between 2010 and 2013 and include the loan amounts received, the number of installments, the number of delays in loan installments, payments at other banks, and the number of banks to which the clients are in debt. The client risk profile was examined through Classification and Regression Tree (CART) analysis, one of the decision tree classification methods. At the end of the study, five different types of customers were identified on the decision tree. These customer types were rated according to the risk they pose to the bank branch, and the customers were classified by these risk ratings.
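CART classification trees are commonly grown by choosing, at each node, the split that minimizes the weighted Gini impurity of the child nodes. The sketch below shows that single step for one numeric attribute. The attribute (number of installment delays), the risk labels, and the function names are hypothetical; the abstract does not give the study's actual data or settings.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Find the split 'value <= threshold' with the lowest weighted Gini.

    Returns (threshold, weighted_gini); threshold is None when no split
    improves on the impurity of the unsplit node.
    """
    best = (None, gini(labels))
    for t in sorted(set(values))[:-1]:  # the largest value cannot split
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical clients: number of delayed installments and a risk label.
delays = [0, 1, 0, 4, 6, 5]
risk = ["low", "low", "low", "high", "high", "high"]
print(best_threshold(delays, risk))
```

A full tree repeats this search over every attribute and recurses into each child node until a stopping rule is met.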
Abstract: This research aims to create a mobile tool for analyzing rice disease quickly and easily. Object-oriented software engineering principles and the Objective-C language were used for software development, and the decision tree technique was used as the analysis method. Application users can select the features of a rice disease or the color that appears on the rice leaves, and the recognition results are shown on the iOS mobile screen. After the software was developed, unit testing and integration testing were used to check program validity. In addition, three plant experts and forty farmers assessed the usability and benefit of the system. Overall user satisfaction was found to be at a good level (57%). The plant experts suggested adding more disease symptoms to the database to make the analysis results more precise. For further research, it is suggested that an image processing system be developed so that users can search for and analyze rice diseases more conveniently and accurately.
Abstract: Classification is an important data mining technique
and can be used for data filtering in artificial intelligence. Because
classification applies broadly to all kinds of data, it is used in
nearly every field of modern life. Classification helps us to group
different items according to the features judged interesting and
useful. In this paper, we compare two classification methods, Naïve
Bayes and ADTree, used to detect spam e-mail. This choice is
motivated by the fact that the Naïve Bayes algorithm is based on
probability calculus, while the ADTree algorithm is based on decision
trees. The parameter settings of these classifiers maximize the true
positive rate and minimize the false positive rate. The experimental
results present classification accuracy and a cost analysis with a
view to choosing the optimal classifier for spam detection. They also
point out the number of attributes needed to obtain a tradeoff
between the number of attributes and the classification
accuracy.
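The true positive rate, false positive rate, and accuracy mentioned above can be derived from a confusion matrix. The following is a minimal sketch using the standard definitions; the label strings ("spam", "ham") and the function names are assumptions made for illustration and are not taken from the paper.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (true positive rate, false positive rate, accuracy)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of spam that is caught
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of ham wrongly flagged
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return tpr, fpr, accuracy


# Hypothetical predictions from a spam filter.
truth = ["spam", "spam", "ham", "ham", "ham"]
guess = ["spam", "ham", "ham", "spam", "ham"]
print(rates(*confusion_counts(truth, guess)))
```

In spam filtering a false positive (a legitimate message flagged as spam) is usually the costlier error, which is why the two rates are reported separately rather than folded into accuracy alone.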
Abstract: In recent years, there has been an explosion in the use of technologies that help discover diseases. For example, DNA microarrays allow us for the first time to obtain a "global" view of the cell. They have great potential to provide accurate medical diagnosis and to help in finding the right treatment and cure for many diseases. Various classification algorithms can be applied to such microarray datasets to devise methods that can predict the occurrence of leukemia. In this study, we compared the classification accuracy and response time of eleven decision tree methods and six rule classifier methods using five performance criteria. The experimental results show that, among the tree classifiers, Random Tree produces better results and takes the least time to build a model. Among the classification rule algorithms, the nearest-neighbor-like algorithm (NNge) is the best, owing to its high accuracy and the least time needed to build a model.
Abstract: This paper presents a classifier ensemble approach for
predicting the survivability of breast cancer patients using the
latest database version of the Surveillance, Epidemiology, and End
Results (SEER) Program of the National Cancer Institute. The system
consists of two main components: feature selection and classifier
ensemble. The feature selection component divides the features in
the SEER database into four groups. It then tries to find the most
important features among the four groups that maximize the
weighted average F-score of a certain classification algorithm. The
ensemble component uses three different classifiers, each of which
models a different set of features from SEER selected by the feature
selection module. On top of them, another classifier is used to give
the final decision based on the output decisions and confidence
scores from each of the underlying classifiers. Different classification
algorithms have been examined; the best setup found uses the
decision tree, Bayesian network, and Naïve Bayes algorithms for the
underlying classifiers and Naïve Bayes for the classifier ensemble
step. The system outperforms all published systems to date when
evaluated against the exact same data of SEER (period of 1973-2002).
It gives 87.39% weighted average F-score compared to 85.82% and
81.34% of the other published systems. By increasing the data size to
cover the whole database (period of 1973-2014), the overall weighted
average F-score jumps to 92.4% on the held out unseen test set.
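The weighted average F-score reported above is usually understood as the per-class F1 score averaged with weights proportional to each class's share of the true labels. The sketch below implements that common definition; whether the paper uses exactly this formula is an assumption, and the class labels in the example are hypothetical.

```python
from collections import Counter


def weighted_f_score(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        score += (count / total) * f1
    return score


# Hypothetical survivability labels.
truth = ["survived", "survived", "survived", "not survived"]
guess = ["survived", "survived", "not survived", "not survived"]
print(weighted_f_score(truth, guess))
```

Weighting by support keeps the majority class from being ignored while still letting performance on a rare class influence the overall score.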
Abstract: Cancer affects people globally, with breast cancer being a leading killer. Breast cancer is caused by the uncontrolled multiplication of cells, resulting in a tumour or neoplasm. Tumours are called ‘benign’ when their cells do not invade other body tissues and ‘malignant’ when they do. Because mammography is an effective tool for detecting breast cancer at an early stage, which is the most treatable stage, it is the primary imaging modality for screening and diagnosis of this cancer type. This paper presents an automatic mammogram classification technique using wavelet and Gabor filters. Correlation feature selection is used to reduce the feature set, and the selected features are classified using different decision trees.
Abstract: We assume an IoT-based smart-home environment where the on-off status of each electrical appliance, including the room lights, can be recognized in real time by monitoring and analyzing the smart meter data. At any moment in such an environment, we can recognize what the household or the user is doing by referring to the status data of the appliances. In this paper, we focus on a smart-home service that activates a robot vacuum cleaner at the right time by recognizing the user situation. This requires a situation-aware model that can distinguish the situations that allow vacuum cleaning (Yes) from those that do not (No). As candidate models, we train several classifiers, such as naïve Bayes, decision tree, and logistic regression, that map the appliance-status data to Yes and No situations. Our training and test data are obtained from simulations of user behavior, in which a sequence of user situations such as cooking, eating, and dish washing is generated, and the status of the relevant appliances changes in accordance with the situation changes. During the simulation, both the situation transitions and the resulting appliance statuses are determined stochastically. To compare the performance of these classifiers, we obtain their learning curves for different types of users through simulations. Our empirical study reveals that naïve Bayes achieves slightly better classification accuracy than the other classifiers.
Abstract: Texture is an important characteristic in real and
synthetic scenes. Texture analysis plays a critical role in inspecting
surfaces and provides important techniques in a variety of
applications. Although several descriptors have been presented to
extract texture features, the development of object recognition is still a
difficult task due to the complex aspects of texture. Recently, many
robust and scaling-invariant image features such as SIFT, SURF and
ORB have been successfully used in image retrieval and object
recognition. In this paper, we compare the performance of these
feature descriptors for texture classification with k-means
clustering. Different classifiers, including K-NN, Naive Bayes, Back
Propagation Neural Network, Decision Tree and Kstar, were applied to
three texture image sets: UIUCTex, KTH-TIPS and Brodatz.
Experimental results reveal that SIFT achieves the best average
accuracy rate on UIUCTex and KTH-TIPS, while SURF has the
advantage on the Brodatz texture set. The BP neural network works
best in test set classification among all the classifiers used.
Abstract: Patient-specific models are instance-based learning
algorithms that take advantage of the particular features of the patient
case at hand to predict an outcome. We introduce two patient-specific
algorithms based on the decision tree paradigm that use AUC as a
metric to select an attribute. We apply the patient-specific algorithms
to predict outcomes in several datasets, including medical datasets.
Compared to the patient-specific decision path (PSDP) entropy-based
and CART methods, the AUC-based patient-specific decision path
models performed equivalently on area under the ROC curve (AUC).
Our results provide support for patient-specific methods being a
promising approach for making clinical predictions.
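To make the idea of using AUC as an attribute-selection metric concrete, the sketch below scores each numeric attribute by the AUC it achieves when used alone as a ranking score, then picks the highest-scoring one. This is a generic illustration and not the authors' patient-specific decision path algorithm, whose details are not given in the abstract; the attribute names, the data, and the direction-agnostic scoring (max of AUC and 1 - AUC) are assumptions.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    labels are 1 (positive) or 0 (negative); ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_attribute_by_auc(rows, labels):
    """Pick the attribute whose values best separate the two classes.

    rows is a list of dicts mapping attribute name to a numeric value.
    An attribute that ranks negatives above positives is equally useful,
    so each attribute is scored by max(AUC, 1 - AUC).
    """
    best_name, best_score = None, -1.0
    for name in rows[0]:
        a = auc([row[name] for row in rows], labels)
        a = max(a, 1.0 - a)
        if a > best_score:
            best_name, best_score = name, a
    return best_name, best_score


# Hypothetical patient records and outcomes.
patients = [
    {"age": 30, "marker": 0.1},
    {"age": 60, "marker": 0.2},
    {"age": 40, "marker": 0.8},
    {"age": 50, "marker": 0.9},
]
outcomes = [0, 0, 1, 1]
print(best_attribute_by_auc(patients, outcomes))
```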
Abstract: This work is on decision tree-based classification for
the disbursement of scholarships. A tree-based data mining
classification technique is used in order to determine the generic
rules to be used to disburse the scholarship. Based on the rules
defined from the tree, the system is able to determine the class
(status) to which an applicant belongs: Granted or Not Granted.
Applicants who fall into the Granted class have successfully
obtained the scholarship, while those in the Not Granted class are
unsuccessful in the scheme. An algorithm that can be used to classify
the applicants based on the rules from tree-based classification was
also developed. Tree-based classification is adopted because it is
efficient, effective, and easy to comprehend. The
system was tested with data from the National Information Technology
Development Agency (NITDA), Abuja, a parastatal of the Federal
Ministry of Communication Technology that is mandated to develop
and regulate information technology in Nigeria. The system was
found to work according to the specification. It is therefore
recommended for all scholarship disbursement organizations.
Abstract: Feature selection has been used in many fields such as
classification, data mining and object recognition and proven to be
effective for removing irrelevant and redundant features from the
original dataset. In this paper, we present a new design of a
distributed intrusion detection system using a combined feature
selection model based on the bees algorithm and a decision tree. The
bees algorithm is used as the search strategy to find the optimal
subset of features, whereas the decision tree is used to judge the
selected features. Both the produced features and the generated
rules are used by the Decision Making Mobile Agent to decide whether
there is an attack on the network. The Decision Making Mobile Agent
migrates through the network, moving from one node to another; if it
finds that there is an attack on one of the nodes, it alerts the user
through the User Interface Agent or takes some action through the
Action Mobile Agent. The KDD Cup 99
dataset is used to test the effectiveness of the proposed system. The
results show that even if only four features are used, the proposed
system gives a better performance when it is compared with the
obtained results using all 41 features.
Abstract: Human beings have the ability to make logical
decisions. Although human decision-making is often optimal, it is
insufficient when a huge amount of data is to be classified. Medical
datasets are a vital ingredient in predicting a patient’s health
condition. In order to obtain the best prediction, the most suitable
machine learning algorithms are needed. This work compared the
performance of Artificial Neural Network (ANN) and Decision Tree
Algorithms (DTA) with regard to some performance metrics using
diabetes data. WEKA software was used for the implementation of
the algorithms. Multilayer Perceptron (MLP) and Radial Basis
Function (RBF) were the two algorithms used for ANN, while the
REPTree and LADTree algorithms were the DTA models used. From
the results obtained, DTA performed better than ANN. The Root
Mean Squared Error (RMSE) is 0.3913 for MLP, 0.3625 for RBF,
0.3174 for REPTree, and 0.3206 for LADTree.
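For reference, the RMSE used in this comparison is the square root of the mean squared difference between predicted and actual values. The sketch below gives the standard formula and then ranks the four RMSE figures quoted in the abstract. The function and variable names are illustrative, and the toy predictions are hypothetical.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical class-probability predictions against 0/1 outcomes.
print(rmse([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))

# RMSE values quoted in the abstract (REPTree is WEKA's reduced-error
# pruning tree); a lower value means a better fit.
reported = {"MLP": 0.3913, "RBF": 0.3625, "REPTree": 0.3174, "LADTree": 0.3206}
ranking = sorted(reported, key=reported.get)
print(ranking)
```

Sorting by the reported values places both decision tree models ahead of both neural network models, which matches the conclusion stated in the abstract.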
Abstract: With the evolution of technology, the expression of
opinions has shifted to the digital world. The
domain of politics, one of the hottest topics of opinion mining
research, has merged with behavior analysis for affiliation
determination in texts, which constitutes the subject of this paper.
This study aims to classify the text in news/blogs as either
Republican or Democrat with the minimum number of features. As
an initial set, 68 features, 64 of which were Linguistic
Inquiry and Word Count (LIWC) features, were tested against 14
benchmark classification algorithms. In later experiments, the
dimension of the feature vector was reduced using 7 feature
selection algorithms. The results show that the “Decision Tree”,
“Rule Induction” and “M5 Rule” classifiers, when used with the “SVM”
and “IGR” feature selection algorithms, performed best, with up to
82.5% accuracy on a given dataset. Further tests on a single feature
and on the linguistic-based feature sets showed similar results. The
feature “Function”, an aggregate feature of the linguistic category,
was found to be the most differentiating feature among the 68 features,
with an accuracy of 81% in classifying articles as either Republican
or Democrat.
Abstract: This study investigates the use of a time-series of
MODIS NDVI data to identify agricultural land cover change on an
annual time step (2007 - 2012) and characterize the trend. Following
an ISODATA classification of the MODIS imagery to selectively
mask areas that are not agricultural or semi-natural, NDVI signatures
were created to identify areas of cereals and vineyards with the aid
of ancillary, pictometry and field sample data for 2010. The NDVI
signature curve and training samples were used to create decision
tree models in WEKA 3.6.9 using the decision tree classifier (J48)
algorithm: Model 1 included the ISODATA classification and Model 2
did not. These two models were then used to classify all data for the
study area for 2010, producing land cover maps with classification
accuracies of 77% and 80% for Model 1 and 2 respectively. Model 2
was subsequently used to create land cover classification and change
detection maps for all other years. Subtle changes and areas of
consistency (unchanged) were observed in the agricultural classes
and crop practices over the years, as predicted by the land cover
classification. Forty-one percent of the catchment comprised
cereals, with 35% possibly following a crop rotation system.
Vineyards largely remained constant, with only one percent
conversion to vineyard from other land cover classes.
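The NDVI values behind such a time series follow the standard definition NDVI = (NIR - Red) / (NIR + Red), computed from near-infrared and red reflectance. The sketch below applies that formula to a short sequence of observations for one pixel. The reflectance values and function names are hypothetical, and the MODIS preprocessing used in the study is not reproduced here.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one observation."""
    total = nir + red
    if total == 0:
        return 0.0  # no signal in either band; treated as neutral here
    return (nir - red) / total


def ndvi_signature(observations):
    """NDVI time series for one pixel from (nir, red) reflectance pairs."""
    return [ndvi(nir, red) for nir, red in observations]


# Hypothetical reflectance pairs for one pixel across a growing season.
season = [(0.30, 0.20), (0.45, 0.12), (0.50, 0.10), (0.35, 0.18)]
print(ndvi_signature(season))
```

Dense green vegetation reflects strongly in the near-infrared band and absorbs red light, so the index rises during the growing season and falls after harvest, which is what gives each crop type a distinguishable signature curve.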
Abstract: In the past few years, the amount of malicious software
increased exponentially and, therefore, machine learning algorithms
became instrumental in identifying clean and malware files through
(semi)-automated classification. When working with very large
datasets, the major challenge is to reach both a very high malware
detection rate and a very low false positive rate. Another challenge
is to minimize the time needed for the machine learning algorithm to
do so. This paper presents a comparative study between different
machine learning techniques such as linear classifiers, ensembles,
decision trees or various hybrids thereof. The training dataset consists
of approximately 2 million clean files and 200,000 infected files,
which is a realistic quantitative mixture. The paper investigates the
above-mentioned methods with respect to both their performance
(detection rate and false positive rate) and their practicability.
Abstract: In this paper, we use data mining to extract
biomedical knowledge. In general, complex biomedical data
collected in population studies are treated by statistical methods;
although these methods are robust, they are not sufficient in
themselves to harness the potential wealth of the data. To that end,
we use two learning algorithms: Decision Trees and Support Vector
Machines (SVM). These supervised classification methods are used to
diagnose thyroid disease. In this context, we propose to promote
the study and use of symbolic data mining techniques.
Abstract: Existing data mining methods cannot be applied directly to
spatial data, because spatial data require spatial specificities, such
as spatial relationships, to be taken into account.
This paper focuses on classification with decision trees, which
are one of the data mining techniques. We propose an extension of
the C4.5 algorithm for spatial data, based on two different
approaches: join materialization and querying the different tables on
the fly. Similar work has been done on these two main approaches:
the first, join materialization, favors processing time at the
expense of memory space, whereas the second, querying the different
tables on the fly, favors memory space at the expense of processing time.
The modified C4.5 algorithm requires three input tables: a target
table, a neighbor table, and a spatial join index that contains the
possible spatial relationships among the objects in the target table and
those in the neighbor table. The proposed algorithms are then applied
to a spatial data pattern in the accidentology domain.
A comparative study of our approach with other work on
classification by spatial decision trees will be detailed.
Abstract: Steganalysis has become a challenging and attractive research interest with the development of information hiding techniques. It is the procedure of detecting hidden information in the stego object created by a known steganographic algorithm. In this paper, a novel feature-based image steganalysis technique is proposed. Various statistical moments are used along with a similarity metric. The proposed technique is based on transformation in four wavelet domains: Haar, Daubechies, Symlets and Biorthogonal. Each domain is subjected to various classifiers, namely K-nearest-neighbor, K* classifier, locally weighted learning, Naive Bayes classifier, neural networks, decision trees and support vector machines. The experiments are performed on a large set of pictures that are freely available in an image database. The system also predicts the different message length definitions.
Abstract: A brief review of the empirical studies on the methodology of stock market decision support indicates that they are at the threshold of validating the accuracy of traditional, fuzzy, artificial neural network and decision tree models. Many researchers have attempted to compare these models using various data sets worldwide. However, the research community has yet to reach conclusive confidence in the emerging models. This paper uses automotive sector stock prices from the National Stock Exchange (NSE), India, and analyzes them for intra-sectoral support for stock market decisions. The study identifies the significant variables and their lags that affect stock prices using OLS analysis and decision tree classifiers.