Abstract: Human beings have the ability to make logical
decisions. Although human decision - making is often optimal, it is
insufficient when huge amount of data is to be classified. Medical
dataset is a vital ingredient used in predicting patient’s health
condition. In other to have the best prediction, there calls for most
suitable machine learning algorithms. This work compared the
performance of Artificial Neural Network (ANN) and Decision Tree
Algorithms (DTA) as regards to some performance metrics using
diabetes data. WEKA software was used for the implementation of
the algorithms. Multilayer Perceptron (MLP) and Radial Basis
Function (RBF) were the two algorithms used for ANN, while
RegTree and LADTree algorithms were the DTA models used. From
the results obtained, DTA performed better than ANN. The Root
Mean Squared Error (RMSE) of MLP is 0.3913 that of RBF is
0.3625, that of RepTree is 0.3174 and that of LADTree is 0.3206
respectively.
Abstract: Today, there is a large number of political transcripts
available on the Web to be mined and used for statistical analysis,
and product recommendations. As the online political resources are
used for various purposes, automatically determining the political
orientation on these transcripts becomes crucial. The methodologies
used by machine learning algorithms to do an automatic classification
are based on different features that are classified under categories
such as Linguistic, Personality etc. Considering the ideological
differences between Liberals and Conservatives, in this paper, the
effect of Personality traits on political orientation classification is
studied. The experiments in this study were based on the correlation
between LIWC features and the BIG Five Personality traits. Several
experiments were conducted using Convote U.S. Congressional-
Speech dataset with seven benchmark classification algorithms. The
different methodologies were applied on several LIWC feature sets
that constituted by 8 to 64 varying number of features that are
correlated to five personality traits. As results of experiments,
Neuroticism trait was obtained to be the most differentiating
personality trait for classification of political orientation. At the same
time, it was observed that the personality trait based classification
methodology gives better and comparable results with the related
work.
Abstract: In the past few years, the amount of malicious software
increased exponentially and, therefore, machine learning algorithms
became instrumental in identifying clean and malware files through
(semi)-automated classification. When working with very large
datasets, the major challenge is to reach both a very high malware
detection rate and a very low false positive rate. Another challenge
is to minimize the time needed for the machine learning algorithm to
do so. This paper presents a comparative study between different
machine learning techniques such as linear classifiers, ensembles,
decision trees or various hybrids thereof. The training dataset consists
of approximately 2 million clean files and 200.000 infected files,
which is a realistic quantitative mixture. The paper investigates the
above mentioned methods with respect to both their performance
(detection rate and false positive rate) and their practicability.
Abstract: Customer churn prediction is one of the most useful
areas of study in customer analytics. Due to the enormous amount
of data available for such predictions, machine learning and data
mining have been heavily used in this domain. There exist many
machine learning algorithms directly applicable for the problem of
customer churn prediction, and here, we attempt to experiment on
a novel approach by using a cognitive learning based technique in
an attempt to improve the results obtained by using a combination
of supervised learning methods, with cognitive unsupervised learning
methods.
Abstract: A Distributed Denial of Service (DDoS) attack is a
major threat to cyber security. It originates from the network layer or
the application layer of compromised/attacker systems which are
connected to the network. The impact of this attack ranges from the
simple inconvenience to use a particular service to causing major
failures at the targeted server. When there is heavy traffic flow to a
target server, it is necessary to classify the legitimate access and
attacks. In this paper, a novel method is proposed to detect DDoS
attacks from the traces of traffic flow. An access matrix is created
from the traces. As the access matrix is multi dimensional, Principle
Component Analysis (PCA) is used to reduce the attributes used for
detection. Two classifiers Naive Bayes and K-Nearest neighborhood
are used to classify the traffic as normal or abnormal. The
performance of the classifier with PCA selected attributes and actual
attributes of access matrix is compared by the detection rate and
False Positive Rate (FPR).
Abstract: Red blood cells (RBC) are the most common types of
blood cells and are the most intensively studied in cell biology. The
lack of RBCs is a condition in which the amount of hemoglobin level
is lower than normal and is referred to as “anemia”. Abnormalities in
RBCs will affect the exchange of oxygen. This paper presents a
comparative study for various techniques for classifying the RBCs as
normal or abnormal (anemic) using WEKA. WEKA is an open
source consists of different machine learning algorithms for data
mining applications. The algorithms tested are Radial Basis Function
neural network, Support vector machine, and K-Nearest Neighbors
algorithm. Two sets of combined features were utilized for
classification of blood cells images. The first set, exclusively consist
of geometrical features, was used to identify whether the tested blood
cell has a spherical shape or non-spherical cells. While the second
set, consist mainly of textural features was used to recognize the
types of the spherical cells. We have provided an evaluation based on
applying these classification methods to our RBCs image dataset
which were obtained from Serdang Hospital - Malaysia, and
measuring the accuracy of test results. The best achieved
classification rates are 97%, 98%, and 79% for Support vector
machines, Radial Basis Function neural network, and K-Nearest
Neighbors algorithm respectively.
Abstract: Feature selection has recently been the subject of intensive research in data mining, specially for datasets with a large number of attributes. Recent work has shown that feature selection can have a positive effect on the performance of machine learning algorithms. The success of many learning algorithms in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation. In this paper, a novel feature search procedure that utilizes the Ant Colony Optimization (ACO) is presented. The ACO is a metaheuristic inspired by the behavior of real ants in their search for the shortest paths to food sources. It looks for optimal solutions by considering both local heuristics and previous knowledge. When applied to two different classification problems, the proposed algorithm achieved very promising results.
Abstract: This paper summarizes the results of some experiments for finding the effective features for disambiguation of Turkish verbs. Word sense disambiguation is a current area of investigation in which verbs have the dominant role. Generally verbs have more senses than the other types of words in the average and detecting these features for verbs may lead to some improvements for other word types. In this paper we have considered only the syntactical features that can be obtained from the corpus and tested by using some famous machine learning algorithms.
Abstract: Data mining is the process of sifting through large
volumes of data, analyzing data from different perspectives and
summarizing it into useful information. One of the widely used
desktop applications for data mining is the Weka tool which is
nothing but a collection of machine learning algorithms implemented
in Java and open sourced under the General Public License (GPL). A
web service is a software system designed to support interoperable
machine to machine interaction over a network using SOAP
messages. Unlike a desktop application, a web service is easy to
upgrade, deliver and access and does not occupy any memory on the
system. Keeping in mind the advantages of a web service over a
desktop application, in this paper we are demonstrating how this Java
based desktop data mining application can be implemented as a web
service to support data mining across the internet.
Abstract: Support vector machines (SVMs) are considered to be
the best machine learning algorithms for minimizing the predictive
probability of misclassification. However, their drawback is that for
large data sets the computation of the optimal decision boundary is a
time consuming function of the size of the training set. Hence several
methods have been proposed to speed up the SVM algorithm. Here
three methods used to speed up the computation of the SVM
classifiers are compared experimentally using a musical genre
classification problem. The simplest method pre-selects a random
sample of the data before the application of the SVM algorithm. Two
additional methods use proximity graphs to pre-select data that are
near the decision boundary. One uses k-Nearest Neighbor graphs and
the other Relative Neighborhood Graphs to accomplish the task.
Abstract: As the majority of faults are found in a few of its modules so there is a need to investigate the modules that are affected severely as compared to other modules and proper maintenance need to be done on time especially for the critical applications. In this paper, we have explored the different predictor models to NASA-s public domain defect dataset coded in Perl programming language. Different machine learning algorithms belonging to the different learner categories of the WEKA project including Mamdani Based Fuzzy Inference System and Neuro-fuzzy based system have been evaluated for the modeling of maintenance severity or impact of fault severity. The results are recorded in terms of Accuracy, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). The results show that Neuro-fuzzy based model provides relatively better prediction accuracy as compared to other models and hence, can be used for the maintenance severity prediction of the software.