Abstract: In the automotive industry test drives are being conducted
during the development of new vehicle models or as a part of
quality assurance of series-production vehicles. The communication
on the in-vehicle network, data from external sensors, or internal
data from the electronic control units is recorded by automotive
data loggers during the test drives. The recordings are used for fault
analysis. Since the resulting data volume is tremendous, manually
analysing each recording in great detail is not feasible.
This paper proposes to use machine learning to support domainexperts
by preventing them from contemplating irrelevant data and
rather pointing them to the relevant parts in the recordings. The
underlying idea is to learn the normal behaviour from available
recordings, i.e. a training set, and then to autonomously detect
unexpected deviations and report them as anomalies.
The one-class support vector machine “support vector data description”
is utilised to calculate distances of feature vectors. SVDDSUBSEQ
is proposed as a novel approach, allowing to classify subsequences
in multivariate time series data. The approach allows to
detect unexpected faults without modelling effort as is shown with
experimental results on recordings from test drives.
Abstract: Text categorization (the assignment of texts in natural language into predefined categories) is an important and extensively studied problem in Machine Learning. Currently, popular techniques developed to deal with this task include many preprocessing and learning algorithms, many of which in turn require tuning nontrivial internal parameters. Although partial studies are available, many authors fail to report values of the parameters they use in their experiments, or reasons why these values were used instead of others. The goal of this work then is to create a more thorough comparison of preprocessing parameters and their mutual influence, and report interesting observations and results.
Abstract: Data mining is the process of sifting through large
volumes of data, analyzing data from different perspectives and
summarizing it into useful information. One of the widely used
desktop applications for data mining is the Weka tool which is
nothing but a collection of machine learning algorithms implemented
in Java and open sourced under the General Public License (GPL). A
web service is a software system designed to support interoperable
machine to machine interaction over a network using SOAP
messages. Unlike a desktop application, a web service is easy to
upgrade, deliver and access and does not occupy any memory on the
system. Keeping in mind the advantages of a web service over a
desktop application, in this paper we are demonstrating how this Java
based desktop data mining application can be implemented as a web
service to support data mining across the internet.
Abstract: The problem of ranking (rank regression) has become popular in the machine learning community. This theory relates to problems, in which one has to predict (guess) the order between objects on the basis of vectors describing their observed features. In many ranking algorithms a convex loss function is used instead of the 0-1 loss. It makes these procedures computationally efficient. Hence, convex risk minimizers and their statistical properties are investigated in this paper. Fast rates of convergence are obtained under conditions, that look similarly to the ones from the classification theory. Methods used in this paper come from the theory of U-processes as well as empirical processes.
Abstract: A cognitive collaborative reinforcement learning
algorithm (CCRL) that incorporates an advisor into the learning
process is developed to improve supervised learning. An autonomous
learner is enabled with a self awareness cognitive skill to decide
when to solicit instructions from the advisor. The learner can also
assess the value of advice, and accept or reject it. The method is
evaluated for robotic motion planning using simulation. Tests are
conducted for advisors with skill levels from expert to novice. The
CCRL algorithm and a combined method integrating its logic with
Clouse-s Introspection Approach, outperformed a base-line fully
autonomous learner, and demonstrated robust performance when
dealing with various advisor skill levels, learning to accept advice
received from an expert, while rejecting that of less skilled
collaborators. Although the CCRL algorithm is based on RL, it fits
other machine learning methods, since advisor-s actions are only
added to the outer layer.
Abstract: Support vector machines (SVMs) are considered to be
the best machine learning algorithms for minimizing the predictive
probability of misclassification. However, their drawback is that for
large data sets the computation of the optimal decision boundary is a
time consuming function of the size of the training set. Hence several
methods have been proposed to speed up the SVM algorithm. Here
three methods used to speed up the computation of the SVM
classifiers are compared experimentally using a musical genre
classification problem. The simplest method pre-selects a random
sample of the data before the application of the SVM algorithm. Two
additional methods use proximity graphs to pre-select data that are
near the decision boundary. One uses k-Nearest Neighbor graphs and
the other Relative Neighborhood Graphs to accomplish the task.
Abstract: This paper presents the methodology from machine
learning approaches for short-term rain forecasting system. Decision
Tree, Artificial Neural Network (ANN), and Support Vector Machine
(SVM) were applied to develop classification and prediction models
for rainfall forecasts. The goals of this presentation are to
demonstrate (1) how feature selection can be used to identify the
relationships between rainfall occurrences and other weather
conditions and (2) what models can be developed and deployed for
predicting the accurate rainfall estimates to support the decisions to
launch the cloud seeding operations in the northeastern part of
Thailand. Datasets collected during 2004-2006 from the
Chalermprakiat Royal Rain Making Research Center at Hua Hin,
Prachuap Khiri khan, the Chalermprakiat Royal Rain Making
Research Center at Pimai, Nakhon Ratchasima and Thai
Meteorological Department (TMD). A total of 179 records with 57
features was merged and matched by unique date. There are three
main parts in this work. Firstly, a decision tree induction algorithm
(C4.5) was used to classify the rain status into either rain or no-rain.
The overall accuracy of classification tree achieves 94.41% with the
five-fold cross validation. The C4.5 algorithm was also used to
classify the rain amount into three classes as no-rain (0-0.1 mm.),
few-rain (0.1- 10 mm.), and moderate-rain (>10 mm.) and the overall
accuracy of classification tree achieves 62.57%. Secondly, an ANN
was applied to predict the rainfall amount and the root mean square
error (RMSE) were used to measure the training and testing errors of
the ANN. It is found that the ANN yields a lower RMSE at 0.171 for
daily rainfall estimates, when compared to next-day and next-2-day
estimation. Thirdly, the ANN and SVM techniques were also used to
classify the rain amount into three classes as no-rain, few-rain, and
moderate-rain as above. The results achieved in 68.15% and 69.10%
of overall accuracy of same-day prediction for the ANN and SVM
models, respectively. The obtained results illustrated the comparison
of the predictive power of different methods for rainfall estimation.
Abstract: In this paper a combined feature selection method is
proposed which takes advantages of sample domain filtering,
resampling and feature subset evaluation methods to reduce
dimensions of huge datasets and select reliable features. This method
utilizes both feature space and sample domain to improve the process
of feature selection and uses a combination of Chi squared with
Consistency attribute evaluation methods to seek reliable features.
This method consists of two phases. The first phase filters and
resamples the sample domain and the second phase adopts a hybrid
procedure to find the optimal feature space by applying Chi squared,
Consistency subset evaluation methods and genetic search.
Experiments on various sized datasets from UCI Repository of
Machine Learning databases show that the performance of five
classifiers (Naïve Bayes, Logistic, Multilayer Perceptron, Best First
Decision Tree and JRIP) improves simultaneously and the
classification error for these classifiers decreases considerably. The
experiments also show that this method outperforms other feature
selection methods.
Abstract: Society has grown to rely on Internet services, and the
number of Internet users increases every day. As more and more
users become connected to the network, the window of opportunity
for malicious users to do their damage becomes very great and
lucrative. The objective of this paper is to incorporate different
techniques into classier system to detect and classify intrusion from
normal network packet. Among several techniques, Steady State
Genetic-based Machine Leaning Algorithm (SSGBML) will be used
to detect intrusions. Where Steady State Genetic Algorithm (SSGA),
Simple Genetic Algorithm (SGA), Modified Genetic Algorithm and
Zeroth Level Classifier system are investigated in this research.
SSGA is used as a discovery mechanism instead of SGA. SGA
replaces all old rules with new produced rule preventing old good
rules from participating in the next rule generation. Zeroth Level
Classifier System is used to play the role of detector by matching
incoming environment message with classifiers to determine whether
the current message is normal or intrusion and receiving feedback
from environment. Finally, in order to attain the best results,
Modified SSGA will enhance our discovery engine by using Fuzzy
Logic to optimize crossover and mutation probability. The
experiments and evaluations of the proposed method were performed
with the KDD 99 intrusion detection dataset.
Abstract: Traffic Management and Information Systems, which rely on a system of sensors, aim to describe in real-time traffic in urban areas using a set of parameters and estimating them. Though the state of the art focuses on data analysis, little is done in the sense of prediction. In this paper, we describe a machine learning system for traffic flow management and control for a prediction of traffic flow problem. This new algorithm is obtained by combining Random Forests algorithm into Adaboost algorithm as a weak learner. We show that our algorithm performs relatively well on real data, and enables, according to the Traffic Flow Evaluation model, to estimate and predict whether there is congestion or not at a given time on road intersections.
Abstract: The aim of this paper is to present a methodology in
three steps to forecast supply chain demand. In first step, various data
mining techniques are applied in order to prepare data for entering
into forecasting models. In second step, the modeling step, an
artificial neural network and support vector machine is presented
after defining Mean Absolute Percentage Error index for measuring
error. The structure of artificial neural network is selected based on
previous researchers' results and in this article the accuracy of
network is increased by using sensitivity analysis. The best forecast
for classical forecasting methods (Moving Average, Exponential
Smoothing, and Exponential Smoothing with Trend) is resulted based
on prepared data and this forecast is compared with result of support
vector machine and proposed artificial neural network. The results
show that artificial neural network can forecast more precisely in
comparison with other methods. Finally, forecasting methods'
stability is analyzed by using raw data and even the effectiveness of
clustering analysis is measured.
Abstract: Bagging and boosting are among the most popular resampling ensemble methods that generate and combine a diversity of classifiers using the same learning algorithm for the base-classifiers. Boosting algorithms are considered stronger than bagging on noisefree data. However, there are strong empirical indications that bagging is much more robust than boosting in noisy settings. For this reason, in this work we built an ensemble using a voting methodology of bagging and boosting ensembles with 10 subclassifiers in each one. We performed a comparison with simple bagging and boosting ensembles with 25 sub-classifiers, as well as other well known combining methods, on standard benchmark datasets and the proposed technique was the most accurate.
Abstract: Classification is one of the primary themes in
computational biology. The accuracy of classification strongly
depends on quality of a dataset, and we need some method to
evaluate this quality. In this paper, we propose a new graphical
analysis method using 'Membership-Deviation Graph (MDG)' for
analyzing quality of a dataset. MDG represents degree of
membership and deviations for instances of a class in the dataset. The
result of MDG analysis is used for understanding specific feature and
for selecting best feature for classification.
Abstract: Designing a simulated system and training it to optimize its tasks in simulated environment helps the designers to avoid problems that may appear when designing the system directly in real world. These problems are: time consuming, high cost, high errors percentage and low efficiency and accuracy of the system. The proposed system will investigate and improve the efficiency and accuracy of a simulated robot to choose correct behavior to perform its task. In this paper, machine learning, which uses genetic algorithm, is adopted. This type of machine learning is called genetic-based machine learning in which a distributed classifier system is used to improve the efficiency and accuracy of the robot. Consequently, it helps the robot to achieve optimal action.
Abstract: The major goal in defining and examining game
scenarios is to find good strategies as solutions to the game. A
plausible solution is a recommendation to the players on how to play
the game, which is represented as strategies guided by the various
choices available to the players. These choices invariably compel the
players (decision makers) to execute an action following some
conscious tactics. In this paper, we proposed a refinement-based
heuristic as a machine learning technique for human-like decision
making in playing Ayo game. The result showed that our machine
learning technique is more adaptable and more responsive in making
decision than human intelligence. The technique has the advantage
that a search is astutely conducted in a shallow horizon game tree.
Our simulation was tested against Awale shareware and an appealing
result was obtained.
Abstract: Leo Breimans Random Forests (RF) is a recent
development in tree based classifiers and quickly proven to be one of
the most important algorithms in the machine learning literature. It
has shown robust and improved results of classifications on standard
data sets. Ensemble learning algorithms such as AdaBoost and
Bagging have been in active research and shown improvements in
classification results for several benchmarking data sets with mainly
decision trees as their base classifiers. In this paper we experiment to
apply these Meta learning techniques to the random forests. We
experiment the working of the ensembles of random forests on the
standard data sets available in UCI data sets. We compare the
original random forest algorithm with their ensemble counterparts
and discuss the results.
Abstract: Understanding proteins functions is a major goal in
the post-genomic era. Proteins usually work in context of other
proteins and rarely function alone. Therefore, it is highly relevant to
study the interaction partners of a protein in order to understand its
function. Machine learning techniques have been widely applied to
predict protein-protein interactions. Kernel functions play an
important role for a successful machine learning technique. Choosing
the appropriate kernel function can lead to a better accuracy in a
binary classifier such as the support vector machines. In this paper,
we describe a Bayesian kernel for the support vector machine to
predict protein-protein interactions. The use of Bayesian kernel can
improve the classifier performance by incorporating the probability
characteristic of the available experimental protein-protein
interactions data that were compiled from different sources. In
addition, the probabilistic output from the Bayesian kernel can assist
biologists to conduct more research on the highly predicted
interactions. The results show that the accuracy of the classifier has
been improved using the Bayesian kernel compared to the standard
SVM kernels. These results imply that protein-protein interaction can
be predicted using Bayesian kernel with better accuracy compared to
the standard SVM kernels.
Abstract: Corporate credit rating prediction using statistical and
artificial intelligence (AI) techniques has been one of the attractive
research topics in the literature. In recent years, multiclass
classification models such as artificial neural network (ANN) or
multiclass support vector machine (MSVM) have become a very
appealing machine learning approaches due to their good
performance. However, most of them have only focused on classifying
samples into nominal categories, thus the unique characteristic of the
credit rating - ordinality - has been seldom considered in their
approaches. This study proposes new types of ANN and MSVM
classifiers, which are named OMANN and OMSVM respectively.
OMANN and OMSVM are designed to extend binary ANN or SVM
classifiers by applying ordinal pairwise partitioning (OPP) strategy.
These models can handle ordinal multiple classes efficiently and
effectively. To validate the usefulness of these two models, we applied
them to the real-world bond rating case. We compared the results of
our models to those of conventional approaches. The experimental
results showed that our proposed models improve classification
accuracy in comparison to typical multiclass classification techniques
with the reduced computation resource.
Abstract: As the majority of faults are found in a few of its modules so there is a need to investigate the modules that are affected severely as compared to other modules and proper maintenance need to be done on time especially for the critical applications. In this paper, we have explored the different predictor models to NASA-s public domain defect dataset coded in Perl programming language. Different machine learning algorithms belonging to the different learner categories of the WEKA project including Mamdani Based Fuzzy Inference System and Neuro-fuzzy based system have been evaluated for the modeling of maintenance severity or impact of fault severity. The results are recorded in terms of Accuracy, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). The results show that Neuro-fuzzy based model provides relatively better prediction accuracy as compared to other models and hence, can be used for the maintenance severity prediction of the software.
Abstract: The belief decision tree (BDT) approach is a decision
tree in an uncertain environment where the uncertainty is represented
through the Transferable Belief Model (TBM), one interpretation
of the belief function theory. The uncertainty can appear either in
the actual class of training objects or attribute values of objects to
classify. In this paper, we develop a post-pruning method of belief
decision trees in order to reduce size and improve classification
accuracy on unseen cases. The pruning of decision tree has a
considerable intention in the areas of machine learning.