Abstract: This work deals with aspects of support vector machine learning for large-scale data mining tasks. Based on a decomposition algorithm for support vector machine training that can be run in serial as well as shared memory parallel mode we introduce a transformation of the training data that allows for the usage of an expensive generalized kernel without additional costs. We present experiments for the Gaussian kernel, but usage of other kernel functions is possible, too. In order to further speed up the decomposition algorithm we analyze the critical problem of working set selection for large training data sets. In addition, we analyze the influence of the working set sizes onto the scalability of the parallel decomposition scheme. Our tests and conclusions led to several modifications of the algorithm and the improvement of overall support vector machine learning performance. Our method allows for using extensive parameter search methods to optimize classification accuracy.
Abstract: As the use of registration packages spreads, the number of the aligned image pairs in image databases (either by manual or automatic methods) increases dramatically. These image pairs can serve as a set of training data. Correspondingly, the images that are to be registered serve as testing data. In this paper, a novel medical image registration method is proposed which is based on the a priori knowledge of the expected joint intensity distribution estimated from pre-aligned training images. The goal of the registration is to find the optimal transformation such that the distance between the observed joint intensity distribution obtained from the testing image pair and the expected joint intensity distribution obtained from the corresponding training image pair is minimized. The distance is measured using the divergence measure based on Tsallis entropy. Experimental results show that, compared with the widely-used Shannon mutual information as well as Tsallis mutual information, the proposed method is computationally more efficient without sacrificing registration accuracy.
Abstract: Despite the fact that Arabic language is currently one
of the most common languages worldwide, there has been only a
little research on Arabic speech recognition relative to other
languages such as English and Japanese. Generally, digital speech
processing and voice recognition algorithms are of special
importance for designing efficient, accurate, as well as fast automatic
speech recognition systems. However, the speech recognition process
carried out in this paper is divided into three stages as follows: firstly,
the signal is preprocessed to reduce noise effects. After that, the
signal is digitized and hearingized. Consequently, the voice activity
regions are segmented using voice activity detection (VAD)
algorithm. Secondly, features are extracted from the speech signal
using Mel-frequency cepstral coefficients (MFCC) algorithm.
Moreover, delta and acceleration (delta-delta) coefficients have been
added for the reason of improving the recognition accuracy. Finally,
each test word-s features are compared to the training database using
dynamic time warping (DTW) algorithm. Utilizing the best set up
made for all affected parameters to the aforementioned techniques,
the proposed system achieved a recognition rate of about 98.5%
which outperformed other HMM and ANN-based approaches
available in the literature.
Abstract: Agriculture products are being more demanding in
market today. To increase its productivity, automation to produce
these products will be very helpful. The purpose of this work is to
measure and determine the ripeness and quality of watermelon. The
textures on watermelon skin will be captured using digital camera.
These images will be filtered using image processing technique. All
these information gathered will be trained using ANN to determine
the watermelon ripeness accuracy. Initial results showed that the best
model has produced percentage accuracy of 86.51%, when measured
at 32 hidden units with a balanced percentage rate of training dataset.
Abstract: A number of competing methodologies have been developed
to identify genes and classify DNA sequences into coding
and non-coding sequences. This classification process is fundamental
in gene finding and gene annotation tools and is one of the most
challenging tasks in bioinformatics and computational biology. An
information theory measure based on mutual information has shown
good accuracy in classifying DNA sequences into coding and noncoding.
In this paper we describe a species independent iterative
approach that distinguishes coding from non-coding sequences using
the mutual information measure (MIM). A set of sixty prokaryotes is
used to extract universal training data. To facilitate comparisons with
the published results of other researchers, a test set of 51 bacterial
and archaeal genomes was used to evaluate MIM. These results
demonstrate that MIM produces superior results while remaining
species independent.
Abstract: Financial forecasting is an example of signal processing problems. A number of ways to train/learn the network are available. We have used Levenberg-Marquardt algorithm for error back-propagation for weight adjustment. Pre-processing of data has reduced much of the variation at large scale to small scale, reducing the variation of training data.
Abstract: In this study, the Multi-Layer Perceptron (MLP)with Back-Propagation learning algorithm are used to classify to effective diagnosis Parkinsons disease(PD).It-s a challenging problem for medical community.Typically characterized by tremor, PD occurs due to the loss of dopamine in the brains thalamic region that results in involuntary or oscillatory movement in the body. A feature selection algorithm along with biomedical test values to diagnose Parkinson disease.Clinical diagnosis is done mostly by doctor-s expertise and experience.But still cases are reported of wrong diagnosis and treatment. Patients are asked to take number of tests for diagnosis.In many cases,not all the tests contribute towards effective diagnosis of a disease.Our work is to classify the presence of Parkinson disease with reduced number of attributes.Original,22 attributes are involved in classify.We use Information Gain to determine the attributes which reduced the number of attributes which is need to be taken from patients.The Artificial neural networks is used to classify the diagnosis of patients.Twenty-Two attributes are reduced to sixteen attributes.The accuracy is in training data set is 82.051% and in the validation data set is 83.333%.
Abstract: The nature of consumer products causes the difficulty
in forecasting the future demands and the accuracy of the forecasts
significantly affects the overall performance of the supply chain
system. In this study, two data mining methods, artificial neural
network (ANN) and support vector machine (SVM), were utilized to
predict the demand of consumer products. The training data used was
the actual demand of six different products from a consumer product
company in Thailand. The results indicated that SVM had a better
forecast quality (in term of MAPE) than ANN in every category of
products. Moreover, another important finding was the margin
difference of MAPE from these two methods was significantly high
when the data was highly correlated.
Abstract: The self-organizing map (SOM) model is a well-known neural network model with wide spread of applications. The main characteristics of SOM are two-fold, namely dimension reduction and topology preservation. Using SOM, a high-dimensional data space will be mapped to some low-dimensional space. Meanwhile, the topological relations among data will be preserved. With such characteristics, the SOM was usually applied on data clustering and visualization tasks. However, the SOM has main disadvantage of the need to know the number and structure of neurons prior to training, which are difficult to be determined. Several schemes have been proposed to tackle such deficiency. Examples are growing/expandable SOM, hierarchical SOM, and growing hierarchical SOM. These schemes could dynamically expand the map, even generate hierarchical maps, during training. Encouraging results were reported. Basically, these schemes adapt the size and structure of the map according to the distribution of training data. That is, they are data-driven or dataoriented SOM schemes. In this work, a topic-oriented SOM scheme which is suitable for document clustering and organization will be developed. The proposed SOM will automatically adapt the number as well as the structure of the map according to identified topics. Unlike other data-oriented SOMs, our approach expands the map and generates the hierarchies both according to the topics and their characteristics of the neurons. The preliminary experiments give promising result and demonstrate the plausibility of the method.
Abstract: In this paper we describe a hybrid technique of Minimax search and aggregate Mahalanobis distance function synthesis to evolve Awale game player. The hybrid technique helps to suggest a move in a short amount of time without looking into endgame database. However, the effectiveness of the technique is heavily dependent on the training dataset of the Awale strategies utilized. The evolved player was tested against Awale shareware program and the result is appealing.
Abstract: Obtaining labeled data in supervised learning is often
difficult and expensive, and thus the trained learning algorithm tends
to be overfitting due to small number of training data. As a result,
some researchers have focused on using unlabeled data which may
not necessary to follow the same generative distribution as the labeled
data to construct a high-level feature for improving performance on
supervised learning tasks. In this paper, we investigate the impact of
the relationship between unlabeled and labeled data for classification
performance. Specifically, we will apply difference unlabeled data
which have different degrees of relation to the labeled data for
handwritten digit classification task based on MNIST dataset. Our
experimental results show that the higher the degree of relation
between unlabeled and labeled data, the better the classification
performance. Although the unlabeled data that is completely from
different generative distribution to the labeled data provides the lowest
classification performance, we still achieve high classification performance.
This leads to expanding the applicability of the supervised
learning algorithms using unsupervised learning.
Abstract: Automatic face detection is a complex problem in
image processing. Many methods exist to solve this problem such as
template matching, Fisher Linear Discriminate, Neural Networks,
SVM, and MRC. Success has been achieved with each method to
varying degrees and complexities. In proposed algorithm we used
upright, frontal faces for single gray scale images with decent
resolution and under good lighting condition. In the field of face
recognition technique the single face is matched with single face
from the training dataset. The author proposed a neural network
based face detection algorithm from the photographs as well as if any
test data appears it check from the online scanned training dataset.
Experimental result shows that the algorithm detected up to 95%
accuracy for any image.
Abstract: In this paper, a neural network technique is applied to
real-time classifying media while a projectile is penetrating through
them. A laboratory-scaled penetrating setup was built for the
experiment. Features used as the network inputs were extracted from
the acceleration of penetrator. 6000 set of features from a single
penetration with known media and status were used to train the neural
network. The trained system was tested on 30 different penetration
experiments. The system produced an accuracy of 100% on the
training data set. And, their precision could be 99% for the test data
from 30 tests.
Abstract: An additive fuzzy system comprising m rules with
n inputs and p outputs in each rule has at least t m(2n + 2 p + 1)
parameters needing to be tuned. The system consists of a large
number of if-then fuzzy rules and takes a long time to tune its
parameters especially in the case of a large amount of training data
samples. In this paper, a new learning strategy is investigated to cope
with this obstacle. Parameters that tend toward constant values at the
learning process are initially fixed and they are not tuned till the end
of the learning time. Experiments based on applications of the
additive fuzzy system in function approximation demonstrate that the
proposed approach reduces the learning time and hence improves
convergence speed considerably.
Abstract: In this paper, a new learning approach for network
intrusion detection using naïve Bayesian classifier and ID3 algorithm
is presented, which identifies effective attributes from the training
dataset, calculates the conditional probabilities for the best attribute
values, and then correctly classifies all the examples of training and
testing dataset. Most of the current intrusion detection datasets are
dynamic, complex and contain large number of attributes. Some of
the attributes may be redundant or contribute little for detection
making. It has been successfully tested that significant attribute
selection is important to design a real world intrusion detection
systems (IDS). The purpose of this study is to identify effective
attributes from the training dataset to build a classifier for network
intrusion detection using data mining algorithms. The experimental
results on KDD99 benchmark intrusion detection dataset demonstrate
that this new approach achieves high classification rates and reduce
false positives using limited computational resources.
Abstract: Support vector machines (SVMs) are considered to be
the best machine learning algorithms for minimizing the predictive
probability of misclassification. However, their drawback is that for
large data sets the computation of the optimal decision boundary is a
time consuming function of the size of the training set. Hence several
methods have been proposed to speed up the SVM algorithm. Here
three methods used to speed up the computation of the SVM
classifiers are compared experimentally using a musical genre
classification problem. The simplest method pre-selects a random
sample of the data before the application of the SVM algorithm. Two
additional methods use proximity graphs to pre-select data that are
near the decision boundary. One uses k-Nearest Neighbor graphs and
the other Relative Neighborhood Graphs to accomplish the task.
Abstract: Hybrid algorithm is the hot issue in Computational
Intelligence (CI) study. From in-depth discussion on Simulation
Mechanism Based (SMB) classification method and composite patterns,
this paper presents the Mamdani model based Adaptive Neural
Fuzzy Inference System (M-ANFIS) and weight updating formula in
consideration with qualitative representation of inference consequent
parts in fuzzy neural networks. M-ANFIS model adopts Mamdani
fuzzy inference system which has advantages in consequent part.
Experiment results of applying M-ANFIS to evaluate traffic Level
of service show that M-ANFIS, as a new hybrid algorithm in computational
intelligence, has great advantages in non-linear modeling,
membership functions in consequent parts, scale of training data and
amount of adjusted parameters.
Abstract: Heterogeneity of solid waste characteristics as well as the complex processes taking place within the landfill ecosystem motivated the implementation of soft computing methodologies such as artificial neural networks (ANN), fuzzy logic (FL), and their combination. The present work uses a hybrid ANN-FL model that employs knowledge-based FL to describe the process qualitatively and implements the learning algorithm of ANN to optimize model parameters. The model was developed to simulate and predict the landfill gas production at a given time based on operational parameters. The experimental data used were compiled from lab-scale experiment that involved various operating scenarios. The developed model was validated and statistically analyzed using F-test, linear regression between actual and predicted data, and mean squared error measures. Overall, the simulated landfill gas production rates demonstrated reasonable agreement with actual data. The discussion focused on the effect of the size of training datasets and number of training epochs.
Abstract: In this paper we designed and implemented a new
ensemble of classifiers based on a sequence of classifiers which were
specialized in regions of the training dataset where errors of its
trained homologous are concentrated. In order to separate this
regions, and to determine the aptitude of each classifier to properly
respond to a new case, it was used another set of classifiers built
hierarchically. We explored a selection based variant to combine the
base classifiers. We validated this model with different base
classifiers using 37 training datasets. It was carried out a statistical
comparison of these models with the well known Bagging and
Boosting, obtaining significantly superior results with the
hierarchical ensemble using Multilayer Perceptron as base classifier.
Therefore, we demonstrated the efficacy of the proposed ensemble,
as well as its applicability to general problems.
Abstract: Surface roughness (Ra) is one of the most important requirements in machining process. In order to obtain better surface roughness, the proper setting of cutting parameters is crucial before the process take place. This research presents the development of mathematical model for surface roughness prediction before milling process in order to evaluate the fitness of machining parameters; spindle speed, feed rate and depth of cut. 84 samples were run in this study by using FANUC CNC Milling α-Τ14ιE. Those samples were randomly divided into two data sets- the training sets (m=60) and testing sets(m=24). ANOVA analysis showed that at least one of the population regression coefficients was not zero. Multiple Regression Method was used to determine the correlation between a criterion variable and a combination of predictor variables. It was established that the surface roughness is most influenced by the feed rate. By using Multiple Regression Method equation, the average percentage deviation of the testing set was 9.8% and 9.7% for training data set. This showed that the statistical model could predict the surface roughness with about 90.2% accuracy of the testing data set and 90.3% accuracy of the training data set.