Abstract: This paper proposes a novel hybrid algorithm for feature selection based on a binary ant colony and SVM. The final subset selection is attained through the elimination of the features that produce noise or, are strictly correlated with other already selected features. Our algorithm can improve classification accuracy with a small and appropriate feature subset. Proposed algorithm is easily implemented and because of use of a simple filter in that, its computational complexity is very low. The performance of the proposed algorithm is evaluated through a real Rotary Cement kiln dataset. The results show that our algorithm outperforms existing algorithms.
Abstract: Association rules are an important problem in data
mining. Massively increasing volume of data in real life databases
has motivated researchers to design novel and incremental algorithms
for association rules mining. In this paper, we propose an incremental
association rules mining algorithm that integrates shocking
interestingness criterion during the process of building the model. A
new interesting measure called shocking measure is introduced. One
of the main features of the proposed approach is to capture the user
background knowledge, which is monotonically augmented. The
incremental model that reflects the changing data and the user beliefs
is attractive in order to make the over all KDD process more
effective and efficient. We implemented the proposed approach and
experiment it with some public datasets and found the results quite
promising.
Abstract: Feature selection has recently been the subject of intensive research in data mining, specially for datasets with a large number of attributes. Recent work has shown that feature selection can have a positive effect on the performance of machine learning algorithms. The success of many learning algorithms in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation. In this paper, a novel feature search procedure that utilizes the Ant Colony Optimization (ACO) is presented. The ACO is a metaheuristic inspired by the behavior of real ants in their search for the shortest paths to food sources. It looks for optimal solutions by considering both local heuristics and previous knowledge. When applied to two different classification problems, the proposed algorithm achieved very promising results.
Abstract: This study proposes a new recommender system based on the collaborative folksonomy. The purpose of the proposed system is to recommend Internet resources (such as books, articles, documents, pictures, audio and video) to users. The proposed method includes four steps: creating the user profile based on the tags, grouping the similar users into clusters using an agglomerative hierarchical clustering, finding similar resources based on the user-s past collections by using content-based filtering, and recommending similar items to the target user. This study examines the system-s performance for the dataset collected from “del.icio.us," which is a famous social bookmarking website. Experimental results show that the proposed tag-based collaborative and content-based filtering hybridized recommender system is promising and effectiveness in the folksonomy-based bookmarking website.
Abstract: Text Mining is around applying knowledge discovery
techniques to unstructured text is termed knowledge discovery in text
(KDT), or Text data mining or Text Mining. In decision tree
approach is most useful in classification problem. With this
technique, tree is constructed to model the classification process.
There are two basic steps in the technique: building the tree and
applying the tree to the database. This paper describes a proposed
C5.0 classifier that performs rulesets, cross validation and boosting
for original C5.0 in order to reduce the optimization of error ratio.
The feasibility and the benefits of the proposed approach are
demonstrated by means of medial data set like hypothyroid. It is
shown that, the performance of a classifier on the training cases from
which it was constructed gives a poor estimate by sampling or using a
separate test file, either way, the classifier is evaluated on cases that
were not used to build and evaluate the classifier are both are large. If
the cases in hypothyroid.data and hypothyroid.test were to be
shuffled and divided into a new 2772 case training set and a 1000
case test set, C5.0 might construct a different classifier with a lower
or higher error rate on the test cases. An important feature of see5 is
its ability to classifiers called rulesets. The ruleset has an error rate
0.5 % on the test cases. The standard errors of the means provide an
estimate of the variability of results. One way to get a more reliable
estimate of predictive is by f-fold –cross- validation. The error rate of
a classifier produced from all the cases is estimated as the ratio of the
total number of errors on the hold-out cases to the total number of
cases. The Boost option with x trials instructs See5 to construct up to
x classifiers in this manner. Trials over numerous datasets, large and
small, show that on average 10-classifier boosting reduces the error
rate for test cases by about 25%.
Abstract: Shot boundary detection is a fundamental step for the organization of large video data. In this paper, we propose a new method for video gradual shots detection and classification, using advantages of fractal analysis and AIS-based classifier. Proposed features are “vertical intercept" and “fractal dimension" of each frame of videos which are computed using Fourier transform coefficients. We also used a classifier based on Clonal Selection Algorithm. We have carried out our solution and assessed it according to the TRECVID2006 benchmark dataset.
Abstract: Rule Discovery is an important technique for mining
knowledge from large databases. Use of objective measures for
discovering interesting rules leads to another data mining problem,
although of reduced complexity. Data mining researchers have
studied subjective measures of interestingness to reduce the volume
of discovered rules to ultimately improve the overall efficiency of
KDD process.
In this paper we study novelty of the discovered rules as a
subjective measure of interestingness. We propose a hybrid approach
based on both objective and subjective measures to quantify novelty
of the discovered rules in terms of their deviations from the known
rules (knowledge). We analyze the types of deviation that can arise
between two rules and categorize the discovered rules according to
the user specified threshold. We implement the proposed framework
and experiment with some public datasets. The experimental results
are promising.
Abstract: For the past one decade, biclustering has become popular data mining technique not only in the field of biological data analysis but also in other applications like text mining, market data analysis with high-dimensional two-way datasets. Biclustering clusters both rows and columns of a dataset simultaneously, as opposed to traditional clustering which clusters either rows or columns of a dataset. It retrieves subgroups of objects that are similar in one subgroup of variables and different in the remaining variables. Firefly Algorithm (FA) is a recently-proposed metaheuristic inspired by the collective behavior of fireflies. This paper provides a preliminary assessment of discrete version of FA (DFA) while coping with the task of mining coherent and large volume bicluster from web usage dataset. The experiments were conducted on two web usage datasets from public dataset repository whereby the performance of FA was compared with that exhibited by other population-based metaheuristic called binary Particle Swarm Optimization (PSO). The results achieved demonstrate the usefulness of DFA while tackling the biclustering problem.
Abstract: Diabetes Mellitus is a chronic metabolic disorder, where the improper management of the blood glucose level in the diabetic patients will lead to the risk of heart attack, kidney disease and renal failure. This paper attempts to enhance the diagnostic accuracy of the advancing blood glucose levels of the diabetic patients, by combining principal component analysis and wavelet neural network. The proposed system makes separate blood glucose prediction in the morning, afternoon, evening and night intervals, using dataset from one patient covering a period of 77 days. Comparisons of the diagnostic accuracy with other neural network models, which use the same dataset are made. The comparison results showed overall improved accuracy, which indicates the effectiveness of this proposed system.
Abstract: Analysis and visualization of microarraydata is veryassistantfor biologists and clinicians in the field of diagnosis and treatment of patients. It allows Clinicians to better understand the structure of microarray and facilitates understanding gene expression in cells. However, microarray dataset is a complex data set and has thousands of features and a very small number of observations. This very high dimensional data set often contains some noise, non-useful information and a small number of relevant features for disease or genotype. This paper proposes a non-linear dimensionality reduction algorithm Local Principal Component (LPC) which aims to maps high dimensional data to a lower dimensional space. The reduced data represents the most important variables underlying the original data. Experimental results and comparisons are presented to show the quality of the proposed algorithm. Moreover, experiments also show how this algorithm reduces high dimensional data whilst preserving the neighbourhoods of the points in the low dimensional space as in the high dimensional space.
Abstract: Predicting protein-protein interactions represent a key step in understanding proteins functions. This is due to the fact that proteins usually work in context of other proteins and rarely function alone. Machine learning techniques have been applied to predict protein-protein interactions. However, most of these techniques address this problem as a binary classification problem. Although it is easy to get a dataset of interacting proteins as positive examples, there are no experimentally confirmed non-interacting proteins to be considered as negative examples. Therefore, in this paper we solve this problem as a one-class classification problem using one-class support vector machines (SVM). Using only positive examples (interacting protein pairs) in training phase, the one-class SVM achieves accuracy of about 80%. These results imply that protein-protein interaction can be predicted using one-class classifier with comparable accuracy to the binary classifiers that use artificially constructed negative examples.
Abstract: Biclustering is a very useful data mining technique for
identifying patterns where different genes are co-related based on a
subset of conditions in gene expression analysis. Association rules
mining is an efficient approach to achieve biclustering as in
BIMODULE algorithm but it is sensitive to the value given to its
input parameters and the discretization procedure used in the
preprocessing step, also when noise is present, classical association
rules miners discover multiple small fragments of the true bicluster,
but miss the true bicluster itself. This paper formally presents a
generalized noise tolerant bicluster model, termed as μBicluster. An
iterative algorithm termed as BIDENS based on the proposed model
is introduced that can discover a set of k possibly overlapping
biclusters simultaneously. Our model uses a more flexible method to
partition the dimensions to preserve meaningful and significant
biclusters. The proposed algorithm allows discovering biclusters that
hard to be discovered by BIMODULE. Experimental study on yeast,
human gene expression data and several artificial datasets shows that
our algorithm offers substantial improvements over several
previously proposed biclustering algorithms.
Abstract: This paper presents the development of a Bayesian
belief network classifier for prediction of graft status and survival
period in renal transplantation using the patient profile information
prior to the transplantation. The objective was to explore feasibility
of developing a decision making tool for identifying the most suitable
recipient among the candidate pool members. The dataset was
compiled from the University of Toledo Medical Center Hospital
patients as reported to the United Network Organ Sharing, and had
1228 patient records for the period covering 1987 through 2009. The
Bayes net classifiers were developed using the Weka machine
learning software workbench. Two separate classifiers were induced
from the data set, one to predict the status of the graft as either failed
or living, and a second classifier to predict the graft survival period.
The classifier for graft status prediction performed very well with a
prediction accuracy of 97.8% and true positive values of 0.967 and
0.988 for the living and failed classes, respectively. The second
classifier to predict the graft survival period yielded a prediction
accuracy of 68.2% and a true positive rate of 0.85 for the class
representing those instances with kidneys failing during the first year
following transplantation. Simulation results indicated that it is
feasible to develop a successful Bayesian belief network classifier for
prediction of graft status, but not the graft survival period, using the
information in UNOS database.
Abstract: Data mining can be called as a technique to extract
information from data. It is the process of obtaining hidden
information and then turning it into qualified knowledge by statistical
and artificial intelligence technique. One of its application areas is
medical area to form decision support systems for diagnosis just by
inventing meaningful information from given medical data. In this
study a decision support system for diagnosis of illness that make use
of data mining and three different artificial intelligence classifier
algorithms namely Multilayer Perceptron, Naive Bayes Classifier and
J.48. Pima Indian dataset of UCI Machine Learning Repository was
used. This dataset includes urinary and blood test results of 768
patients. These test results consist of 8 different feature vectors.
Obtained classifying results were compared with the previous studies.
The suggestions for future studies were presented.
Abstract: Globalization and therefore increasing tight competition among companies, have resulted to increase the importance of making well-timed decision. Devising and employing effective strategies, that are flexible and adaptive to changing market, stand a greater chance of being effective in the long-term. In other side, a clear focus on managing the entire product lifecycle has emerged as critical areas for investment. Therefore, applying wellorganized tools to employ past experience in new case, helps to make proper and managerial decisions. Case based reasoning (CBR) is based on a means of solving a new problem by using or adapting solutions to old problems. In this paper, an adapted CBR model with k-nearest neighbor (K-NN) is employed to provide suggestions for better decision making which are adopted for a given product in the middle of life phase. The set of solutions are weighted by CBR in the principle of group decision making. Wrapper approach of genetic algorithm is employed to generate optimal feature subsets. The dataset of the department store, including various products which are collected among two years, have been used. K-fold approach is used to evaluate the classification accuracy rate. Empirical results are compared with classical case based reasoning algorithm which has no special process for feature selection, CBR-PCA algorithm based on filter approach feature selection, and Artificial Neural Network. The results indicate that the predictive performance of the model, compare with two CBR algorithms, in specific case is more effective.
Abstract: A predictive clustering hybrid regression (pCHR)
approach was developed and evaluated using dataset from H2-
producing sucrose-based bioreactor operated for 15 months. The aim
was to model and predict the H2-production rate using information
available about envirome and metabolome of the bioprocess. Selforganizing
maps (SOM) and Sammon map were used to visualize the
dataset and to identify main metabolic patterns and clusters in
bioprocess data. Three metabolic clusters: acetate coupled with other
metabolites, butyrate only, and transition phases were detected. The
developed pCHR model combines principles of k-means clustering,
kNN classification and regression techniques. The model performed
well in modeling and predicting the H2-production rate with mean
square error values of 0.0014 and 0.0032, respectively.
Abstract: Network security attacks are the violation of
information security policy that received much attention to the
computational intelligence society in the last decades. Data mining
has become a very useful technique for detecting network intrusions
by extracting useful knowledge from large number of network data
or logs. Naïve Bayesian classifier is one of the most popular data
mining algorithm for classification, which provides an optimal way
to predict the class of an unknown example. It has been tested that
one set of probability derived from data is not good enough to have
good classification rate. In this paper, we proposed a new learning
algorithm for mining network logs to detect network intrusions
through naïve Bayesian classifier, which first clusters the network
logs into several groups based on similarity of logs, and then
calculates the prior and conditional probabilities for each group of
logs. For classifying a new log, the algorithm checks in which cluster
the log belongs and then use that cluster-s probability set to classify
the new log. We tested the performance of our proposed algorithm by
employing KDD99 benchmark network intrusion detection dataset,
and the experimental results proved that it improves detection rates
as well as reduces false positives for different types of network
intrusions.
Abstract: Classifying biomedical literature is a difficult and
challenging task, especially when a large number of biomedical
articles should be organized into a hierarchical structure. In this paper,
we present an approach for classifying a collection of biomedical text
abstracts downloaded from Medline database with the help of
ontology alignment. To accomplish our goal, we construct two types
of hierarchies, the OHSUMED disease hierarchy and the Medline
abstract disease hierarchies from the OHSUMED dataset and the
Medline abstracts, respectively. Then, we enrich the OHSUMED
disease hierarchy before adapting it to ontology alignment process for
finding probable concepts or categories. Subsequently, we compute
the cosine similarity between the vector in probable concepts (in the
“enriched" OHSUMED disease hierarchy) and the vector in Medline
abstract disease hierarchies. Finally, we assign category to the new
Medline abstracts based on the similarity score. The results obtained
from the experiments show the performance of our proposed approach
for hierarchical classification is slightly better than the performance of
the multi-class flat classification.
Abstract: Economic freedoms, most emphasized issue in the recent years, are considered to affect economic growth and performance via institutional structure. In this context, a model that includes Turkey and Middle East Countries, and where the effects of economic freedom on growth are examined, was formed. For the groups of countries determined, in the study carried out by using the dataset belonging the period of 2004 - 2009, between economic freedoms and growth, a negative relationship was observed as group. In the sense of individual effects, it was identified that there was a positive relationship in terms of some Middle East Countries and Turkey.
Abstract: Intrusion detection is a mechanism used to protect a
system and analyse and predict the behaviours of system users. An
ideal intrusion detection system is hard to achieve due to
nonlinearity, and irrelevant or redundant features. This study
introduces a new anomaly-based intrusion detection model. The
suggested model is based on particle swarm optimisation and
nonlinear, multi-class and multi-kernel support vector machines.
Particle swarm optimisation is used for feature selection by applying
a new formula to update the position and the velocity of a particle;
the support vector machine is used as a classifier. The proposed
model is tested and compared with the other methods using the KDD
CUP 1999 dataset. The results indicate that this new method achieves
better accuracy rates than previous methods.