Abstract: Red blood cells (RBCs) are among the most
commonly and intensively studied type of blood cells in cell biology.
Anemia is a lack of RBCs is characterized by its level compared to
the normal hemoglobin level. In this study, a system based image
processing methodology was developed to localize and extract RBCs
from microscopic images. Also, the machine learning approach is
adopted to classify the localized anemic RBCs images. Several
textural and geometrical features are calculated for each extracted
RBCs. The training set of features was analyzed using principal
component analysis (PCA). With the proposed method, RBCs were
isolated in 4.3secondsfrom an image containing 18 to 27 cells. The
reasons behind using PCA are its low computation complexity and
suitability to find the most discriminating features which can lead to
accurate classification decisions. Our classifier algorithm yielded
accuracy rates of 100%, 99.99%, and 96.50% for K-nearest neighbor
(K-NN) algorithm, support vector machine (SVM), and neural
network RBFNN, respectively. Classification was evaluated in highly
sensitivity, specificity, and kappa statistical parameters. In
conclusion, the classification results were obtained within short time
period, and the results became better when PCA was used.
Abstract: Educational data mining is a specific data mining field applied to data originating from educational environments, it relies on different approaches to discover hidden knowledge from the available data. Among these approaches are machine learning techniques which are used to build a system that acquires learning from previous data. Machine learning can be applied to solve different regression, classification, clustering and optimization problems.
In our research, we propose a “Student Advisory Framework” that utilizes classification and clustering to build an intelligent system. This system can be used to provide pieces of consultations to a first year university student to pursue a certain education track where he/she will likely succeed in, aiming to decrease the high rate of academic failure among these students. A real case study in Cairo Higher Institute for Engineering, Computer Science and Management is presented using real dataset collected from 2000−2012.The dataset has two main components: pre-higher education dataset and first year courses results dataset. Results have proved the efficiency of the suggested framework.
Abstract: The exponential increase in the volume of medical image database has imposed new challenges to clinical routine in maintaining patient history, diagnosis, treatment and monitoring. With the advent of data mining and machine learning techniques it is possible to automate and/or assist physicians in clinical diagnosis. In this research a medical image classification framework using data mining techniques is proposed. It involves feature extraction, feature selection, feature discretization and classification. In the classification phase, the performance of the traditional kNN k nearest neighbor classifier is improved using a feature weighting scheme and a distance weighted voting instead of simple majority voting. Feature weights are calculated using the interestingness measures used in association rule mining. Experiments on the retinal fundus images show that the proposed framework improves the classification accuracy of traditional kNN from 78.57 % to 92.85 %.
Abstract: The goal of a network-based intrusion detection
system is to classify activities of network traffics into two major
categories: normal and attack (intrusive) activities. Nowadays, data
mining and machine learning plays an important role in many
sciences; including intrusion detection system (IDS) using both
supervised and unsupervised techniques. However, one of the
essential steps of data mining is feature selection that helps in
improving the efficiency, performance and prediction rate of
proposed approach. This paper applies unsupervised K-means
clustering algorithm with information gain (IG) for feature selection
and reduction to build a network intrusion detection system. For our
experimental analysis, we have used the new NSL-KDD dataset,
which is a modified dataset for KDDCup 1999 intrusion detection
benchmark dataset. With a split of 60.0% for the training set and the
remainder for the testing set, a 2 class classifications have been
implemented (Normal, Attack). Weka framework which is a java
based open source software consists of a collection of machine
learning algorithms for data mining tasks has been used in the testing
process. The experimental results show that the proposed approach is
very accurate with low false positive rate and high true positive rate
and it takes less learning time in comparison with using the full
features of the dataset with the same algorithm.
Abstract: Learning using labeled and unlabelled data has
received considerable amount of attention in the machine learning
community due its potential in reducing the need for expensive
labeled data. In this work we present a new method for combining
labeled and unlabeled data based on classifier ensembles. The model
we propose assumes each classifier in the ensemble observes the
input using different set of features. Classifiers are initially trained
using some labeled samples. The trained classifiers learn further
through labeling the unknown patterns using a teaching signals that is
generated using the decision of the classifier ensemble, i.e. the
classifiers self-supervise each other. Experiments on a set of object
images are presented. Our experiments investigate different classifier
models, different fusing techniques, different training sizes and
different input features. Experimental results reveal that the proposed
self-supervised ensemble learning approach reduces classification
error over the single classifier and the traditional ensemble classifier
approachs.
Abstract: In this paper we present a GP-based method for automatically evolve projections, so that data can be more easily classified in the projected spaces. At the same time, our approach can reduce dimensionality by constructing more relevant attributes. Fitness of each projection measures how easy is to classify the dataset after applying the projection. This is quickly computed by a Simple Linear Perceptron. We have tested our approach in three domains. The experiments show that it obtains good results, compared to other Machine Learning approaches, while reducing dimensionality in many cases.
Abstract: Hospital staff and managers are under pressure and
concerned for effective use and management of scarce resources. The
hospital admissions require many decisions that have complex and
uncertain consequences for hospital resource utilization and patient
flow. It is challenging to predict risk of admissions and length of stay
of a patient due to their vague nature. There is no method to capture
the vague definition of admission of a patient. Also, current methods
and tools used to predict patients at risk of admission fail to deal with
uncertainty in unplanned admission, LOS, patients- characteristics.
The main objective of this paper is to deal with uncertainty in
health system variables, and handles uncertain relationship among
variables. An introduction of machine learning techniques along with
statistical methods like Regression methods can be a proposed
solution approach to handle uncertainty in health system variables. A
model that adapts fuzzy methods to handle uncertain data and
uncertain relationships can be an efficient solution to capture the
vague definition of admission of a patient.
Abstract: For a given specific problem an efficient algorithm has been the matter of study. However, an alternative approach orthogonal to this approach comes out, which is called a reduction. In general for a given specific problem this reduction approach studies how to convert an original problem into subproblems. This paper proposes a formal modeling language to support this reduction approach in order to make a solver quickly. We show three examples from the wide area of learning problems. The benefit is a fast prototyping of algorithms for a given new problem. It is noted that our formal modeling language is not intend for providing an efficient notation for data mining application, but for facilitating a designer who develops solvers in machine learning.
Abstract: In this study, a fuzzy similarity approach for Arabic
web pages classification is presented. The approach uses a fuzzy
term-category relation by manipulating membership degree for the
training data and the degree value for a test web page. Six measures
are used and compared in this study. These measures include:
Einstein, Algebraic, Hamacher, MinMax, Special case fuzzy and
Bounded Difference approaches. These measures are applied and
compared using 50 different Arabic web pages. Einstein measure was
gave best performance among the other measures. An analysis of
these measures and concluding remarks are drawn in this study.
Abstract: Classification is an important topic in machine learning
and bioinformatics. Many datasets have been introduced for
classification tasks. A dataset contains multiple features, and the quality of features influences the classification accuracy of the dataset.
The power of classification for each feature differs. In this study, we
suggest the Classification Influence Index (CII) as an indicator of classification power for each feature. CII enables evaluation of the
features in a dataset and improved classification accuracy by transformation of the dataset. By conducting experiments using CII
and the k-nearest neighbor classifier to analyze real datasets, we confirmed that the proposed index provided meaningful improvement
of the classification accuracy.
Abstract: Combined therapy using Interferon and Ribavirin is the standard treatment in patients with chronic hepatitis C. However, the number of responders to this treatment is low, whereas its cost and side effects are high. Therefore, there is a clear need to predict patient’s response to the treatment based on clinical information to protect the patients from the bad drawbacks, Intolerable side effects and waste of money. Different machine learning techniques have been developed to fulfill this purpose. From these techniques are Associative Classification (AC) and Decision Tree (DT). The aim of this research is to compare the performance of these two techniques in the prediction of virological response to the standard treatment of HCV from clinical information. 200 patients treated with Interferon and Ribavirin; were analyzed using AC and DT. 150 cases had been used to train the classifiers and 50 cases had been used to test the classifiers. The experiment results showed that the two techniques had given acceptable results however the best accuracy for the AC reached 92% whereas for DT reached 80%.
Abstract: Estimates of temperature values at a specific time of day, from daytime and daily profiles, are needed for a number of environmental, ecological, agricultural and technical applications, ranging from natural hazards assessments, crop growth forecasting to design of solar energy systems. The scope of this research is to investigate the efficiency of data mining techniques in estimating minimum, maximum and mean temperature values. For this reason, a number of experiments have been conducted with well-known regression algorithms using temperature data from the city of Patras in Greece. The performance of these algorithms has been evaluated using standard statistical indicators, such as Correlation Coefficient, Root Mean Squared Error, etc.
Abstract: The information on the Web increases tremendously.
A number of search engines have been developed for searching Web
information and retrieving relevant documents that satisfy the
inquirers needs. Search engines provide inquirers irrelevant
documents among search results, since the search is text-based rather
than semantic-based. Information retrieval research area has
presented a number of approaches and methodologies such as
profiling, feedback, query modification, human-computer interaction,
etc for improving search results. Moreover, information retrieval has
employed artificial intelligence techniques and strategies such as
machine learning heuristics, tuning mechanisms, user and system
vocabularies, logical theory, etc for capturing user's preferences and
using them for guiding the search based on the semantic analysis
rather than syntactic analysis. Although a valuable improvement has
been recorded on search results, the survey has shown that still
search engines users are not really satisfied with their search results.
Using ontologies for semantic-based searching is likely the key
solution. Adopting profiling approach and using ontology base
characteristics, this work proposes a strategy for finding the exact
meaning of the query terms in order to retrieve relevant information
according to user needs. The evaluation of conducted experiments
has shown the effectiveness of the suggested methodology and
conclusion is presented.
Abstract: Until recently, researchers have developed various
tools and methodologies for effective clinical decision-making.
Among those decisions, chest pain diseases have been one of
important diagnostic issues especially in an emergency department. To
improve the ability of physicians in diagnosis, many researchers have
developed diagnosis intelligence by using machine learning and data
mining. However, most of the conventional methodologies have been
generally based on a single classifier for disease classification and
prediction, which shows moderate performance. This study utilizes an
ensemble strategy to combine multiple different classifiers to help
physicians diagnose chest pain diseases more accurately than ever.
Specifically the ensemble strategy is applied by using the integration
of decision trees, neural networks, and support vector machines. The
ensemble models are applied to real-world emergency data. This study
shows that the performance of the ensemble models is superior to each
of single classifiers.
Abstract: Financial forecasting using machine learning techniques has received great efforts in the last decide . In this ongoing work, we show how machine learning of graphical models will be able to infer a visualized causal interactions between different banks in the Saudi equities market. One important discovery from such learned causal graphs is how companies influence each other and to what extend. In this work, a set of graphical models named Gaussian graphical models with developed ensemble penalized feature selection methods that combine ; filtering method, wrapper method and a regularizer will be shown. A comparison between these different developed ensemble combinations will also be shown. The best ensemble method will be used to infer the causal relationships between banks in Saudi equities market.
Abstract: Feature selection has recently been the subject of intensive research in data mining, specially for datasets with a large number of attributes. Recent work has shown that feature selection can have a positive effect on the performance of machine learning algorithms. The success of many learning algorithms in their attempts to construct models of data, hinges on the reliable identification of a small set of highly predictive attributes. The inclusion of irrelevant, redundant and noisy attributes in the model building process phase can result in poor predictive performance and increased computation. In this paper, a novel feature search procedure that utilizes the Ant Colony Optimization (ACO) is presented. The ACO is a metaheuristic inspired by the behavior of real ants in their search for the shortest paths to food sources. It looks for optimal solutions by considering both local heuristics and previous knowledge. When applied to two different classification problems, the proposed algorithm achieved very promising results.
Abstract: Many factors affect the success of Machine Learning
(ML) on a given task. The representation and quality of the instance
data is first and foremost. If there is much irrelevant and redundant
information present or noisy and unreliable data, then knowledge
discovery during the training phase is more difficult. It is well known
that data preparation and filtering steps take considerable amount of
processing time in ML problems. Data pre-processing includes data
cleaning, normalization, transformation, feature extraction and
selection, etc. The product of data pre-processing is the final training
set. It would be nice if a single sequence of data pre-processing
algorithms had the best performance for each data set but this is not
happened. Thus, we present the most well know algorithms for each
step of data pre-processing so that one achieves the best performance
for their data set.
Abstract: The paper discusses the results obtained to predict
reinforcement in singly reinforced beam using Neural Net (NN),
Support Vector Machines (SVM-s) and Tree Based Models. Major
advantage of SVM-s over NN is of minimizing a bound on the
generalization error of model rather than minimizing a bound on
mean square error over the data set as done in NN. Tree Based
approach divides the problem into a small number of sub problems to
reach at a conclusion. Number of data was created for different
parameters of beam to calculate the reinforcement using limit state
method for creation of models and validation. The results from this
study suggest a remarkably good performance of tree based and
SVM-s models. Further, this study found that these two techniques
work well and even better than Neural Network methods. A
comparison of predicted values with actual values suggests a very
good correlation coefficient with all four techniques.
Abstract: Predicting protein-protein interactions represent a key step in understanding proteins functions. This is due to the fact that proteins usually work in context of other proteins and rarely function alone. Machine learning techniques have been applied to predict protein-protein interactions. However, most of these techniques address this problem as a binary classification problem. Although it is easy to get a dataset of interacting proteins as positive examples, there are no experimentally confirmed non-interacting proteins to be considered as negative examples. Therefore, in this paper we solve this problem as a one-class classification problem using one-class support vector machines (SVM). Using only positive examples (interacting protein pairs) in training phase, the one-class SVM achieves accuracy of about 80%. These results imply that protein-protein interaction can be predicted using one-class classifier with comparable accuracy to the binary classifiers that use artificially constructed negative examples.
Abstract: Biclustering is a very useful data mining technique for
identifying patterns where different genes are co-related based on a
subset of conditions in gene expression analysis. Association rules
mining is an efficient approach to achieve biclustering as in
BIMODULE algorithm but it is sensitive to the value given to its
input parameters and the discretization procedure used in the
preprocessing step, also when noise is present, classical association
rules miners discover multiple small fragments of the true bicluster,
but miss the true bicluster itself. This paper formally presents a
generalized noise tolerant bicluster model, termed as μBicluster. An
iterative algorithm termed as BIDENS based on the proposed model
is introduced that can discover a set of k possibly overlapping
biclusters simultaneously. Our model uses a more flexible method to
partition the dimensions to preserve meaningful and significant
biclusters. The proposed algorithm allows discovering biclusters that
hard to be discovered by BIMODULE. Experimental study on yeast,
human gene expression data and several artificial datasets shows that
our algorithm offers substantial improvements over several
previously proposed biclustering algorithms.