Abstract: The subchloroplast location of a protein is correlated with
its function. In contrast to the large number of available protein
sequences, far less is known about their locations and functions.
Experimental identification of protein locations and functions is
costly and time-consuming. Accurate prediction of protein
subchloroplast locations can accelerate the study of protein
functions in the chloroplast. This study proposes a Random
Forest-based method, ChloroRF, to predict protein subchloroplast
locations using interpretable physicochemical properties. In addition
to achieving high prediction accuracy, ChloroRF is able to select
important physicochemical properties. The important physicochemical
properties are also analyzed to provide insights into the underlying
mechanism.
Abstract: In this study, a high-accuracy protein-protein interaction
prediction method is developed. The key advantage of the proposed
method is that it uses only protein sequence information to predict
interactions. The method extracts phylogenetic profiles of
proteins from their sequence information. By combining the
phylogenetic profiles of two proteins, checking for the existence of
homologs in different species, and fitting the combined profile to a
statistical model, it is possible to predict the interaction status
of two proteins.
For this purpose, we apply a collection of pattern recognition
techniques to the dataset of combined phylogenetic profiles of protein
pairs. Support Vector Machines, Feature Extraction using ReliefF,
Naive Bayes Classification, K-Nearest Neighbor Classification,
Decision Trees, and Random Forest Classification are the methods
we applied to find the classification method that best predicts
the interaction status of protein pairs. Random Forest Classification
outperformed all other methods, with a prediction accuracy of 76.93%.
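The profile-combination step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the species panel, the homolog calls, and the per-species encoding (0 = neither protein has a homolog, 1 = one does, 2 = both do) are hypothetical choices, not the encoding used in the study.

```python
# Hypothetical species panel; the study's actual panel is not given in the abstract.
SPECIES = ["E. coli", "S. cerevisiae", "A. thaliana", "D. melanogaster", "H. sapiens"]


def phylogenetic_profile(homolog_species, species=SPECIES):
    """Return a 0/1 vector: 1 where a homolog of the protein exists in that species."""
    present = set(homolog_species)
    return [1 if name in present else 0 for name in species]


def combine_profiles(profile_a, profile_b):
    """Encode a protein pair per species (assumed rule): 0 = neither, 1 = one, 2 = both."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species panel")
    return [a + b for a, b in zip(profile_a, profile_b)]


# Made-up homolog calls for two proteins.
profile_a = phylogenetic_profile(["E. coli", "S. cerevisiae", "H. sapiens"])
profile_b = phylogenetic_profile(["S. cerevisiae", "H. sapiens"])
pair_features = combine_profiles(profile_a, profile_b)
print(pair_features)  # prints [1, 2, 0, 0, 2]
```

A vector like `pair_features`, computed for every protein pair, is the kind of input the classifiers listed above would be trained on.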
Abstract: Cutting tools are widely used in manufacturing processes, and drilling is the most commonly used machining process. Although the drill bits used in drilling may not be expensive, their breakage can damage the expensive workpiece being drilled and, at the same time, have a major impact on productivity. Predicting drill-bit breakage is therefore important for reducing cost and improving productivity. This study uses twenty features extracted from two degradation signals, namely thrust force and torque. The methodology involves developing and comparing decision tree, random forest, and multinomial logistic regression models for classifying and predicting drill-bit breakage from the degradation signals.
Abstract: Ensemble learning algorithms such as AdaBoost and
Bagging have been actively researched and have shown improvements in
classification results on several benchmark data sets, mainly with
decision trees as their base classifiers. In this paper we apply
these meta-learning techniques to classifiers such as random
forests, neural networks and support vector machines. The data sets
are from MAGIC, a Cherenkov telescope experiment. The task is to
separate gamma signals from the far more numerous hadron and muon
signals, which makes it a rare-class classification problem. We compare
the individual classifiers with their ensemble counterparts and
discuss the results. WEKA, a machine learning tool, was used to run
the experiments.
Abstract: The healthcare environment is generally perceived as
being information-rich yet knowledge-poor: there is a lack
of effective analysis tools to discover hidden relationships and trends
in data. Valuable knowledge can nevertheless be discovered by
applying data mining techniques in healthcare systems. This
study presents a methodology for extracting significant
patterns from Coronary Heart Disease data warehouses for the
prediction of heart attack, which unfortunately continues to be a
leading cause of mortality worldwide. For this purpose,
we propose to dynamically enumerate the optimal subsets of the
reduced features of high interest by using the rough sets technique
combined with dynamic programming. We then
validate the classification using Random Forest (RF) decision trees to
identify high-risk heart disease cases. This work is based on a large
amount of data collected from several clinical institutions and
describing the medical profiles of patients. Moreover, experts' knowledge in
this field has been taken into consideration in order to define the
disease and its risk factors, and to establish significant knowledge
relationships among the medical factors. A computer-aided system is
developed for this purpose based on a population of 525 adults. The
performance of the proposed model is analyzed and evaluated against
a set of benchmark techniques applied to this classification problem.
Abstract: Traffic Management and Information Systems, which rely on a system of sensors, aim to describe urban traffic in real time by estimating a set of parameters. Although the state of the art focuses on data analysis, little has been done on prediction. In this paper, we describe a machine learning system for traffic flow management and control, applied to a traffic flow prediction problem. The new algorithm is obtained by using the Random Forests algorithm as the weak learner within the AdaBoost algorithm. We show that our algorithm performs relatively well on real data and makes it possible, according to the Traffic Flow Evaluation model, to estimate and predict whether there is congestion at road intersections at a given time.
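To make the idea of plugging a learner into AdaBoost concrete, here is a minimal sketch of discrete AdaBoost with a replaceable weak learner. It is not the authors' system: a one-feature decision stump stands in for the Random Forest to keep the example short, and the ten training points are a toy set, not traffic data. The function names are illustrative.

```python
import math


def train_stump(X, y, w):
    """Weak learner (stand-in for a Random Forest): best weighted threshold rule."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[j] <= t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    _, j, t, sign = best
    return lambda row: sign if row[j] <= t else -sign


def adaboost(X, y, weak_learner, rounds):
    """Discrete AdaBoost for labels in {-1, +1}; returns a list of (alpha, hypothesis)."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        h = weak_learner(X, y, w)
        preds = [h(row) for row in X]
        err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        if err >= 0.5:
            break  # weak learner is no better than chance on the current weights
        err = max(err, 1e-10)  # avoid division by zero for a perfect learner
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        # Increase the weight of misclassified points, decrease the rest.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Weighted vote of all hypotheses."""
    score = sum(alpha * h(row) for alpha, h in ensemble)
    return 1 if score >= 0 else -1


# Toy one-feature data that no single threshold can separate.
X = [[float(i)] for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(X, y, train_stump, rounds=3)
train_accuracy = sum(predict(ensemble, row) == label for row, label in zip(X, y)) / len(y)
```

On this toy set the best single stump misclassifies three of the ten points, while three boosted stumps fit all ten. Swapping in a stronger learner such as a Random Forest only requires another function with the same `(X, y, w)` signature that honors the sample weights.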
Abstract: Leo Breiman's Random Forests (RF) is a recent
development in tree-based classifiers and has quickly proven to be one of
the most important algorithms in the machine learning literature. It
has shown robust and improved classification results on standard
data sets. Ensemble learning algorithms such as AdaBoost and
Bagging have been actively researched and have shown improvements in
classification results on several benchmark data sets, mainly with
decision trees as their base classifiers. In this paper we apply
these meta-learning techniques to random forests. We
evaluate the ensembles of random forests on the
standard data sets available in the UCI repository. We compare the
original random forest algorithm with its ensemble counterparts
and discuss the results.
Abstract: Because email communications have no consistent
authentication procedure to ensure authenticity, we present an
investigative analysis approach for detecting forged emails based on
Random Forests and Naïve Bayes classifiers. Instead of investigating
the email headers, we use the body content to extract a unique writing
style for each possible suspect. Our approach consists of four main
steps: (1) the cybercrime investigator extracts different effective
features, including structural, lexical, linguistic, and syntactic
evidence, from previous emails of all the possible suspects; (2) the
extracted feature vectors are normalized to increase the accuracy
rate; (3) the normalized features are then used to train the learning
engine; (4) upon receiving the anonymous email (M), we apply the
feature extraction process to produce a feature vector. Finally, using
the machine learning classifiers, the email is assigned to the
suspect whose writing style most closely matches M. Experimental
results on real data sets show the improved performance of the
proposed method and its ability to identify the authors with a
very limited number of features.
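The feature extraction and normalization steps above can be sketched as follows. The four features and the min-max scaling are illustrative assumptions, since the abstract names neither the exact features nor the normalization scheme; the example emails are invented. Classifier training (step 3) is left out.

```python
import string


def extract_features(text):
    """Return a small, hypothetical stylometric feature vector for an email body."""
    words = text.split()
    lines = [line for line in text.splitlines() if line.strip()]
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    n_lines = max(len(lines), 1)
    return [
        sum(len(word) for word in words) / n_words,             # average word length
        sum(ch in string.punctuation for ch in text) / n_chars,  # punctuation ratio
        sum(ch.isupper() for ch in text) / n_chars,              # uppercase ratio
        len(words) / n_lines,                                    # words per non-empty line
    ]


def fit_min_max(vectors):
    """Learn per-feature minima and maxima from the suspects' previous emails."""
    columns = list(zip(*vectors))
    return [min(col) for col in columns], [max(col) for col in columns]


def apply_min_max(vector, lows, highs):
    """Scale one vector with the learned ranges; constant features map to 0.0.

    Values from a new email can fall outside [0, 1] if they exceed the training range.
    """
    return [
        (v - lo) / (hi - lo) if hi > lo else 0.0
        for v, lo, hi in zip(vector, lows, highs)
    ]


# Invented example emails from known suspects, plus an anonymous email M.
known_emails = ["Hi there.", "PLEASE call me back!!\nThanks, J."]
train_vectors = [extract_features(body) for body in known_emails]
lows, highs = fit_min_max(train_vectors)
m_vector = apply_min_max(extract_features("Call me. Thanks!"), lows, highs)
```

Scaling M with ranges learned from the suspects' emails, rather than from M itself, keeps the anonymous email comparable with the training vectors.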
Abstract: Random Forests are a powerful classification technique, consisting of a collection of decision trees. One useful feature of Random Forests is the ability to determine the importance of each variable in predicting the outcome. This is done by permuting each variable and computing the change in prediction accuracy before and after the permutation. This variable importance calculation is similar to a one-factor-at-a-time experiment and is therefore inefficient. In this paper, we use a regular fractional factorial design to determine which variables to permute. Based on the results of the trials in the experiment, we calculate the individual importance of the variables, with improved precision over the standard method. The method is illustrated with a study of student attrition at Monash University.
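The standard one-variable-at-a-time permutation importance that this abstract takes as its baseline can be sketched as below. This is a simplified illustration, not the fractional factorial method proposed in the paper: a fixed rule stands in for a trained forest, the data are synthetic, and importance is measured on the full data set rather than on out-of-bag samples.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows the model classifies correctly."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, rng):
    """Drop in accuracy after shuffling each feature column, one at a time."""
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between feature j and the labels
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(baseline - accuracy(model, X_perm, y))
    return importances


# Synthetic data: the label depends on feature 0 only; feature 1 is pure noise.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [1 if row[0] > 0.5 else 0 for row in X]


def model(row):
    """Hypothetical stand-in for a trained Random Forest."""
    return 1 if row[0] > 0.5 else 0


importances = permutation_importance(model, X, y, rng)
```

Here `importances[0]` is large because shuffling feature 0 destroys the model's accuracy, while `importances[1]` is exactly zero because the model never reads feature 1. Each variable costs one full re-evaluation, which is the inefficiency the abstract points to.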
Abstract: Data mining incorporates a group of statistical
methods used to analyze a set of information, or a data set. It operates
with models and algorithms, which are powerful tools with great
potential. They can help people understand the patterns in a given
body of information, so data mining tools have
a wide range of applications. For example, in theoretical chemistry,
data mining tools can be used to predict molecule properties or
improve computer-assisted drug design. Classification analysis is one
of the major data mining methodologies. The aim of the contribution
is to create a classification model that can deal with a
huge data set with high accuracy. For this purpose, logistic regression,
Bayesian logistic regression and random forest models were built
using R software. The Bayesian logistic regression was also created
in Latent GOLD software. These classification methods belong to
supervised learning methods.
It was necessary to reduce the dimension of the data matrix before
constructing the models, and thus factor analysis (FA) was used. The models
were applied to predict the biological activity of molecules that are
potential new drug candidates.
Abstract: The availability of high-dimensional biological datasets, such as those from gene expression, proteomic, and metabolic experiments, can be leveraged for the diagnosis and prognosis of diseases. Many classification methods in this area have been studied to predict disease states and to separate predefined classes, such as patients with a specific disease versus healthy controls. However, most of the existing research focuses on only one specific dataset. There is a lack of generic comparison between classifiers, which might provide a guideline for biologists or bioinformaticians to select the proper algorithm for new datasets. In this study, we compare the performance of popular classifiers, namely Support Vector Machine (SVM), Logistic Regression, k-Nearest Neighbor (k-NN), Naive Bayes, Decision Tree, and Random Forest, on mock datasets. We mimic common biological scenarios by simulating various proportions of real discriminating biomarkers and different effect sizes thereof. The results show that SVM performs quite stably and reaches a higher AUC than the other methods. This may be explained by the ability of SVM to minimize the probability of error. Moreover, Decision Tree, with its good applicability for diagnosis and prognosis, shows good performance in our experimental setup. Logistic Regression and Random Forest, however, strongly depend on the ratio of discriminators and perform better when the number of discriminators is higher.
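The kind of simulation described above, with a chosen number of discriminating features and a chosen effect size, evaluated by AUC, can be sketched as follows. Everything specific here is an assumption made for illustration: the Gaussian data model, the parameter values, and the naive scoring rule (the mean of all features) that stands in for a trained classifier.

```python
import random


def auc(scores, labels):
    """AUC as the chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate(n_per_class, n_features, n_discriminators, effect_size, rng):
    """Mock data: the first n_discriminators features are shifted in class 1."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [
                rng.gauss(effect_size * label if j < n_discriminators else 0.0, 1.0)
                for j in range(n_features)
            ]
            X.append(row)
            y.append(label)
    return X, y


def mean_score(row):
    """Naive scoring rule used as a stand-in for a trained classifier."""
    return sum(row) / len(row)


rng = random.Random(42)
results = {}
for n_disc in (1, 10):  # few versus many real discriminators among 20 features
    X, y = simulate(100, 20, n_disc, 1.0, rng)
    results[n_disc] = auc([mean_score(row) for row in X], y)
    print(n_disc, "discriminators: AUC =", round(results[n_disc], 3))
```

With this scoring rule, the AUC rises as the share of discriminating features grows, which mirrors the dependence on the ratio of discriminators that the abstract reports for some classifiers.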