Prediction of Protein Subchloroplast Locations using Random Forests

Protein subchloroplast locations are correlated with its functions. In contrast to the large amount of available protein sequences, the information of their locations and functions is less known. The experiment works for identification of protein locations and functions are costly and time consuming. The accurate prediction of protein subchloroplast locations can accelerate the study of functions of proteins in chloroplast. This study proposes a Random Forest based method, ChloroRF, to predict protein subchloroplast locations using interpretable physicochemical properties. In addition to high prediction accuracy, the ChloroRF is able to select important physicochemical properties. The important physicochemical properties are also analyzed to provide insights into the underlying mechanism.

Predicting Protein-Protein Interactions from Protein Sequences Using Phylogenetic Profiles

In this study, a high accuracy protein-protein interaction prediction method is developed. The importance of the proposed method is that it only uses sequence information of proteins while predicting interaction. The method extracts phylogenetic profiles of proteins by using their sequence information. Combining the phylogenetic profiles of two proteins by checking existence of homologs in different species and fitting this combined profile into a statistical model, it is possible to make predictions about the interaction status of two proteins. For this purpose, we apply a collection of pattern recognition techniques on the dataset of combined phylogenetic profiles of protein pairs. Support Vector Machines, Feature Extraction using ReliefF, Naive Bayes Classification, K-Nearest Neighborhood Classification, Decision Trees, and Random Forest Classification are the methods we applied for finding the classification method that best predicts the interaction status of protein pairs. Random Forest Classification outperformed all other methods with a prediction accuracy of 76.93%

A Study of Classification Models to Predict Drill-Bit Breakage Using Degradation Signals

Cutting tools are widely used in manufacturing processes and drilling is the most commonly used machining process. Although drill-bits used in drilling may not be expensive, their breakage can cause damage to expensive work piece being drilled and at the same time has major impact on productivity. Predicting drill-bit breakage, therefore, is important in reducing cost and improving productivity. This study uses twenty features extracted from two degradation signals viz., thrust force and torque. The methodology used involves developing and comparing decision tree, random forest, and multinomial logistic regression models for classifying and predicting drill-bit breakage using degradation signals.

Ensembling Classifiers – An Application toImage Data Classification from Cherenkov Telescope Experiment

Ensemble learning algorithms such as AdaBoost and Bagging have been in active research and shown improvements in classification results for several benchmarking data sets with mainly decision trees as their base classifiers. In this paper we experiment to apply these Meta learning techniques with classifiers such as random forests, neural networks and support vector machines. The data sets are from MAGIC, a Cherenkov telescope experiment. The task is to classify gamma signals from overwhelmingly hadron and muon signals representing a rare class classification problem. We compare the individual classifiers with their ensemble counterparts and discuss the results. WEKA a wonderful tool for machine learning has been used for making the experiments.

Dynamic Features Selection for Heart Disease Classification

The healthcare environment is generally perceived as being information rich yet knowledge poor. However, there is a lack of effective analysis tools to discover hidden relationships and trends in data. In fact, valuable knowledge can be discovered from application of data mining techniques in healthcare system. In this study, a proficient methodology for the extraction of significant patterns from the Coronary Heart Disease warehouses for heart attack prediction, which unfortunately continues to be a leading cause of mortality in the whole world, has been presented. For this purpose, we propose to enumerate dynamically the optimal subsets of the reduced features of high interest by using rough sets technique associated to dynamic programming. Therefore, we propose to validate the classification using Random Forest (RF) decision tree to identify the risky heart disease cases. This work is based on a large amount of data collected from several clinical institutions based on the medical profile of patient. Moreover, the experts- knowledge in this field has been taken into consideration in order to define the disease, its risk factors, and to establish significant knowledge relationships among the medical factors. A computer-aided system is developed for this purpose based on a population of 525 adults. The performance of the proposed model is analyzed and evaluated based on set of benchmark techniques applied in this classification problem.

Traffic Flow Prediction using Adaboost Algorithm with Random Forests as a Weak Learner

Traffic Management and Information Systems, which rely on a system of sensors, aim to describe in real-time traffic in urban areas using a set of parameters and estimating them. Though the state of the art focuses on data analysis, little is done in the sense of prediction. In this paper, we describe a machine learning system for traffic flow management and control for a prediction of traffic flow problem. This new algorithm is obtained by combining Random Forests algorithm into Adaboost algorithm as a weak learner. We show that our algorithm performs relatively well on real data, and enables, according to the Traffic Flow Evaluation model, to estimate and predict whether there is congestion or not at a given time on road intersections.

Meta Random Forests

Leo Breimans Random Forests (RF) is a recent development in tree based classifiers and quickly proven to be one of the most important algorithms in the machine learning literature. It has shown robust and improved results of classifications on standard data sets. Ensemble learning algorithms such as AdaBoost and Bagging have been in active research and shown improvements in classification results for several benchmarking data sets with mainly decision trees as their base classifiers. In this paper we experiment to apply these Meta learning techniques to the random forests. We experiment the working of the ensembles of random forests on the standard data sets available in UCI data sets. We compare the original random forest algorithm with their ensemble counterparts and discuss the results.

Detecting Email Forgery using Random Forests and Naïve Bayes Classifiers

As emails communications have no consistent authentication procedure to ensure the authenticity, we present an investigation analysis approach for detecting forged emails based on Random Forests and Naïve Bays classifiers. Instead of investigating the email headers, we use the body content to extract a unique writing style for all the possible suspects. Our approach consists of four main steps: (1) The cybercrime investigator extract different effective features including structural, lexical, linguistic, and syntactic evidence from previous emails for all the possible suspects, (2) The extracted features vectors are normalized to increase the accuracy rate. (3) The normalized features are then used to train the learning engine, (4) upon receiving the anonymous email (M); we apply the feature extraction process to produce a feature vector. Finally, using the machine learning classifiers the email is assigned to one of the suspects- whose writing style closely matches M. Experimental results on real data sets show the improved performance of the proposed method and the ability of identifying the authors with a very limited number of features.

Using Fractional Factorial Designs for Variable Importance in Random Forest Models

Random Forests are a powerful classification technique, consisting of a collection of decision trees. One useful feature of Random Forests is the ability to determine the importance of each variable in predicting the outcome. This is done by permuting each variable and computing the change in prediction accuracy before and after the permutation. This variable importance calculation is similar to a one-factor-at a time experiment and therefore is inefficient. In this paper, we use a regular fractional factorial design to determine which variables to permute. Based on the results of the trials in the experiment, we calculate the individual importance of the variables, with improved precision over the standard method. The method is illustrated with a study of student attrition at Monash University.

Data Mining Classification Methods Applied in Drug Design

Data mining incorporates a group of statistical methods used to analyze a set of information, or a data set. It operates with models and algorithms, which are powerful tools with the great potential. They can help people to understand the patterns in certain chunk of information so it is obvious that the data mining tools have a wide area of applications. For example in the theoretical chemistry data mining tools can be used to predict moleculeproperties or improve computer-assisted drug design. Classification analysis is one of the major data mining methodologies. The aim of thecontribution is to create a classification model, which would be able to deal with a huge data set with high accuracy. For this purpose logistic regression, Bayesian logistic regression and random forest models were built using R software. TheBayesian logistic regression in Latent GOLD software was created as well. These classification methods belong to supervised learning methods. It was necessary to reduce data matrix dimension before construct models and thus the factor analysis (FA) was used. Those models were applied to predict the biological activity of molecules, potential new drug candidates.

Evaluation of the Impact of Dataset Characteristics for Classification Problems in Biological Applications

Availability of high dimensional biological datasets such as from gene expression, proteomic, and metabolic experiments can be leveraged for the diagnosis and prognosis of diseases. Many classification methods in this area have been studied to predict disease states and separate between predefined classes such as patients with a special disease versus healthy controls. However, most of the existing research only focuses on a specific dataset. There is a lack of generic comparison between classifiers, which might provide a guideline for biologists or bioinformaticians to select the proper algorithm for new datasets. In this study, we compare the performance of popular classifiers, which are Support Vector Machine (SVM), Logistic Regression, k-Nearest Neighbor (k-NN), Naive Bayes, Decision Tree, and Random Forest based on mock datasets. We mimic common biological scenarios simulating various proportions of real discriminating biomarkers and different effect sizes thereof. The result shows that SVM performs quite stable and reaches a higher AUC compared to other methods. This may be explained due to the ability of SVM to minimize the probability of error. Moreover, Decision Tree with its good applicability for diagnosis and prognosis shows good performance in our experimental setup. Logistic Regression and Random Forest, however, strongly depend on the ratio of discriminators and perform better when having a higher number of discriminators.