Computing Entropy for Ortholog Detection

Biological sequences from different species are called or-thologs if they evolved from a sequence of a common ancestor species and they have the same biological function. Approximations of Kolmogorov complexity or entropy of biological sequences are already well known to be useful in extracting similarity information between such sequences -in the interest, for example, of ortholog detection. As is well known, the exact Kolmogorov complexity is not algorithmically computable. In prac-tice one can approximate it by computable compression methods. How-ever, such compression methods do not provide a good approximation to Kolmogorov complexity for short sequences. Herein is suggested a new ap-proach to overcome the problem that compression approximations may notwork well on short sequences. This approach is inspired by new, conditional computations of Kolmogorov entropy. A main contribution of the empir-ical work described shows the new set of entropy-based machine learning attributes provides good separation between positive (ortholog) and nega-tive (non-ortholog) data - better than with good, previously known alter-natives (which do not employ some means to handle short sequences well).Also empirically compared are the new entropy based attribute set and a number of other, more standard similarity attributes sets commonly used in genomic analysis. The various similarity attributes are evaluated by cross validation, through boosted decision tree induction C5.0, and by Receiver Operating Characteristic (ROC) analysis. The results point to the conclu-sion: the new, entropy based attribute set by itself is not the one giving the best prediction; however, it is the best attribute set for use in improving the other, standard attribute sets when conjoined with them.

Fusion of ETM+ Multispectral and Panchromatic Texture for Remote Sensing Classification

This paper proposes to use ETM+ multispectral data and panchromatic band as well as texture features derived from the panchromatic band for land cover classification. Four texture features including one 'internal texture' and three GLCM based textures namely correlation, entropy, and inverse different moment were used in combination with ETM+ multispectral data. Two data sets involving combination of multispectral, panchromatic band and its texture were used and results were compared with those obtained by using multispectral data alone. A decision tree classifier with and without boosting were used to classify different datasets. Results from this study suggest that the dataset consisting of panchromatic band, four of its texture features and multispectral data was able to increase the classification accuracy by about 2%. In comparison, a boosted decision tree was able to increase the classification accuracy by about 3% with the same dataset.

Predicting Protein-Protein Interactions from Protein Sequences Using Phylogenetic Profiles

In this study, a high accuracy protein-protein interaction prediction method is developed. The importance of the proposed method is that it only uses sequence information of proteins while predicting interaction. The method extracts phylogenetic profiles of proteins by using their sequence information. Combining the phylogenetic profiles of two proteins by checking existence of homologs in different species and fitting this combined profile into a statistical model, it is possible to make predictions about the interaction status of two proteins. For this purpose, we apply a collection of pattern recognition techniques on the dataset of combined phylogenetic profiles of protein pairs. Support Vector Machines, Feature Extraction using ReliefF, Naive Bayes Classification, K-Nearest Neighborhood Classification, Decision Trees, and Random Forest Classification are the methods we applied for finding the classification method that best predicts the interaction status of protein pairs. Random Forest Classification outperformed all other methods with a prediction accuracy of 76.93%

Automatic Reusability Appraisal of Software Components using Neuro-fuzzy Approach

Automatic reusability appraisal could be helpful in evaluating the quality of developed or developing reusable software components and in identification of reusable components from existing legacy systems; that can save cost of developing the software from scratch. But the issue of how to identify reusable components from existing systems has remained relatively unexplored. In this paper, we have mentioned two-tier approach by studying the structural attributes as well as usability or relevancy of the component to a particular domain. Latent semantic analysis is used for the feature vector representation of various software domains. It exploits the fact that FeatureVector codes can be seen as documents containing terms -the idenifiers present in the components- and so text modeling methods that capture co-occurrence information in low-dimensional spaces can be used. Further, we devised Neuro- Fuzzy hybrid Inference System, which takes structural metric values as input and calculates the reusability of the software component. Decision tree algorithm is used to decide initial set of fuzzy rules for the Neuro-fuzzy system. The results obtained are convincing enough to propose the system for economical identification and retrieval of reusable software components.

Integrating Context Priors into a Decision Tree Classification Scheme

Scene interpretation systems need to match (often ambiguous) low-level input data to concepts from a high-level ontology. In many domains, these decisions are uncertain and benefit greatly from proper context. This paper demonstrates the use of decision trees for estimating class probabilities for regions described by feature vectors, and shows how context can be introduced in order to improve the matching performance.

Prediction of Reusability of Object Oriented Software Systems using Clustering Approach

In literature, there are metrics for identifying the quality of reusable components but the framework that makes use of these metrics to precisely predict reusability of software components is still need to be worked out. These reusability metrics if identified in the design phase or even in the coding phase can help us to reduce the rework by improving quality of reuse of the software component and hence improve the productivity due to probabilistic increase in the reuse level. As CK metric suit is most widely used metrics for extraction of structural features of an object oriented (OO) software; So, in this study, tuned CK metric suit i.e. WMC, DIT, NOC, CBO and LCOM, is used to obtain the structural analysis of OO-based software components. An algorithm has been proposed in which the inputs can be given to K-Means Clustering system in form of tuned values of the OO software component and decision tree is formed for the 10-fold cross validation of data to evaluate the in terms of linguistic reusability value of the component. The developed reusability model has produced high precision results as desired.

Predicting Protein Function using Decision Tree

The drug discovery process starts with protein identification because proteins are responsible for many functions required for maintenance of life. Protein identification further needs determination of protein function. Proposed method develops a classifier for human protein function prediction. The model uses decision tree for classification process. The protein function is predicted on the basis of matched sequence derived features per each protein function. The research work includes the development of a tool which determines sequence derived features by analyzing different parameters. The other sequence derived features are determined using various web based tools.

Learning User Keystroke Patterns for Authentication

Keystroke authentication is a new access control system to identify legitimate users via their typing behavior. In this paper, machine learning techniques are adapted for keystroke authentication. Seven learning methods are used to build models to differentiate user keystroke patterns. The selected classification methods are Decision Tree, Naive Bayesian, Instance Based Learning, Decision Table, One Rule, Random Tree and K-star. Among these methods, three of them are studied in more details. The results show that machine learning is a feasible alternative for keystroke authentication. Compared to the conventional Nearest Neighbour method in the recent research, learning methods especially Decision Tree can be more accurate. In addition, the experiment results reveal that 3-Grams is more accurate than 2-Grams and 4-Grams for feature extraction. Also, combination of attributes tend to result higher accuracy.

Hybrid Machine Learning Approach for Text Categorization

Text categorization - the assignment of natural language documents to one or more predefined categories based on their semantic content - is an important component in many information organization and management tasks. Performance of neural networks learning is known to be sensitive to the initial weights and architecture. This paper discusses the use multilayer neural network initialization with decision tree classifier for improving text categorization accuracy. An adaptation of the algorithm is proposed in which a decision tree from root node until a final leave is used for initialization of multilayer neural network. The experimental evaluation demonstrates this approach provides better classification accuracy with Reuters-21578 corpus, one of the standard benchmarks for text categorization tasks. We present results comparing the accuracy of this approach with multilayer neural network initialized with traditional random method and decision tree classifiers.

Improving University Operations with Data Mining: Predicting Student Performance

The purpose of this paper is to develop models that would enable predicting student success. These models could improve allocation of students among colleges and optimize the newly introduced model of government subsidies for higher education. For the purpose of collecting data, an anonymous survey was carried out in the last year of undergraduate degree student population using random sampling method. Decision trees were created of which two have been chosen that were most successful in predicting student success based on two criteria: Grade Point Average (GPA) and time that a student needs to finish the undergraduate program (time-to-degree). Decision trees have been shown as a good method of classification student success and they could be even more improved by increasing survey sample and developing specialized decision trees for each type of college. These types of methods have a big potential for use in decision support systems.

Defect Cause Modeling with Decision Tree and Regression Analysis

The main aim of this study is to identify the most influential variables that cause defects on the items produced by a casting company located in Turkey. To this end, one of the items produced by the company with high defective percentage rates is selected. Two approaches-the regression analysis and decision treesare used to model the relationship between process parameters and defect types. Although logistic regression models failed, decision tree model gives meaningful results. Based on these results, it can be claimed that the decision tree approach is a promising technique for determining the most important process variables.

A Study of Classification Models to Predict Drill-Bit Breakage Using Degradation Signals

Cutting tools are widely used in manufacturing processes and drilling is the most commonly used machining process. Although drill-bits used in drilling may not be expensive, their breakage can cause damage to expensive work piece being drilled and at the same time has major impact on productivity. Predicting drill-bit breakage, therefore, is important in reducing cost and improving productivity. This study uses twenty features extracted from two degradation signals viz., thrust force and torque. The methodology used involves developing and comparing decision tree, random forest, and multinomial logistic regression models for classifying and predicting drill-bit breakage using degradation signals.

Evolutionary Decision Trees and Software Metrics for Module Defects Identification

Software metric is a measure of some property of a piece of software or its specification. The aim of this paper is to present an application of evolutionary decision trees in software engineering in order to classify the software modules that have or have not one or more reported defects. For this some metrics are used for detecting the class of modules with defects or without defects.

Application of Machine Learning Methods to Online Test Error Detection in Semiconductor Test

As in today's semiconductor industries test costs can make up to 50 percent of the total production costs, an efficient test error detection becomes more and more important. In this paper, we present a new machine learning approach to test error detection that should provide a faster recognition of test system faults as well as an improved test error recall. The key idea is to learn a classifier ensemble, detecting typical test error patterns in wafer test results immediately after finishing these tests. Since test error detection has not yet been discussed in the machine learning community, we define central problem-relevant terms and provide an analysis of important domain properties. Finally, we present comparative studies reflecting the failure detection performance of three individual classifiers and three ensemble methods based upon them. As base classifiers we chose a decision tree learner, a support vector machine and a Bayesian network, while the compared ensemble methods were simple and weighted majority vote as well as stacking. For the evaluation, we used cross validation and a specially designed practical simulation. By implementing our approach in a semiconductor test department for the observation of two products, we proofed its practical applicability.

Extraction of Symbolic Rules from Artificial Neural Networks

Although backpropagation ANNs generally predict better than decision trees do for pattern classification problems, they are often regarded as black boxes, i.e., their predictions cannot be explained as those of decision trees. In many applications, it is desirable to extract knowledge from trained ANNs for the users to gain a better understanding of how the networks solve the problems. A new rule extraction algorithm, called rule extraction from artificial neural networks (REANN) is proposed and implemented to extract symbolic rules from ANNs. A standard three-layer feedforward ANN is the basis of the algorithm. A four-phase training algorithm is proposed for backpropagation learning. Explicitness of the extracted rules is supported by comparing them to the symbolic rules generated by other methods. Extracted rules are comparable with other methods in terms of number of rules, average number of conditions for a rule, and predictive accuracy. Extensive experimental studies on several benchmarks classification problems, such as breast cancer, iris, diabetes, and season classification problems, demonstrate the effectiveness of the proposed approach with good generalization ability.

Ensembling Classifiers – An Application toImage Data Classification from Cherenkov Telescope Experiment

Ensemble learning algorithms such as AdaBoost and Bagging have been in active research and shown improvements in classification results for several benchmarking data sets with mainly decision trees as their base classifiers. In this paper we experiment to apply these Meta learning techniques with classifiers such as random forests, neural networks and support vector machines. The data sets are from MAGIC, a Cherenkov telescope experiment. The task is to classify gamma signals from overwhelmingly hadron and muon signals representing a rare class classification problem. We compare the individual classifiers with their ensemble counterparts and discuss the results. WEKA a wonderful tool for machine learning has been used for making the experiments.

Context-aware Recommender Systems using Data Mining Techniques

This study proposes a novel recommender system to provide the advertisements of context-aware services. Our proposed model is designed to apply a modified collaborative filtering (CF) algorithm with regard to the several dimensions for the personalization of mobile devices – location, time and the user-s needs type. In particular, we employ a classification rule to understand user-s needs type using a decision tree algorithm. In addition, we collect primary data from the mobile phone users and apply them to the proposed model to validate its effectiveness. Experimental results show that the proposed system makes more accurate and satisfactory advertisements than comparative systems.

Artificial Intelligence Support for Interferon Treatment Decision in Chronic Hepatitis B

Chronic hepatitis B can evolve to cirrhosis and liver cancer. Interferon is the only effective treatment, for carefully selected patients, but it is very expensive. Some of the selection criteria are based on liver biopsy, an invasive, costly and painful medical procedure. Therefore, developing efficient non-invasive selection systems, could be in the patients benefit and also save money. We investigated the possibility to create intelligent systems to assist the Interferon therapeutical decision, mainly by predicting with acceptable accuracy the results of the biopsy. We used a knowledge discovery in integrated medical data - imaging, clinical, and laboratory data. The resulted intelligent systems, tested on 500 patients with chronic hepatitis B, based on C5.0 decision trees and boosting, predict with 100% accuracy the results of the liver biopsy. Also, by integrating the other patients selection criteria, they offer a non-invasive support for the correct Interferon therapeutic decision. To our best knowledge, these decision systems outperformed all similar systems published in the literature, and offer a realistic opportunity to replace liver biopsy in this medical context.

Distributed Data-Mining by Probability-Based Patterns

In this paper a new method is suggested for distributed data-mining by the probability patterns. These patterns use decision trees and decision graphs. The patterns are cared to be valid, novel, useful, and understandable. Considering a set of functions, the system reaches to a good pattern or better objectives. By using the suggested method we will be able to extract the useful information from massive and multi-relational data bases.

Dynamic Features Selection for Heart Disease Classification

The healthcare environment is generally perceived as being information rich yet knowledge poor. However, there is a lack of effective analysis tools to discover hidden relationships and trends in data. In fact, valuable knowledge can be discovered from application of data mining techniques in healthcare system. In this study, a proficient methodology for the extraction of significant patterns from the Coronary Heart Disease warehouses for heart attack prediction, which unfortunately continues to be a leading cause of mortality in the whole world, has been presented. For this purpose, we propose to enumerate dynamically the optimal subsets of the reduced features of high interest by using rough sets technique associated to dynamic programming. Therefore, we propose to validate the classification using Random Forest (RF) decision tree to identify the risky heart disease cases. This work is based on a large amount of data collected from several clinical institutions based on the medical profile of patient. Moreover, the experts- knowledge in this field has been taken into consideration in order to define the disease, its risk factors, and to establish significant knowledge relationships among the medical factors. A computer-aided system is developed for this purpose based on a population of 525 adults. The performance of the proposed model is analyzed and evaluated based on set of benchmark techniques applied in this classification problem.