A Hybrid Feature Selection by Resampling, Chi squared and Consistency Evaluation Techniques

In this paper a combined feature selection method is proposed which takes advantages of sample domain filtering, resampling and feature subset evaluation methods to reduce dimensions of huge datasets and select reliable features. This method utilizes both feature space and sample domain to improve the process of feature selection and uses a combination of Chi squared with Consistency attribute evaluation methods to seek reliable features. This method consists of two phases. The first phase filters and resamples the sample domain and the second phase adopts a hybrid procedure to find the optimal feature space by applying Chi squared, Consistency subset evaluation methods and genetic search. Experiments on various sized datasets from UCI Repository of Machine Learning databases show that the performance of five classifiers (Naïve Bayes, Logistic, Multilayer Perceptron, Best First Decision Tree and JRIP) improves simultaneously and the classification error for these classifiers decreases considerably. The experiments also show that this method outperforms other feature selection methods.

Evaluating some Feature Selection Methods for an Improved SVM Classifier

Text categorization is the problem of classifying text documents into a set of predefined classes. After a preprocessing step the documents are typically represented as large sparse vectors. When training classifiers on large collections of documents, both the time and memory restrictions can be quite prohibitive. This justifies the application of features selection methods to reduce the dimensionality of the document-representation vector. Four feature selection methods are evaluated: Random Selection, Information Gain (IG), Support Vector Machine (called SVM_FS) and Genetic Algorithm with SVM (GA_FS). We showed that the best results were obtained with SVM_FS and GA_FS methods for a relatively small dimension of the features vector comparative with the IG method that involves longer vectors, for quite similar classification accuracies. Also we present a novel method to better correlate SVM kernel-s parameters (Polynomial or Gaussian kernel).

Zero Inflated Models for Overdispersed Count Data

The zero inflated models are usually used in modeling count data with excess zeros where the existence of the excess zeros could be structural zeros or zeros which occur by chance. These type of data are commonly found in various disciplines such as finance, insurance, biomedical, econometrical, ecology, and health sciences which involve sex and health dental epidemiology. The most popular zero inflated models used by many researchers are zero inflated Poisson and zero inflated negative binomial models. In addition, zero inflated generalized Poisson and zero inflated double Poisson models are also discussed and found in some literature. Recently zero inflated inverse trinomial model and zero inflated strict arcsine models are advocated and proven to serve as alternative models in modeling overdispersed count data caused by excessive zeros and unobserved heterogeneity. The purpose of this paper is to review some related literature and provide a variety of examples from different disciplines in the application of zero inflated models. Different model selection methods used in model comparison are discussed.

The Study on the Stationarity of Energy Consumption in US States: Considering Structural Breaks, Nonlinearity, and Cross- Sectional Dependency

This study applies the sequential panel selection method (SPSM) procedure proposed by Chortareas and Kapetanios (2009) to investigate the time-series properties of energy consumption in 50 US states from 1963 to 2009. SPSM involves the classification of the entire panel into a group of stationary series and a group of non-stationary series to identify how many and which series in the panel are stationary processes. Empirical results obtained through SPSM with the panel KSS unit root test developed by Ucar and Omay (2009) combined with a Fourier function indicate that energy consumption in all the 50 US states are stationary. The results of this study have important policy implications for the 50 US states.

Improving Worm Detection with Artificial Neural Networks through Feature Selection and Temporal Analysis Techniques

Computer worm detection is commonly performed by antivirus software tools that rely on prior explicit knowledge of the worm-s code (detection based on code signatures). We present an approach for detection of the presence of computer worms based on Artificial Neural Networks (ANN) using the computer's behavioral measures. Identification of significant features, which describe the activity of a worm within a host, is commonly acquired from security experts. We suggest acquiring these features by applying feature selection methods. We compare three different feature selection techniques for the dimensionality reduction and identification of the most prominent features to capture efficiently the computer behavior in the context of worm activity. Additionally, we explore three different temporal representation techniques for the most prominent features. In order to evaluate the different techniques, several computers were infected with five different worms and 323 different features of the infected computers were measured. We evaluated each technique by preprocessing the dataset according to each one and training the ANN model with the preprocessed data. We then evaluated the ability of the model to detect the presence of a new computer worm, in particular, during heavy user activity on the infected computers.

Selection Initial modes for Belief K-modes Method

The belief K-modes method (BKM) approach is a new clustering technique handling uncertainty in the attribute values of objects in both the cluster construction task and the classification one. Like the standard version of this method, the BKM results depend on the chosen initial modes. So, one selection method of initial modes is developed, in this paper, aiming at improving the performances of the BKM approach. Experiments with several sets of real data show that by considered the developed selection initial modes method, the clustering algorithm produces more accurate results.