Abstract: In this paper a combined feature selection method is
proposed that takes advantage of sample-domain filtering,
resampling, and feature subset evaluation to reduce the
dimensionality of large datasets and select reliable features. The
method exploits both the feature space and the sample domain to
improve the feature selection process, and combines the Chi-squared
and Consistency attribute evaluation methods to identify reliable
features. It consists of two phases: the first phase filters and
resamples the sample domain, and the second phase adopts a hybrid
procedure that applies Chi-squared, Consistency subset evaluation,
and genetic search to find an optimal feature space. Experiments on
datasets of various sizes from the UCI Machine Learning Repository
show that the performance of five classifiers (Naïve Bayes,
Logistic, Multilayer Perceptron, Best-First Decision Tree, and
JRIP) improves simultaneously and that the classification error of
these classifiers decreases considerably. The experiments also show
that this method outperforms other feature selection methods.
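The hybrid of Chi-squared ranking, Consistency subset evaluation, and genetic search described above is specific to the paper; as a minimal sketch of the Chi-squared filtering step alone (the function names and toy data below are my own, not the authors' code), one can score each categorical feature against the class labels and keep the top-ranked ones:

```python
from collections import Counter

def chi2_score(feature, labels):
    """Chi-squared statistic of one categorical feature vs. class labels.
    Larger values indicate a stronger feature-class association."""
    n = len(feature)
    obs = Counter(zip(feature, labels))      # observed cell counts
    f_tot = Counter(feature)                 # feature-value marginals
    c_tot = Counter(labels)                  # class marginals
    score = 0.0
    for v in f_tot:
        for c in c_tot:
            expected = f_tot[v] * c_tot[c] / n
            score += (obs[(v, c)] - expected) ** 2 / expected
    return score

def select_top_k(X, y, k):
    """Rank feature columns of X by Chi-squared score, keep top-k indices."""
    scores = [(chi2_score([row[j] for row in X], y), j)
              for j in range(len(X[0]))]
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Toy data: feature 0 matches the label perfectly, feature 1 is noise.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
```

In the paper's pipeline this filter would only be the first ranking step, followed by consistency-based subset evaluation with a genetic search over the reduced space.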
Abstract: Text categorization is the problem of classifying text documents into a set of predefined classes. After a preprocessing step, the documents are typically represented as large sparse vectors. When training classifiers on large collections of documents, both time and memory restrictions can be quite prohibitive. This justifies applying feature selection methods to reduce the dimensionality of the document-representation vector. Four feature selection methods are evaluated: Random Selection, Information Gain (IG), Support Vector Machine (SVM_FS), and Genetic Algorithm with SVM (GA_FS). We show that the best results were obtained with the SVM_FS and GA_FS methods for a relatively small feature-vector dimension, compared with the IG method, which requires longer vectors for quite similar classification accuracies. We also present a novel method to better correlate the SVM kernel's parameters (Polynomial or Gaussian kernel).
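SVM-based feature selection typically ranks features by the magnitude of the weights of a trained linear SVM. The abstract does not give the authors' exact procedure, so the sketch below is a hedged stand-in: a small Pegasos-style subgradient trainer (pure Python, my own toy data) whose weight vector is then used to order the features:

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style subgradient descent for a linear SVM (labels +/-1).
    Returns the learned weight vector."""
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = y[i] * sum(w[j] * X[i][j] for j in range(d))
            for j in range(d):
                w[j] *= (1 - eta * lam)    # shrink (regularization)
                if margin < 1:             # hinge-loss subgradient step
                    w[j] += eta * y[i] * X[i][j]
    return w

def rank_features_by_weight(w):
    """Feature indices sorted by decreasing |w_j| (most informative first)."""
    return sorted(range(len(w)), key=lambda j: -abs(w[j]))

# Toy data: feature 0 separates the classes, feature 1 is small noise.
X = [[1.0, 0.1], [1.0, -0.2], [-1.0, 0.15], [-1.0, -0.1]]
y = [1, 1, -1, -1]
```

A genetic-algorithm variant such as GA_FS would instead search over feature subsets, using SVM accuracy as the fitness function.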
Abstract: Zero-inflated models are commonly used to model count
data with excess zeros, where the excess zeros may be structural
zeros or zeros that occur by chance. This type of data is found in
various disciplines such as finance, insurance, biomedicine,
econometrics, ecology, and the health sciences, including sexual
health and dental epidemiology. The most popular zero-inflated
models used by researchers are the zero-inflated Poisson and
zero-inflated negative binomial models. In addition, zero-inflated
generalized Poisson and zero-inflated double Poisson models are
also discussed in the literature. Recently, the zero-inflated
inverse trinomial and zero-inflated strict arcsine models have been
advocated and shown to serve as alternative models for
overdispersed count data caused by excess zeros and unobserved
heterogeneity. The purpose of this paper is to review the related
literature and provide a variety of examples of the application of
zero-inflated models in different disciplines. Different model
selection methods used in model comparison are also discussed.
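To make the structural-zero idea concrete: under a zero-inflated Poisson model, a zero arises either as a structural zero (with mixing probability pi) or from the Poisson component by chance. A minimal sketch of the probability mass function (not tied to any specific paper reviewed here):

```python
from math import exp, factorial

def zip_pmf(k, lam, pi):
    """P(Y = k) under a zero-inflated Poisson model:
    with probability pi the outcome is a structural zero,
    otherwise Y ~ Poisson(lam)."""
    poisson = exp(-lam) * lam ** k / factorial(k)
    if k == 0:
        # zeros come from BOTH the structural and the Poisson component
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson
```

Setting pi = 0 recovers the ordinary Poisson distribution, and any pi > 0 inflates the probability at zero relative to it, which is exactly what excess-zero count data require.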
Abstract: This study applies the sequential panel selection
method (SPSM) proposed by Chortareas and Kapetanios (2009) to
investigate the time-series properties of energy consumption in 50
US states from 1963 to 2009. SPSM partitions the entire panel into
a group of stationary series and a group of non-stationary series,
identifying how many and which series in the panel are stationary
processes. Empirical results obtained through SPSM with the panel
KSS unit root test developed by Ucar and Omay (2009), combined with
a Fourier function, indicate that energy consumption is stationary
in all 50 US states. The results of this study have important
policy implications for the 50 US states.
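The SPSM logic, in outline, is: test the panel, remove the series with the strongest evidence of stationarity, and repeat until the remaining panel no longer rejects the unit-root null. The sketch below illustrates that loop with a plain Dickey-Fuller t-statistic as a stand-in for the panel KSS test with a Fourier function (the simplification, critical value, and synthetic data are my own assumptions):

```python
import random
from math import sqrt

def df_tstat(y):
    """t-statistic of phi in: Delta y_t = phi * y_{t-1} + e_t (no constant).
    A strongly negative value is evidence against a unit root."""
    lag = y[:-1]
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]
    sxx = sum(x * x for x in lag)
    phi = sum(x * d for x, d in zip(lag, dy)) / sxx
    resid = [d - phi * x for x, d in zip(lag, dy)]
    s2 = sum(e * e for e in resid) / (len(dy) - 1)
    return phi / sqrt(s2 / sxx)

def spsm(panel, crit=-2.86):
    """Sequential panel selection: repeatedly remove the series with the
    most negative unit-root statistic while it rejects the null."""
    remaining = dict(enumerate(panel))
    stationary = []
    while remaining:
        stats = {i: df_tstat(y) for i, y in remaining.items()}
        i_min = min(stats, key=stats.get)
        if stats[i_min] >= crit:   # no remaining series rejects: stop
            break
        stationary.append(i_min)   # classify as stationary, drop, re-test
        del remaining[i_min]
    return stationary, sorted(remaining)

def ar1(phi, n, rng):
    """Synthetic AR(1) series; phi = 1.0 gives a random walk."""
    y, out = 0.0, []
    for _ in range(n):
        y = phi * y + rng.gauss(0.0, 1.0)
        out.append(y)
    return out
```

For example, a panel of two mean-reverting AR(1) series plus one random walk should see the two stationary series removed first.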
Abstract: Computer worm detection is commonly performed by
antivirus software tools that rely on prior explicit knowledge of
the worm's code (detection based on code signatures). We present an
approach for detecting the presence of computer worms based on
Artificial Neural Networks (ANN) using the computer's behavioral
measures. Identification of significant features that describe the
activity of a worm within a host is commonly obtained from security
experts. We suggest acquiring these features instead by applying
feature selection methods. We compare three feature selection
techniques for dimensionality reduction and for identifying the
most prominent features, so as to efficiently capture computer
behavior in the context of worm activity. Additionally, we explore
three temporal representation techniques for the most prominent
features. To evaluate the different techniques, several computers
were infected with five different worms and 323 features of the
infected computers were measured. We evaluated each technique by
preprocessing the dataset accordingly and training the ANN model on
the preprocessed data. We then evaluated the model's ability to
detect the presence of a new computer worm, in particular during
heavy user activity on the infected computers.
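The abstract names three temporal representations of each behavioral measure but does not define them; the sketch below shows three plausible stand-ins of my own (window mean, most recent sample, and least-squares trend) applied to a window of per-feature measurements, simply to illustrate what "temporal representation" means here:

```python
def window_mean(xs):
    """Summarize a measurement window by its average level."""
    return sum(xs) / len(xs)

def window_last(xs):
    """Summarize a window by its most recent sample."""
    return xs[-1]

def window_trend(xs):
    """Least-squares slope over the window (rate of change),
    with time indices 0..n-1 as the regressor."""
    n = len(xs)
    t_mean = (n - 1) / 2
    x_mean = sum(xs) / n
    num = sum((t - t_mean) * (x - x_mean) for t, x in enumerate(xs))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def featurize(windows, rep):
    """Apply one temporal representation to every monitored measure;
    `windows` is a list of per-feature measurement windows."""
    return [rep(w) for w in windows]
```

Each choice of `rep` turns the same raw measurements into a different fixed-length input vector for the ANN, which is what makes the three representations comparable.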
Abstract: The belief K-modes method (BKM) is a clustering
technique that handles uncertainty in the attribute values of
objects in both the cluster construction task and the
classification task. As in the standard K-modes method, the BKM
results depend on the chosen initial modes. In this paper, a method
for selecting the initial modes is therefore developed, aiming at
improving the performance of the BKM approach. Experiments with
several real datasets show that, with the developed initial-mode
selection method, the clustering algorithm produces more accurate
results.
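The paper's selection method is built on belief functions and is not spelled out in the abstract; as a loose illustration of why initial-mode selection matters at all, here is a generic density-then-separation heuristic for categorical data (entirely my own stand-in, in the spirit of density-based K-modes initialization, not the BKM method):

```python
def hamming(a, b):
    """Number of mismatching categorical attributes between two objects."""
    return sum(x != y for x, y in zip(a, b))

def density(obj, data):
    """How 'central' an object is: average number of attribute matches
    between obj and every object in the dataset."""
    d = len(obj)
    return sum(d - hamming(obj, other) for other in data) / len(data)

def pick_initial_modes(data, k):
    """Density-then-separation heuristic: the first mode is the densest
    object; each later mode maximizes density weighted by its Hamming
    distance to the modes already chosen, spreading the modes apart."""
    modes = [max(data, key=lambda o: density(o, data))]
    while len(modes) < k:
        modes.append(max(
            (o for o in data if o not in modes),
            key=lambda o: density(o, data) * min(hamming(o, m)
                                                 for m in modes)))
    return modes

# Toy categorical data with two evident clusters.
data = [('a', 'x'), ('a', 'x'), ('a', 'y'),
        ('b', 'y'), ('b', 'y'), ('b', 'x')]
```

Starting K-modes from modes chosen this way, rather than at random, tends to land one mode per natural cluster, which is the same motivation the abstract gives for BKM.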