Abstract: K-Modes is an extension of the K-Means clustering algorithm, developed to cluster categorical data, where the mean is replaced by the mode. The similarity measure proposed by Huang is the simple matching/mismatching measure. The weights of attribute values contribute much to clustering; thus, in this paper we propose a new weighted dissimilarity measure for K-Modes, based on the ratio of the frequency of an attribute value in the cluster to its frequency in the data set. The new weighted measure is evaluated on data sets obtained from the UCI data repository. The results are compared with those of K-Modes and K-representative, and show that the new measure generates clusters with high purity.
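As a rough illustration of such a frequency-ratio weighting, the sketch below scores a mismatching attribute with 1, as in Huang's simple matching measure, and scores a matching attribute by one minus the ratio of the value's count in the cluster to its count in the whole data set. The paper's exact formula is not reproduced here, so this particular weighting and the function name `weighted_dissim` are illustrative assumptions.

```python
def weighted_dissim(x, mode, cluster, data):
    """Weighted K-Modes-style dissimilarity between object x and a cluster
    mode. A mismatch contributes 1; a match contributes a weight reflecting
    how concentrated the value is in the cluster relative to the data set
    (illustrative assumption, not the paper's exact measure)."""
    d = 0.0
    for j, (xj, mj) in enumerate(zip(x, mode)):
        if xj != mj:
            d += 1.0                                    # plain mismatch
        else:
            in_cluster = sum(1 for r in cluster if r[j] == xj)
            in_data = sum(1 for r in data if r[j] == xj)
            d += 1.0 - in_cluster / in_data             # frequency-ratio weight
    return d
```

A value that appears in the cluster as often as in the whole data set thus contributes nothing to the dissimilarity, while a match on a value that is common elsewhere still carries some cost.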
Abstract: There are many situations where input feature vectors are incomplete, and methods to tackle the problem have been studied for a long time. A commonly used procedure is to replace each missing value with an imputation. This paper presents a method to impute categorical missing data from numerical and categorical variables. The imputations are based on Simpson's fuzzy min-max neural networks, in which the input variables for learning and classification are numerical only. The proposed method extends the input to categorical variables by introducing new fuzzy sets, a new operation, and a new architecture. The procedure is tested and compared with other methods using opinion poll data.
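For context, the numerical machinery the paper extends can be sketched as a hyperbox membership function in the spirit of Simpson's fuzzy min-max networks: membership is 1 inside the box and decays with distance outside it. This is a simplified generic sketch, not the paper's formulation, and it covers only the numerical case that the paper's categorical fuzzy sets build upon.

```python
def fmm_membership(x, v, w, gamma=4.0):
    """Membership of point x in the hyperbox with min corner v and max
    corner w: 1 inside the box, decaying outside at a rate set by gamma.
    (Simplified sketch of Simpson-style numerical membership.)"""
    def ramp(z):                        # clip to [0, 1]
        return max(0.0, min(1.0, z))
    n = len(x)
    return sum(
        1.0 - ramp(gamma * (x[i] - w[i])) - ramp(gamma * (v[i] - x[i]))
        for i in range(n)
    ) / n
```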
Abstract: Notwithstanding the widespread application of finite mixture models in segmentation, finite mixture model selection is still an important issue. In fact, the selection of an adequate number of segments is a key issue in deriving latent segment structures, and it is desirable that the selection criteria used for this end be effective. We conduct a simulation study in order to choose among several information criteria that may support the selection of the correct number of segments. In particular, this study is intended to determine which information criteria are most appropriate for mixture model selection when considering data sets with only categorical segmentation base variables. The generation of mixtures of multinomial data supports the proposed analysis. As a result, we establish a relationship between the level of measurement of segmentation variables and the performance of eleven information criteria. The criterion AIC3 shows the best performance (it indicates the correct number of the simulated segments' structure most often) for mixtures of multinomial segmentation base variables.
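Criteria of this family all take the form -2 log L plus a complexity penalty; AIC3 simply replaces AIC's penalty factor 2 with 3. The sketch below computes three such criteria for a fitted mixture (showing AIC, AIC3, and BIC is an illustrative selection from the eleven compared in the study).

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Common criteria of the form -2*logL + penalty; the number of
    segments minimising the chosen criterion is selected."""
    return {
        "AIC":  -2.0 * log_likelihood + 2.0 * n_params,
        "AIC3": -2.0 * log_likelihood + 3.0 * n_params,
        "BIC":  -2.0 * log_likelihood + n_params * math.log(n_obs),
    }
```

In a model-selection loop, one fits mixtures with 1, 2, 3, ... segments and keeps the number of segments that minimises the chosen criterion, here AIC3.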
Abstract: Transport and land use are two mutually influencing systems. Their interaction is a complex process involving continuous feedback. The paper examines the existing land use around an under-construction metro station of the new metro network of Thessaloniki, Greece, through field investigations around the station's predefined location. Moreover, beyond the analytical land use recording, a sampling questionnaire survey is addressed to selected enterprises in the study area. The survey aims to specify the characteristics of the enterprises, the trip patterns of their employees and clients, and the stated preferences towards the changes the new metro station is expected to bring to the area. The interrelationships among selected data from the questionnaire survey are interpreted using the method of Principal Components Analysis for Categorical Data. The methodology followed and the survey's results enrich the relevant bibliography concerning the way the creation of a new metro station can affect the land use pattern of an area, by examining the situation before the station begins operating.
Abstract: Clustering is an interesting data mining topic that can be applied in many fields. Recently, the problem of cluster analysis has been formulated as a problem of nonsmooth, nonconvex optimization, and an algorithm for solving the cluster analysis problem based on nonsmooth optimization techniques has been developed. This optimization problem has a number of characteristics that make it challenging: it has many local minima, the optimization variables can be either continuous or categorical, and there are no exact analytical derivatives. In this study we show how to apply a particular class of optimization methods, known as pattern search methods, to address these challenges. These methods do not explicitly use derivatives, an important feature that has not been addressed in previous studies. Results of numerical experiments are presented which demonstrate the effectiveness of the proposed method.
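A minimal representative of this class of methods is compass (coordinate) pattern search: poll trial points along each axis at the current step size, move on improvement, and shrink the step when no poll point improves. The sketch below is a generic illustration for continuous variables only, not the paper's implementation.

```python
def pattern_search(f, x0, step=1.0, tol=1e-6, max_iter=10_000):
    """Derivative-free compass search minimising f from starting point x0."""
    x = list(x0)
    fx = f(x)
    it = 0
    while step > tol and it < max_iter:
        improved = False
        for i in range(len(x)):
            for sign in (1.0, -1.0):
                trial = x[:]
                trial[i] += sign * step      # poll along axis i
                ft = f(trial)
                if ft < fx:
                    x, fx, improved = trial, ft, True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5                      # refine the mesh
        it += 1
    return x, fx
```

Because only function values are compared, no derivative information is needed, which is what makes the approach applicable when exact analytical derivatives are unavailable.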
Abstract: Categorical data based on descriptions of the agricultural landscape impose some mathematical and analytical limitations. This problem, however, can be overcome by data transformation through a coding scheme and the use of a non-parametric multivariate approach. The present study describes data transformation from qualitative to numerical descriptors. In a collection of 103 random soil samples over a 60-hectare field, categorical data were obtained for the following variables: levels of nitrogen, phosphorus, potassium, pH, hue, chroma, value, and data on topography, vegetation type, and the presence of rocks. Categorical data were coded, and Spearman's rho correlation was then calculated using PAST software ver. 1.78, on which Principal Component Analysis was based. Results revealed a successful data transformation, generating 1030 quantitative descriptors. Visualization based on the new set of descriptors showed clear differences among sites, and the amount of variation was successfully measured. Possible applications of the data transformation are discussed.
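The core of such a pipeline, coding ordered categorical levels as numbers and computing a rank correlation, can be sketched as below. The study used PAST software; this is a pure-Python illustration of the computation itself, and the specific `CODES` mapping is an assumed example, not the paper's coding scheme.

```python
def rank(values):
    """1-based average ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                       # extend the tie group
        avg = (i + j) / 2.0 + 1.0        # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative ordinal coding of a categorical soil variable
# (assumed example; the paper's actual coding scheme is not reproduced).
CODES = {"low": 1, "medium": 2, "high": 3}
```

Because Spearman's rho depends only on ranks, any coding that preserves the order of the levels yields the same correlation, which is what makes the qualitative-to-numerical transformation defensible.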
Abstract: Clustering large populations is an important problem when the data contain noise and clusters of different shapes. A good clustering algorithm or approach should be efficient enough to detect clusters sensitively. Besides space complexity, time complexity also gains importance as the data size grows. Using hierarchies, we developed a new algorithm that splits attributes according to the values they take, choosing the splitting dimension so as to divide the database into parts as nearly equal as possible. At each node we calculate certain descriptive statistical features of the data residing there, and by pruning we generate the natural clusters with a complexity of O(n).
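One illustrative reading of the splitting step is: for each attribute, measure how evenly its values partition the current records and split on the most balanced one. The balance criterion below (minimise the largest value group's share) is an assumption for illustration, not the authors' code.

```python
from collections import Counter

def best_split_attribute(rows):
    """Pick the attribute whose value groups are closest to equal-sized,
    so each split divides the database as evenly as possible
    (illustrative balance criterion: smallest largest-group share)."""
    n_attrs = len(rows[0])
    best, best_share = None, 2.0
    for j in range(n_attrs):
        counts = Counter(r[j] for r in rows)
        share = max(counts.values()) / len(rows)
        if share < best_share:
            best, best_share = j, share
    return best

def split(rows, j):
    """Partition rows into one group per value of attribute j."""
    groups = {}
    for r in rows:
        groups.setdefault(r[j], []).append(r)
    return groups
```

Applying this recursively yields the hierarchy at whose nodes the descriptive statistics are computed; each record is touched a bounded number of times per level, which is where the linear-time behaviour comes from.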