Abstract: This study was designed enable application of multivariate technique in the interpretation of categorical data for measuring health care services satisfaction in Turkey. The data was collected from a total of 17726 respondents. The establishment of the sample group and collection of the data were carried out by a joint team from The Ministry of Health and Turkish Statistical Institute (Turk Stat) of Turkey. The multiple correspondence analysis (MCA) was used on the data of 2882 respondents who answered the questionnaire in full. The multiple correspondence analysis indicated that, in the evaluation of health services females, public employees, younger and more highly educated individuals were more concerned and complainant than males, private sector employees, older and less educated individuals. Overall 53 % of the respondents were pleased with the improvements in health care services in the past three years. This study demonstrates the public consciousness in health services and health care satisfaction in Turkey. It was found that most the respondents were pleased with the improvements in health care services over the past three years. Awareness of health service quality increases with education levels. Older individuals and males would appear to have lower expectancies in health services.
Abstract: Clustering is a well known data mining technique used in pattern recognition and information retrieval. The initial dataset to be clustered can either contain categorical or numeric data. Each type of data has its own specific clustering algorithm. In this context, two algorithms are proposed: the k-means for clustering numeric datasets and the k-modes for categorical datasets. The main encountered problem in data mining applications is clustering categorical dataset so relevant in the datasets. One main issue to achieve the clustering process on categorical values is to transform the categorical attributes into numeric measures and directly apply the k-means algorithm instead the k-modes. In this paper, it is proposed to experiment an approach based on the previous issue by transforming the categorical values into numeric ones using the relative frequency of each modality in the attributes. The proposed approach is compared with a previously method based on transforming the categorical datasets into binary values. The scalability and accuracy of the two methods are experimented. The obtained results show that our proposed method outperforms the binary method in all cases.
Abstract: Over the past epoch a rampant amount of work has been done in the data clustering research under the unsupervised learning technique in Data mining. Furthermore several algorithms and methods have been proposed focusing on clustering different data types, representation of cluster models, and accuracy rates of the clusters. However no single clustering algorithm proves to be the most efficient in providing best results. Accordingly in order to find the solution to this issue a new technique, called Cluster ensemble method was bloomed. This cluster ensemble is a good alternative approach for facing the cluster analysis problem. The main hope of the cluster ensemble is to merge different clustering solutions in such a way to achieve accuracy and to improve the quality of individual data clustering. Due to the substantial and unremitting development of new methods in the sphere of data mining and also the incessant interest in inventing new algorithms, makes obligatory to scrutinize a critical analysis of the existing techniques and the future novelty. This paper exposes the comparative study of different cluster ensemble methods along with their features, systematic working process and the average accuracy and error rates of each ensemble methods. Consequently this speculative and comprehensive analysis will be very useful for the community of clustering practitioners and also helps in deciding the most suitable one to rectify the problem in hand.
Abstract: In this paper, we present a simplified higher-order Markov chain model for multiple categorical data sequences also called as simplified higher-order multivariate Markov chain model.
Abstract: We compare three categorical data clustering
algorithms with respect to the problem of classifying cultural data
related to the aesthetic judgment of comics artists. Such a
classification is very important in Comics Art theory since the
determination of any classes of similarities in such kind of data will
provide to art-historians very fruitful information of Comics Art-s
evolution. To establish this, we use a categorical data set and we
study it by employing three categorical data clustering algorithms.
The performances of these algorithms are compared each other,
while interpretations of the clustering results are also given.
Abstract: Clustering categorical data is more complicated than
the numerical clustering because of its special properties. Scalability
and memory constraint is the challenging problem in clustering large
data set. This paper presents an incremental algorithm to cluster the
categorical data. Frequencies of attribute values contribute much in
clustering similar categorical objects. In this paper we propose new
similarity measures based on the frequencies of attribute values and
its cardinalities. The proposed measures and the algorithm are
experimented with the data sets from UCI data repository. Results
prove that the proposed method generates better clusters than the
existing one.
Abstract: We develop a three-step fuzzy logic-based algorithm for clustering categorical attributes, and we apply it to analyze cultural data. In the first step the algorithm employs an entropy-based clustering scheme, which initializes the cluster centers. In the second step we apply the fuzzy c-modes algorithm to obtain a fuzzy partition of the data set, and the third step introduces a novel cluster validity index, which decides the final number of clusters.
Abstract: A Matlab based software for logistic regression is developed to enhance the process of teaching quantitative topics and assist researchers with analyzing wide area of applications where categorical data is involved. The software offers an option of performing stepwise logistic regression to select the most significant predictors. The software includes a feature to detect influential observations in data, and investigates the effect of dropping or misclassifying an observation on a predictor variable. The input data may consist either as a set of individual responses (yes/no) with the predictor variables or as grouped records summarizing various categories for each unique set of predictor variables' values. Graphical displays are used to output various statistical results and to assess the goodness of fit of the logistic regression model. The software recognizes possible convergence constraints when present in data, and the user is notified accordingly.
Abstract: Lacking an inherent “natural" dissimilarity measure
between objects in categorical dataset presents special difficulties in
clustering analysis. However, each categorical attributes from a given
dataset provides natural probability and information in the sense of
Shannon. In this paper, we proposed a novel method which
heuristically converts categorical attributes to numerical values by
exploiting such associated information. We conduct an experimental
study with real-life categorical dataset. The experiment demonstrates
the effectiveness of our approach.
Abstract: K-Modes is an extension of K-Means clustering algorithm, developed to cluster the categorical data, where the mean is replaced by the mode. The similarity measure proposed by Huang is the simple matching or mismatching measure. Weight of attribute values contribute much in clustering; thus in this paper we propose a new weighted dissimilarity measure for K-Modes, based on the ratio of frequency of attribute values in the cluster and in the data set. The new weighted measure is experimented with the data sets obtained from the UCI data repository. The results are compared with K-Modes and K-representative, which show that the new measure generates clusters with high purity.
Abstract: Transport and land use are two systems that are
mutually influenced. Their interaction is a complex process
associated with continuous feedback. The paper examines the
existing land use around an under construction metro station of the
new metro network of Thessaloniki, Greece, through the use of field
investigations, around the station-s predefined location. Moreover,
except from the analytical land use recording, a sampling
questionnaire survey is addressed to several selected enterprises of
the study area. The survey aims to specify the characteristics of the
enterprises, the trip patterns of their employees and clients, as well as
the stated preferences towards the changes the new metro station is
considered to bring to the area. The interpretation of the interrelationships
among selected data from the questionnaire survey takes
place using the method of Principal Components Analysis for
Categorical Data. The followed methodology and the survey-s results
contribute to the enrichment of the relevant bibliography concerning
the way the creation of a new metro station can have an impact on the
land use pattern of an area, by examining the situation before the
operation of the station.
Abstract: Categorical data based on description of the
agricultural landscape imposed some mathematical and analytical
limitations. This problem however can be overcome by data
transformation through coding scheme and the use of non-parametric
multivariate approach. The present study describes data
transformation from qualitative to numerical descriptors. In a
collection of 103 random soil samples over a 60 hectare field,
categorical data were obtained from the following variables: levels of
nitrogen, phosphorus, potassium, pH, hue, chroma, value and data on
topography, vegetation type, and the presence of rocks. Categorical
data were coded, and Spearman-s rho correlation was then calculated
using PAST software ver. 1.78 in which Principal Component
Analysis was based. Results revealed successful data transformation,
generating 1030 quantitative descriptors. Visualization based on the
new set of descriptors showed clear differences among sites, and
amount of variation was successfully measured. Possible applications
of data transformation are discussed.
Abstract: Clustering large populations is an important problem
when the data contain noise and different shapes. A good clustering
algorithm or approach should be efficient enough to detect clusters
sensitively. Besides space complexity, time complexity also gains
importance as the size grows. Using hierarchies we developed a new
algorithm to split attributes according to the values they have and
choosing the dimension for splitting so as to divide the database
roughly into equal parts as much as possible. At each node we
calculate some certain descriptive statistical features of the data
which reside and by pruning we generate the natural clusters with a
complexity of O(n).