Abstract: In Data mining, Fuzzy clustering algorithms have
demonstrated advantage over crisp clustering algorithms in dealing
with the challenges posed by large collections of vague and uncertain
natural data. This paper reviews concept of fuzzy logic and fuzzy
clustering. The classical fuzzy c-means algorithm is presented and its
limitations are highlighted. Based on the study of the fuzzy c-means
algorithm and its extensions, we propose a modification to the cmeans
algorithm to overcome the limitations of it in calculating the
new cluster centers and in finding the membership values with
natural data. The efficiency of the new modified method is
demonstrated on real data collected for Bhutan-s Gross National
Happiness (GNH) program.
Abstract: This paper presents a software quality support tool, a
Java source code evaluator and a code profiler based on
computational intelligence techniques. It is Java prototype software
developed by AI Group [1] from the Research Laboratories at
Universidad de Palermo: an Intelligent Java Analyzer (in Spanish:
Analizador Java Inteligente, AJI). It represents a new approach to
evaluate and identify inaccurate source code usage and transitively,
the software product itself.
The aim of this project is to provide the software development
industry with a new tool to increase software quality by extending
the value of source code metrics through computational intelligence.
Abstract: The goal of this paper is to segment the countries
based on the value of export from Iran during 14 years ending at 2005. To measure the dissimilarity among export baskets of different countries, we define Dissimilarity Export Basket (DEB) function and
use this distance function in K-means algorithm. The DEB function
is defined based on the concepts of the association rules and the
value of export group-commodities. In this paper, clustering quality
function and clusters intraclass inertia are defined to, respectively,
calculate the optimum number of clusters and to compare the
functionality of DEB versus Euclidean distance. We have also study
the effects of importance weight in DEB function to improve
clustering quality. Lastly when segmentation is completed, a
designated RFM model is used to analyze the relative profitability of
each cluster.
Abstract: In blended learning environments, the Internet can be combined with other technologies. The aim of this research was to design, introduce and validate a model to support synchronous and asynchronous activities by managing content domains in an Adaptive Hypermedia System (AHS). The application is based on information recovery techniques, clustering algorithms and adaptation rules to adjust the user's model to contents and objects of study. This system was applied to blended learning in higher education. The research strategy used was the case study method. Empirical studies were carried out on courses at two universities to validate the model. The results of this research show that the model had a positive effect on the learning process. The students indicated that the synchronous and asynchronous scenario is a good option, as it involves a combination of work with the lecturer and the AHS. In addition, they gave positive ratings to the system and stated that the contents were adapted to each user profile.
Abstract: Data clustering is an important data exploration technique
with many applications in data mining. We present an enhanced
version of the well known single link clustering algorithm. We will
refer to this algorithm as DCBOR. The proposed algorithm alleviates
the chain effect by removing the outliers from the given dataset.
So this algorithm provides outlier detection and data clustering
simultaneously. This algorithm does not need to update the distance
matrix, since the algorithm depends on merging the most k-nearest
objects in one step and the cluster continues grow as long as possible
under specified condition. So the algorithm consists of two phases;
at the first phase, it removes the outliers from the input dataset. At
the second phase, it performs the clustering process. This algorithm
discovers clusters of different shapes, sizes, densities and requires
only one input parameter; this parameter represents a threshold for
outlier points. The value of the input parameter is ranging from 0 to
1. The algorithm supports the user in determining an appropriate
value for it. We have tested this algorithm on different datasets
contain outlier and connecting clusters by chain of density points,
and the algorithm discovers the correct clusters. The results of
our experiments demonstrate the effectiveness and the efficiency of
DCBOR.
Abstract: This paper introduces a measure of similarity between
two clusterings of the same dataset produced by two different
algorithms, or even the same algorithm (K-means, for instance, with
different initializations usually produce different results in clustering
the same dataset). We then apply the measure to calculate the
similarity between pairs of clusterings, with special interest directed
at comparing the similarity between various machine clusterings and
human clustering of datasets. The similarity measure thus can be used
to identify the best (in terms of most similar to human) clustering
algorithm for a specific problem at hand. Experimental results
pertaining to the text categorization problem of a Portuguese corpus
(wherein a translation-into-English approach is used) are presented, as well as results on the well-known benchmark IRIS dataset. The
significance and other potential applications of the proposed measure
are discussed.
Abstract: Wavelet neural networks (WNNs) have emerged as a vital alternative to the vastly studied multilayer perceptrons (MLPs) since its first implementation. In this paper, we applied various clustering algorithms, namely, K-means (KM), Fuzzy C-means (FCM), symmetry-based K-means (SBKM), symmetry-based Fuzzy C-means (SBFCM) and modified point symmetry-based K-means (MPKM) clustering algorithms in choosing the translation parameter of a WNN. These modified WNNs are further applied to the heterogeneous cancer classification using benchmark microarray data and were compared against the conventional WNN with random initialization method. Experimental results showed that a WNN classifier with the MPKM algorithm is more precise than the conventional WNN as well as the WNNs with other clustering algorithms.
Abstract: Our objective in this paper is to propose an approach
capable of clustering web messages. The clustering is carried out by
assigning, with a certain probability, texts written by the same web
user to the same cluster based on Stylometric features and using
fuzzy clustering algorithms. Focus in the present work is on
comparing the most popular algorithms in fuzzy clustering theory
namely, Fuzzy C-means, Possibilistic C-means and Fuzzy
Possibilistic C-Means.
Abstract: Biological data has several characteristics that strongly differentiate it from typical business data. It is much more complex, usually large in size, and continuously changes. Until recently business data has been the main target for discovering trends, patterns or future expectations. However, with the recent rise in biotechnology, the powerful technology that was used for analyzing business data is now being applied to biological data. With the advanced technology at hand, the main trend in biological research is rapidly changing from structural DNA analysis to understanding cellular functions of the DNA sequences. DNA chips are now being used to perform experiments and DNA analysis processes are being used by researchers. Clustering is one of the important processes used for grouping together similar entities. There are many clustering algorithms such as hierarchical clustering, self-organizing maps, K-means clustering and so on. In this paper, we propose a clustering algorithm that imitates the ecosystem taking into account the features of biological data. We implemented the system using an Ant-Colony clustering algorithm. The system decides the number of clusters automatically. The system processes the input biological data, runs the Ant-Colony algorithm, draws the Topic Map, assigns clusters to the genes and displays the output. We tested the algorithm with a test data of 100 to1000 genes and 24 samples and show promising results for applying this algorithm to clustering DNA chip data.
Abstract: Microarray experiments are information rich; however, extensive data mining is required to identify the patterns that characterize the underlying mechanisms of action. For biologists, a key aim when analyzing microarray data is to group genes based on the temporal patterns of their expression levels. In this paper, we used an iterative clustering method to find temporal patterns of gene expression. We evaluated the performance of this method by applying it to real sporulation data and simulated data. The patterns obtained using the iterative clustering were found to be superior to those obtained using existing clustering algorithms.
Abstract: Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, a density based clustering algorithm (DCBRD) is presented, relying on a knowledge acquired from the data by dividing the data space into overlapped regions. The proposed algorithm discovers arbitrary shaped clusters, requires no input parameters and uses the same definitions of DBSCAN algorithm. We performed an experimental evaluation of the effectiveness and efficiency of it, and compared this results with that of DBSCAN. The results of our experiments demonstrate that the proposed algorithm is significantly efficient in discovering clusters of arbitrary shape and size.
Abstract: Although there have been many researches in cluster
analysis to consider on feature weights, little effort is made on sample
weights. Recently, Yu et al. (2011) considered a probability
distribution over a data set to represent its sample weights and then
proposed sample-weighted clustering algorithms. In this paper, we
give a sample-weighted version of generalized fuzzy clustering
regularization (GFCR), called the sample-weighted GFCR
(SW-GFCR). Some experiments are considered. These experimental
results and comparisons demonstrate that the proposed SW-GFCR is
more effective than the most clustering algorithms.
Abstract: Clustering algorithms help to understand the hidden
information present in datasets. A dataset may contain intrinsic and
nested clusters, the detection of which is of utmost importance. This
paper presents a Distributed Grid-based Density Clustering algorithm
capable of identifying arbitrary shaped embedded clusters as well as
multi-density clusters over large spatial datasets. For handling
massive datasets, we implemented our method using a 'sharednothing'
architecture where multiple computers are interconnected
over a network. Experimental results are reported to establish the
superiority of the technique in terms of scale-up, speedup as well as
cluster quality.
Abstract: Self-organizing map (SOM) is a well known data
reduction technique used in data mining. It can reveal structure in
data sets through data visualization that is otherwise hard to detect
from raw data alone. However, interpretation through visual
inspection is prone to errors and can be very tedious. There are
several techniques for the automatic detection of clusters of code
vectors found by SOM, but they generally do not take into account
the distribution of code vectors; this may lead to unsatisfactory
clustering and poor definition of cluster boundaries, particularly
where the density of data points is low. In this paper, we propose the
use of an adaptive heuristic particle swarm optimization (PSO)
algorithm for finding cluster boundaries directly from the code
vectors obtained from SOM. The application of our method to
several standard data sets demonstrates its feasibility. PSO algorithm
utilizes a so-called U-matrix of SOM to determine cluster boundaries;
the results of this novel automatic method compare very favorably to
boundary detection through traditional algorithms namely k-means
and hierarchical based approach which are normally used to interpret
the output of SOM.
Abstract: In the last decades, a number of robust fuzzy clustering algorithms have been proposed to partition data sets affected by noise and outliers. Robust fuzzy C-means (robust-FCM) is certainly one of the most known among these algorithms. In robust-FCM, noise is modeled as a separate cluster and is characterized by a prototype that has a constant distance δ from all data points. Distance δ determines the boundary of the noise cluster and therefore is a critical parameter of the algorithm. Though some approaches have been proposed to automatically determine the most suitable δ for the specific application, up to today an efficient and fully satisfactory solution does not exist. The aim of this paper is to propose a novel method to compute the optimal δ based on the analysis of the distribution of the percentage of objects assigned to the noise cluster in repeated executions of the robust-FCM with decreasing values of δ . The extremely encouraging results obtained on some data sets found in the literature are shown and discussed.
Abstract: In this paper we study the fuzzy c-mean clustering algorithm
combined with principal components method. Demonstratively
analysis indicate that the new clustering method is well rather than
some clustering algorithms. We also consider the validity of clustering
method.
Abstract: As wireless sensor networks are energy constraint networks
so energy efficiency of sensor nodes is the main design issue.
Clustering of nodes is an energy efficient approach. It prolongs the
lifetime of wireless sensor networks by avoiding long distance communication.
Clustering algorithms operate in rounds. Performance of
clustering algorithm depends upon the round time. A large round
time consumes more energy of cluster heads while a small round
time causes frequent re-clustering. So existing clustering algorithms
apply a trade off to round time and calculate it from the initial
parameters of networks. But it is not appropriate to use initial
parameters based round time value throughout the network lifetime
because wireless sensor networks are dynamic in nature (nodes can be
added to the network or some nodes go out of energy). In this paper
a variable round time approach is proposed that calculates round
time depending upon the number of active nodes remaining in the
field. The proposed approach makes the clustering algorithm adaptive
to network dynamics. For simulation the approach is implemented
with LEACH in NS-2 and the results show that there is 6% increase
in network lifetime, 7% increase in 50% node death time and 5%
improvement over the data units gathered at the base station.
Abstract: In this paper, a clustering algorithm named KHarmonic
means (KHM) was employed in the training of Radial
Basis Function Networks (RBFNs). KHM organized the data in
clusters and determined the centres of the basis function. The popular
clustering algorithms, namely K-means (KM) and Fuzzy c-means
(FCM), are highly dependent on the initial identification of elements
that represent the cluster well. In KHM, the problem can be avoided.
This leads to improvement in the classification performance when
compared to other clustering algorithms. A comparison of the
classification accuracy was performed between KM, FCM and KHM.
The classification performance is based on the benchmark data sets:
Iris Plant, Diabetes and Breast Cancer. RBFN training with the KHM
algorithm shows better accuracy in classification problem.
Abstract: This paper presents a text clustering system developed based on a k-means type subspace clustering algorithm to cluster large, high dimensional and sparse text data. In this algorithm, a new step is added in the k-means clustering process to automatically calculate the weights of keywords in each cluster so that the important words of a cluster can be identified by the weight values. For understanding and interpretation of clustering results, a few keywords that can best represent the semantic topic are extracted from each cluster. Two methods are used to extract the representative words. The candidate words are first selected according to their weights calculated by our new algorithm. Then, the candidates are fed to the WordNet to identify the set of noun words and consolidate the synonymy and hyponymy words. Experimental results have shown that the clustering algorithm is superior to the other subspace clustering algorithms, such as PROCLUS and HARP and kmeans type algorithm, e.g., Bisecting-KMeans. Furthermore, the word extraction method is effective in selection of the words to represent the topics of the clusters.
Abstract: Most of fuzzy clustering algorithms have some
discrepancies, e.g. they are not able to detect clusters with convex
shapes, the number of the clusters should be a priori known, they
suffer from numerical problems, like sensitiveness to the
initialization, etc. This paper studies the synergistic combination of
the hierarchical and graph theoretic minimal spanning tree based
clustering algorithm with the partitional Gath-Geva fuzzy clustering
algorithm. The aim of this hybridization is to increase the robustness
and consistency of the clustering results and to decrease the number
of the heuristically defined parameters of these algorithms to
decrease the influence of the user on the clustering results. For the
analysis of the resulted fuzzy clusters a new fuzzy similarity measure
based tool has been presented. The calculated similarities of the
clusters can be used for the hierarchical clustering of the resulted
fuzzy clusters, which information is useful for cluster merging and
for the visualization of the clustering results. As the examples used
for the illustration of the operation of the new algorithm will show,
the proposed algorithm can detect clusters from data with arbitrary
shape and does not suffer from the numerical problems of the
classical Gath-Geva fuzzy clustering algorithm.