Abstract: In this paper, we propose an algorithm to compute
initial cluster centers for K-means clustering. Data in a cell is
partitioned using a cutting plane that divides the cell into two smaller cells.
The plane is perpendicular to the data axis with the highest variance
and is designed to reduce the sum of squared errors of the two cells as
much as possible while keeping the two cells as far apart
as possible. Cells are partitioned one at a time until the number of
cells equals the predefined number of clusters, K. The centers of
the K cells become the initial cluster centers for K-means. The
experimental results suggest that the proposed algorithm is effective
and converges to better clustering results than the random
initialization method. The results also indicate that the proposed
algorithm greatly improves the likelihood that every cluster
contains some data.
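The splitting procedure described above can be sketched roughly as follows. This is a simplified illustration, not the authors' exact method: it scores each candidate cut only by the total sum of squared errors (SSE) of the two resulting cells and omits the cell-separation criterion, and it assumes that the cell with the largest SSE is split next. The function names `centroid`, `sse`, `split_cell`, and `initial_centers` are hypothetical.

```python
from statistics import fmean, pvariance


def centroid(points):
    """Mean of the points along every dimension."""
    return [fmean(column) for column in zip(*points)]


def sse(points):
    """Sum of squared distances from each point to the cell centroid."""
    center = centroid(points)
    return sum((x - c) ** 2 for p in points for x, c in zip(p, center))


def split_cell(points):
    """Cut one cell with a plane perpendicular to its highest-variance axis."""
    dims = range(len(points[0]))
    axis = max(dims, key=lambda d: pvariance([p[d] for p in points]))
    ordered = sorted(points, key=lambda p: p[axis])
    # A plane can only pass between two distinct coordinate values.
    cuts = [i for i in range(1, len(ordered))
            if ordered[i - 1][axis] < ordered[i][axis]]
    # Assumption: score a cut by the total SSE of the two resulting cells.
    best = min(cuts, key=lambda i: sse(ordered[:i]) + sse(ordered[i:]))
    return ordered[:best], ordered[best:]


def initial_centers(points, k):
    """Split cells one at a time until there are k, then return their centers."""
    cells = [list(points)]
    while len(cells) < k:
        # Only cells holding at least two distinct points can be split.
        splittable = [i for i, cell in enumerate(cells)
                      if len(set(map(tuple, cell))) > 1]
        if not splittable:
            break
        # Assumption: the splittable cell with the largest SSE is split next.
        idx = max(splittable, key=lambda i: sse(cells[i]))
        left, right = split_cell(cells[idx])
        cells[idx:idx + 1] = [left, right]
    return [centroid(cell) for cell in cells]
```

The returned centers would then be passed to a standard K-means routine as its starting centroids. Because every cell is non-empty by construction, each initial center starts with at least one data point assigned to it.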
Abstract: In this paper, we present a new and effective image indexing technique that extracts features directly from the DCT domain. The proposed approach is object-based. For each 8×8 block in the DCT domain, a feature vector is extracted. The feature vectors of all blocks of the image are then clustered into groups using a k-means algorithm. Each cluster represents a particular object in the image. After clustering, we select the clusters with the most members. The centroids of the selected clusters are taken as the image feature vectors and indexed into the database. We also propose an approach for using the proposed indexing method in automatic image classification. Experimental results for automatic image classification on a database of 800 images from 8 semantic groups are reported.
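The cluster-selection step can be illustrated with a short sketch. It assumes the per-block feature vectors and their k-means cluster labels have already been computed; the abstract does not say which DCT coefficients form a feature vector or how many clusters are kept, so `features`, `labels`, and `top_m` are hypothetical inputs.

```python
from collections import Counter


def index_vectors(features, labels, top_m):
    """Return the centroids of the top_m most populated clusters."""
    # Clusters ordered by size, largest first.
    largest = [label for label, _ in Counter(labels).most_common(top_m)]
    centroids = []
    for label in largest:
        members = [f for f, l in zip(features, labels) if l == label]
        centroids.append([sum(col) / len(members) for col in zip(*members)])
    return centroids
```

The resulting centroids would serve as the image's index entries; how they are compared at query time is not specified in the abstract.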
Abstract: In the present work, we propose a new technique to
enhance the learning capabilities and reduce the computation
intensity of a competitive learning multi-layered neural network
using the K-means clustering algorithm. The proposed model uses
a multi-layered network architecture with a back propagation learning
mechanism. The K-means algorithm is first applied to the training
dataset to reduce the number of samples to be presented to the neural
network, by automatically selecting an optimal set of samples. The
obtained results demonstrate that the proposed technique performs
exceptionally well in terms of both accuracy and computation time when
applied to the KDD99 dataset, compared to a standard learning
scheme that uses the full dataset.
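One plausible reading of the sample-reduction step is sketched below. The abstract does not state how the "optimal set of samples" is chosen, so this sketch assumes that the training sample closest to each K-means centroid is kept as the representative of its cluster. The helper names and the random seeding of the initial centroids are also assumptions.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns centroids and per-point cluster ids."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = []
    for _ in range(iters):
        assignment = [min(range(k), key=lambda c: dist2(p, centers[c]))
                      for p in points]
        updated = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # An empty cluster keeps its previous centroid.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return centers, assignment


def reduce_training_set(features, labels, k, seed=0):
    """Keep one representative sample (nearest to the centroid) per cluster."""
    centers, assignment = kmeans(features, k, seed=seed)
    kept = []
    for c, center in enumerate(centers):
        members = [i for i, a in enumerate(assignment) if a == c]
        if members:
            kept.append(min(members, key=lambda i: dist2(features[i], center)))
    return [features[i] for i in kept], [labels[i] for i in kept]
```

The reduced feature and label lists would then be used to train the network in place of the full dataset. Here the value of `k` directly controls the size of the training set.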
Abstract: Geographic profiling has successfully assisted investigations of serial crimes. Considering the multi-cluster nature of serial crime spots, we propose a Multi-point Centrography model as a natural extension of Single-point Centrography for geographic profiling. K-means clustering is first performed on the data samples, and Single-point Centrography is then applied to derive a probability distribution for each cluster. Finally, a weighted combination of the distributions is formed to predict the next crime spot. An experimental study on real cases demonstrates the effectiveness of the proposed model.
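The final combination step might look like the sketch below. The abstract gives neither the form of the per-cluster distribution nor the weights, so two assumptions are made here: each cluster is modeled by an isotropic Gaussian centered on its centroid, and each cluster is weighted by its share of the crime sites. The clusters are taken as already produced by K-means, and all function names are hypothetical.

```python
import math


def cluster_model(sites):
    """Centroid and isotropic variance of one cluster of 2-D crime sites."""
    n = len(sites)
    cx = sum(x for x, _ in sites) / n
    cy = sum(y for _, y in sites) / n
    var = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in sites) / (2 * n)
    # Floor the variance so a single-site cluster still has a valid density.
    return (cx, cy), max(var, 1e-6)


def gaussian_density(point, center, var):
    """Isotropic 2-D Gaussian density (an assumed per-cluster distribution)."""
    d2 = (point[0] - center[0]) ** 2 + (point[1] - center[1]) ** 2
    return math.exp(-d2 / (2 * var)) / (2 * math.pi * var)


def next_site_density(point, clusters):
    """Weighted combination of the per-cluster distributions at one location."""
    total = sum(len(sites) for sites in clusters)
    density = 0.0
    for sites in clusters:
        center, var = cluster_model(sites)
        # Assumption: weight each cluster by its share of the crime sites.
        density += (len(sites) / total) * gaussian_density(point, center, var)
    return density
```

Evaluating `next_site_density` over a grid of map coordinates would yield a surface whose peaks mark the most likely areas under these assumptions.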
Abstract: Intelligent systems based on machine learning
techniques, such as classification and clustering, are gaining widespread
popularity in real-world applications. This paper presents work on
developing a software system for predicting crop yield, for example
oil-palm yield, from climate and plantation data. At the core of our
system is a method for unsupervised partitioning of data for finding
spatio-temporal patterns in climate data using kernel methods, which
are well suited to complex data. This work is inspired by
the notion that a non-linear data transformation into some high
dimensional feature space increases the possibility of linear
separability of the patterns in the transformed space. Therefore, it
simplifies exploration of the associated structure in the data. Kernel
methods implicitly perform a non-linear mapping of the input data
into a high dimensional feature space by replacing the inner products
with an appropriate positive definite function. In this paper we
present a robust weighted kernel k-means algorithm incorporating
spatial constraints for clustering the data. The proposed algorithm
can effectively handle noise, outliers and auto-correlation in the
spatial data, for effective and efficient data analysis by exploring
patterns and structures in the data, and thus can be used for
predicting oil-palm yield by analyzing various factors affecting the
yield.
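The implicit mapping mentioned above can be made concrete with a small sketch. The squared distance between a mapped point and a cluster mean in feature space can be written entirely in terms of kernel evaluations, so the mapping itself is never computed. The sketch covers only this unweighted distance; the weighting and spatial-constraint terms of the proposed algorithm are not reproduced, and the RBF kernel with `gamma=1.0` is an assumed choice.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel, a positive definite function (assumed choice)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def kernel_dist2(x, members, kernel=rbf_kernel):
    """Squared feature-space distance from phi(x) to the mean of phi(members).

    Expands ||phi(x) - mean||^2 into inner products and replaces each
    inner product with a kernel evaluation.
    """
    n = len(members)
    cross = sum(kernel(x, m) for m in members) / n
    within = sum(kernel(a, b) for a in members for b in members) / (n * n)
    return kernel(x, x) - 2.0 * cross + within


def assign(x, clusters, kernel=rbf_kernel):
    """Index of the cluster whose feature-space mean is closest to phi(x)."""
    return min(range(len(clusters)),
               key=lambda c: kernel_dist2(x, clusters[c], kernel))
```

With a linear kernel (the plain dot product) `kernel_dist2` reduces to the ordinary squared Euclidean distance to the cluster mean, which is a convenient sanity check.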
Abstract: The paper presents a complete discrete statistical framework based on a novel vector quantization (VQ) front-end process. This new VQ approach performs an optimal distribution of VQ codebook components over HMM states. The technique, which we call distributed vector quantization (DVQ) of hidden Markov models, succeeds in unifying acoustic micro-structure and phonetic macro-structure when the HMM parameters are estimated. The DVQ technique is implemented in two variants. The first variant uses the K-means algorithm (K-means-DVQ) to optimize the VQ, while the second variant exploits the classification behavior of neural networks (NN-DVQ) for the same purpose. The proposed variants are compared with the HMM-based baseline system in experiments on the recognition of specific Arabic consonants. The results show that the distributed vector quantization technique increases the performance of the discrete HMM system.
Abstract: Clustering is a well-known technique in data mining. One of the most widely used clustering techniques is the K-means algorithm. Solutions obtained with this technique depend on the initialization of the cluster centers, and the final solution may converge to a local minimum. To overcome these shortcomings, this paper proposes a hybrid evolutionary algorithm, called PSO-SA-K, that combines particle swarm optimization (PSO), simulated annealing (SA), and K-means and can find better cluster partitions. Its performance is evaluated on several benchmark data sets. The simulation results show that the proposed algorithm outperforms previous approaches, such as PSO, SA, and K-means, on partitional clustering problems.
Abstract: Document clustering has become an essential technology
with the popularity of the Internet. Fast, high-quality document
clustering techniques therefore play a central role. Text
clustering, or simply clustering, is about discovering semantically
related groups in an unstructured collection of documents. Clustering
has been very popular for a long time because it provides unique
ways of digesting and generalizing large amounts of information.
One of the issues in clustering is extracting the proper features (concepts)
of a problem domain. The existing clustering technology mainly
focuses on term weight calculation. To achieve more accurate
document clustering, more informative features including concept
weight are important. Feature selection is important for the clustering
process because irrelevant or redundant features may
mislead the clustering results. To address this issue, the proposed
system introduces a concept weight for a text clustering system
developed based on a k-means algorithm in accordance with the
principles of ontology, so that the importance of the words in a cluster can
be identified by the weight values. To a certain extent, this resolves
the semantic problem in specific areas.
Abstract: Predicting software quality during the development life cycle of a software project helps the development organization make efficient use of available resources to produce a product of the highest quality. A "whether a module is faulty or not" approach can be used to predict the quality of a software module. A number of software quality prediction models described in the literature are based on genetic algorithms, artificial neural networks, and other data mining algorithms. One promising approach to quality prediction is based on clustering techniques. Most quality prediction models based on clustering use K-means, Mixture-of-Gaussians, Self-Organizing Map, Neural Gas, or fuzzy K-means algorithms for prediction. All of these techniques require a predefined structure; that is, the number of neurons or clusters must be known before the clustering process starts. Growing Neural Gas, in contrast, does not require the number of neurons or the topology of the structure to be determined in advance: it starts with a minimal structure of neurons that grows during training until it reaches a user-defined maximum number of clusters. Hence, in this work we use Growing Neural Gas as the underlying clustering algorithm. It produces an initial set of labeled clusters from the training data set, and this set of clusters is then used to predict the quality of the software modules in the test data set. The best test result shows 80% accuracy in evaluating the quality of software modules. Hence, the proposed technique can be used by programmers to evaluate the quality of modules during software development.