Abstract: With the continuous increment of smart meter installations across the globe, the need for processing of the load data is evident. Clustering-based load profiling is built upon the utilization of unsupervised machine learning tools for the purpose of formulating the typical load curves or load profiles. The most commonly used algorithm in the load profiling literature is the K-means. While the algorithm has been successfully tested in a variety of applications, its drawback is the strong dependence in the initialization phase. This paper proposes a novel modified form of the K-means that addresses the aforementioned problem. Simulation results indicate the superiority of the proposed algorithm compared to the K-means.
Abstract: Data mining technique used in the field of clustering is a subject of active research and assists in biological pattern recognition and extraction of new knowledge from raw data. Clustering means the act of partitioning an unlabeled dataset into groups of similar objects. Each group, called a cluster, consists of objects that are similar between themselves and dissimilar to objects of other groups. Several clustering methods are based on partitional clustering. This category attempts to directly decompose the dataset into a set of disjoint clusters leading to an integer number of clusters that optimizes a given criterion function. The criterion function may emphasize a local or a global structure of the data, and its optimization is an iterative relocation procedure. The K-Means algorithm is one of the most widely used partitional clustering techniques. Since K-Means is extremely sensitive to the initial choice of centers and a poor choice of centers may lead to a local optimum that is quite inferior to the global optimum, we propose a strategy to initiate K-Means centers. The improved K-Means algorithm is compared with the original K-Means, and the results prove how the efficiency has been significantly improved.
Abstract: In this paper, we propose labeling based RANSAC algorithm for lane detection. Advanced driver assistance systems (ADAS) have been widely researched to avoid unexpected accidents. Lane detection is a necessary system to assist keeping lane and lane departure prevention. The proposed vision based lane detection method applies Canny edge detection, inverse perspective mapping (IPM), K-means algorithm, mathematical morphology operations and 8 connected-component labeling. Next, random samples are selected from each labeling region for RANSAC. The sampling method selects the points of lane with a high probability. Finally, lane parameters of straight line or curve equations are estimated. Through the simulations tested on video recorded at daytime and nighttime, we show that the proposed method has better performance than the existing RANSAC algorithm in various environments.
Abstract: The aim of this work is the parallel implementation
of k-means in MATLAB, in order to reduce the execution time.
Specifically, a new function in MATLAB for serial k-means algorithm
is developed, which meets all the requirements for the conversion to a
function in MATLAB with parallel computations. Additionally, two
different variants for the definition of initial values are presented.
In the sequel, the parallel approach is presented. Finally, the
performance tests for the computation times respect to the numbers
of features and classes are illustrated.
Abstract: Matching high dimensional features between images is computationally expensive for exhaustive search approaches in computer vision. Although the dimension of the feature can be degraded by simplifying the prior knowledge of homography, matching accuracy may degrade as a tradeoff. In this paper, we present a feature matching method based on k-means algorithm that reduces the matching cost and matches the features between images instead of using a simplified geometric assumption. Experimental results show that the proposed method outperforms the previous linear exhaustive search approaches in terms of the inlier ratio of matched pairs.
Abstract: Clustering is a well known data mining technique used in pattern recognition and information retrieval. The initial dataset to be clustered can either contain categorical or numeric data. Each type of data has its own specific clustering algorithm. In this context, two algorithms are proposed: the k-means for clustering numeric datasets and the k-modes for categorical datasets. The main encountered problem in data mining applications is clustering categorical dataset so relevant in the datasets. One main issue to achieve the clustering process on categorical values is to transform the categorical attributes into numeric measures and directly apply the k-means algorithm instead the k-modes. In this paper, it is proposed to experiment an approach based on the previous issue by transforming the categorical values into numeric ones using the relative frequency of each modality in the attributes. The proposed approach is compared with a previously method based on transforming the categorical datasets into binary values. The scalability and accuracy of the two methods are experimented. The obtained results show that our proposed method outperforms the binary method in all cases.
Abstract: The traditional k-means algorithm has been widely used as a simple and efficient clustering method. However, the algorithm often converges to local minima for the reason that it is sensitive to the initial cluster centers. In this paper, an algorithm for selecting initial cluster centers on the basis of minimum spanning tree (MST) is presented. The set of vertices in MST with same degree are regarded as a whole which is used to find the skeleton data points. Furthermore, a distance measure between the skeleton data points with consideration of degree and Euclidean distance is presented. Finally, MST-based initialization method for the k-means algorithm is presented, and the corresponding time complexity is analyzed as well. The presented algorithm is tested on five data sets from the UCI Machine Learning Repository. The experimental results illustrate the effectiveness of the presented algorithm compared to three existing initialization methods.
Abstract: Given the increase in the number of e-commerce sites,
the number of competitors has become very important. This means
that companies have to take appropriate decisions in order to meet the
expectations of their customers and satisfy their needs. In this paper,
we present a case study of applying LRFM (length, recency,
frequency and monetary) model and clustering techniques in the
sector of electronic commerce with a view to evaluating customers’
values of the Moroccan e-commerce websites and then developing
effective marketing strategies. To achieve these objectives, we adopt
LRFM model by applying a two-stage clustering method. In the first
stage, the self-organizing maps method is used to determine the best
number of clusters and the initial centroid. In the second stage, kmeans
method is applied to segment 730 customers into nine clusters
according to their L, R, F and M values. The results show that the
cluster 6 is the most important cluster because the average values of
L, R, F and M are higher than the overall average value. In addition,
this study has considered another variable that describes the mode of
payment used by customers to improve and strengthen clusters’
analysis. The clusters’ analysis demonstrates that the payment method is
one of the key indicators of a new index which allows to assess the
level of customers’ confidence in the company's Website.
Abstract: Microarrays are made it possible to simultaneously monitor the expression profiles of thousands of genes under various experimental conditions. It is used to identify the co-expressed genes in specific cells or tissues that are actively used to make proteins. This method is used to analysis the gene expression, an important task in bioinformatics research. Cluster analysis of gene expression data has proved to be a useful tool for identifying co-expressed genes, biologically relevant groupings of genes and samples. In this work K-Means algorithms has been applied for clustering of Gene Expression Data. Further, rough set based Quick reduct algorithm has been applied for each cluster in order to select the most similar genes having high correlation. Then the ACV measure is used to evaluate the refined clusters and classification is used to evaluate the proposed method. They could identify compact clusters with feature selection method used to genes are selected.
Abstract: Feature selection is a process to select features which are more informative. It is one of the important steps in knowledge discovery. The problem is that all genes are not important in gene expression data. Some of the genes may be redundant, and others may be irrelevant and noisy. Here a novel approach is proposed Hybrid K-Mean-Quick Reduct (KMQR) algorithm for gene selection from gene expression data. In this study, the entire dataset is divided into clusters by applying K-Means algorithm. Each cluster contains similar genes. The high class discriminated genes has been selected based on their degree of dependence by applying Quick Reduct algorithm to all the clusters. Average Correlation Value (ACV) is calculated for the high class discriminated genes. The clusters which have the ACV value as 1 is determined as significant clusters, whose classification accuracy will be equal or high when comparing to the accuracy of the entire dataset. The proposed algorithm is evaluated using WEKA classifiers and compared. The proposed work shows that the high classification accuracy.
Abstract: Clustering is the process of subdividing an input data set into a desired number of subgroups so that members of the same subgroup are similar and members of different subgroups have diverse properties. Many heuristic algorithms have been applied to the clustering problem, which is known to be NP Hard. Genetic algorithms have been used in a wide variety of fields to perform clustering, however, the technique normally has a long running time in terms of input set size. This paper proposes an efficient genetic algorithm for clustering on very large data sets, especially on image data sets. The genetic algorithm uses the most time efficient techniques along with preprocessing of the input data set. We test our algorithm on both artificial and real image data sets, both of which are of large size. The experimental results show that our algorithm outperforms the k-means algorithm in terms of running time as well as the quality of the clustering.
Abstract: Color image segmentation can be considered as a
cluster procedure in feature space. k-means and its adaptive
version, i.e. competitive learning approach are powerful tools
for data clustering. But k-means and competitive learning suffer
from several drawbacks such as dead-unit problem and need to
pre-specify number of cluster. In this paper, we will explore to
use competitive and cooperative learning approach to perform
color image segmentation. In competitive and cooperative
learning approach, seed points not only compete each other, but
also the winner will dynamically select several nearest
competitors to form a cooperative team to adapt to the input
together, finally it can automatically select the correct number
of cluster and avoid the dead-units problem. Experimental
results show that CCL can obtain better segmentation result.
Abstract: Self-organizing map (SOM) is a well known data reduction technique used in data mining. Data visualization can reveal structure in data sets that is otherwise hard to detect from raw data alone. However, interpretation through visual inspection is prone to errors and can be very tedious. There are several techniques for the automatic detection of clusters of code vectors found by SOMs, but they generally do not take into account the distribution of code vectors; this may lead to unsatisfactory clustering and poor definition of cluster boundaries, particularly where the density of data points is low. In this paper, we propose the use of a generic particle swarm optimization (PSO) algorithm for finding cluster boundaries directly from the code vectors obtained from SOMs. The application of our method to unlabeled call data for a mobile phone operator demonstrates its feasibility. PSO algorithm utilizes U-matrix of SOMs to determine cluster boundaries; the results of this novel automatic method correspond well to boundary detection through visual inspection of code vectors and k-means algorithm.
Abstract: Clustering techniques have received attention in many areas including engineering, medicine, biology and data mining. The purpose of clustering is to group together data points, which are close to one another. The K-means algorithm is one of the most widely used techniques for clustering. However, K-means has two shortcomings: dependency on the initial state and convergence to local optima and global solutions of large problems cannot found with reasonable amount of computation effort. In order to overcome local optima problem lots of studies done in clustering. This paper is presented an efficient hybrid evolutionary optimization algorithm based on combining Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO), called PSO-ACO, for optimally clustering N object into K clusters. The new PSO-ACO algorithm is tested on several data sets, and its performance is compared with those of ACO, PSO and K-means clustering. The simulation results show that the proposed evolutionary optimization algorithm is robust and suitable for handing data clustering.
Abstract: The goal of this paper is to segment the countries
based on the value of export from Iran during 14 years ending at 2005. To measure the dissimilarity among export baskets of different countries, we define Dissimilarity Export Basket (DEB) function and
use this distance function in K-means algorithm. The DEB function
is defined based on the concepts of the association rules and the
value of export group-commodities. In this paper, clustering quality
function and clusters intraclass inertia are defined to, respectively,
calculate the optimum number of clusters and to compare the
functionality of DEB versus Euclidean distance. We have also study
the effects of importance weight in DEB function to improve
clustering quality. Lastly when segmentation is completed, a
designated RFM model is used to analyze the relative profitability of
each cluster.
Abstract: Artificial Bee Colony (ABC) algorithm is a relatively new swarm intelligence technique for clustering. It produces higher
quality clusters compared to other population-based algorithms but with poor energy efficiency, cluster quality consistency and typically slower in convergence speed. Inspired by energy saving foraging behavior of natural honey bees this paper presents a Quality and Quantity Aware Artificial Bee Colony (Q2ABC) algorithm to improve quality of cluster identification, energy efficiency and convergence speed of the original ABC. To evaluate the performance of Q2ABC algorithm, experiments were conducted on a suite of ten benchmark UCI datasets. The results demonstrate Q2ABC outperformed ABC and K-means algorithm in the quality of clusters delivered.
Abstract: This study aims to segment objects using the K-means
algorithm for texture features. Firstly, the algorithm transforms color
images into gray images. This paper describes a novel technique for
the extraction of texture features in an image. Then, in a group of
similar features, objects and backgrounds are differentiated by using
the K-means algorithm. Finally, this paper proposes a new object
segmentation algorithm using the morphological technique. The
experiments described include the segmentation of single and multiple
objects featured in this paper. The region of an object can be
accurately segmented out. The results can help to perform image
retrieval and analyze features of an object, as are shown in this paper.
Abstract: In this paper we present a new approach to deal with
image segmentation. The fact that a single segmentation result do not
generally allow a higher level process to take into account all the
elements included in the image has motivated the consideration of
image segmentation as a multiobjective optimization problem. The
proposed algorithm adopts a split/merge strategy that uses the result
of the k-means algorithm as input for a quantum evolutionary
algorithm to establish a set of non-dominated solutions. The
evaluation is made simultaneously according to two distinct features:
intra-region homogeneity and inter-region heterogeneity. The
experimentation of the new approach on natural images has proved
its efficiency and usefulness.
Abstract: Clustering is a very well known technique in data mining. One of the most widely used clustering techniques is the k-means algorithm. Solutions obtained from this technique are dependent on the initialization of cluster centers. In this article we propose a new algorithm to initialize the clusters. The proposed algorithm is based on finding a set of medians extracted from a dimension with maximum variance. The algorithm has been applied to different data sets and good results are obtained.
Abstract: Color Image quantization (CQ) is an important
problem in computer graphics, image and processing. The aim of
quantization is to reduce colors in an image with minimum distortion.
Clustering is a widely used technique for color quantization; all
colors in an image are grouped to small clusters. In this paper, we
proposed a new hybrid approach for color quantization using firefly
algorithm (FA) and K-means algorithm. Firefly algorithm is a swarmbased
algorithm that can be used for solving optimization problems.
The proposed method can overcome the drawbacks of both
algorithms such as the local optima converge problem in K-means
and the early converge of firefly algorithm. Experiments on three
commonly used images and the comparison results shows that the
proposed algorithm surpasses both the base-line technique k-means
clustering and original firefly algorithm.