Abstract: Clustering is a data mining technique under active research that assists in biological pattern recognition and the extraction of new knowledge from raw data. Clustering partitions an unlabeled dataset into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects in other groups. Several clustering methods are partitional. This category attempts to decompose the dataset directly into a set of disjoint clusters, yielding an integer number of clusters that optimizes a given criterion function. The criterion function may emphasize a local or a global structure of the data, and it is optimized by an iterative relocation procedure. The K-Means algorithm is one of the most widely used partitional clustering techniques. Since K-Means is extremely sensitive to the initial choice of centers, and a poor choice may lead to a local optimum that is far inferior to the global optimum, we propose a strategy for initializing the K-Means centers. The improved K-Means algorithm is compared with the original K-Means, and the results show that efficiency is significantly improved.
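The abstract does not describe the proposed initialization strategy, so the following is only a minimal sketch of the sensitivity it refers to. It runs plain K-Means (Lloyd iterations) from two different sets of initial centers. The farthest-point seeding shown here is a stand-in assumption, not the authors' method, and the names `kmeans`, `farthest_point_seeds`, and the toy data are hypothetical.

```python
import math

def farthest_point_seeds(points, k):
    """Assumed heuristic: start at the first point, then repeatedly add
    the point that is farthest from all seeds chosen so far."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds

def kmeans(points, centers, max_iter=100):
    """Plain K-Means from the given initial centers.
    Returns the final centers and the sum of squared errors (SSE)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
    return centers, sse

# Toy data: two tight pairs of points that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

# Both initial centers inside the same pair: stuck in a local optimum.
_, sse_poor = kmeans(points, [(0, 0), (0, 1)])
# One seed per pair: reaches the better partition.
_, sse_seeded = kmeans(points, farthest_point_seeds(points, 2))

print(sse_poor)    # 100.0
print(sse_seeded)  # 1.0
```

On this toy input the poorly seeded run converges to centers at (5, 0) and (5, 1), while the seeded run separates the two pairs, which is the kind of gap between a local and a global optimum the abstract mentions.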
Abstract: Clustering has been an intensively researched topic for some
years because of its many applications in fields such as biology,
information retrieval, medicine, and business. Expectation
maximization (EM) is an algorithmic framework for clustering methods
and is regarded as one of the top ten algorithms of machine learning.
Traditionally, optimization of an objective function has been the
standard approach in EM. Hence, research has investigated the utility
of evolutionary computing and related techniques in this regard.
Chemical Reaction Optimization (CRO) is a recently established method
whose built-in properties can be used to solve optimization problems.
This paper presents an algorithm framework (EM-CRO) with modified CRO
operators for EM clustering problems. The hybrid algorithm mainly
addresses the sensitivity to initial values of clustering algorithms
based on objective function optimization. Our experiments take two
classic EM-type algorithms, k-means and fuzzy k-means, as examples and
use CRO to optimize their initial values, yielding the K-means-CRO and
FKM-CRO algorithms. The experimental results show improved efficiency
in solving objective function optimization clustering problems.
Abstract: In this paper, an attribute weighting method called fuzzy C-means clustering based attribute weighting (FCMAW) is used for classification of a diabetes dataset. The aims of this study are to reduce the variance within the attributes of the diabetes dataset and to improve the classification accuracy of the classifier algorithm by transforming a non-linearly separable dataset into a linearly separable one. The Pima Indians Diabetes dataset has two classes: normal subjects (500 instances) and diabetes subjects (268 instances). Fuzzy C-means clustering is an improved version of the K-means clustering method and is one of the most used clustering methods in data mining and machine learning applications. In the first stage of this study, fuzzy C-means clustering is used to find the centers of the attributes in the Pima Indians Diabetes dataset, and the dataset is then weighted according to the ratios of the attribute means to their centers. In the second stage, after the weighting process, support vector machine (SVM) and k-NN (k-nearest neighbor) classifiers are used to classify the weighted Pima Indians Diabetes dataset. Experimental results show that the proposed attribute weighting method (FCMAW) obtains very promising results in the classification of the Pima Indians Diabetes dataset.
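The weighting step can be read in more than one way. The sketch below assumes the simplest reading: one center value per attribute is already available (for example from a fuzzy C-means run), each attribute's weight is its mean divided by its center, and every value of that attribute is multiplied by the weight. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
def attribute_weights(rows, centers):
    """Weight for attribute j = mean of attribute j / center of attribute j.
    `centers` is assumed to hold one precomputed center per attribute."""
    n = len(rows)
    means = [sum(row[j] for row in rows) / n for j in range(len(centers))]
    return [mean / center for mean, center in zip(means, centers)]

def apply_weights(rows, weights):
    """Multiply every attribute value by the weight of its attribute."""
    return [[value * w for value, w in zip(row, weights)] for row in rows]

# Toy data: two instances with two attributes, and assumed centers.
rows = [[2.0, 10.0], [4.0, 30.0]]
centers = [1.5, 40.0]

weights = attribute_weights(rows, centers)
weighted_rows = apply_weights(rows, weights)
print(weights)        # [2.0, 0.5]
print(weighted_rows)  # [[4.0, 5.0], [8.0, 15.0]]
```

The weighted rows would then be passed to a classifier such as SVM or k-NN in place of the raw attributes.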
Abstract: Stock investment decisions are often made based on current events in the global economy and the analysis of historical data. In addition, visual representation could help investors gain a deeper understanding of, and better insight into, stock market trends more efficiently. The trend analysis is based on long-term data collection. The study adopts a hybrid method that combines a clustering algorithm and a force-directed algorithm to overcome the scalability problem when visualizing large data. This method illustrates the potential relationships among stocks and determines their degree of strength and connectivity, which gives investors another view of stock relationships for reference. Information derived from the visualization will also help them make informed decisions. The results of the experiments show that the proposed method is able to produce aesthetically pleasing visualizations with clearer views of connectivity and edge weights.
Abstract: Rough set theory is used to handle uncertainty and incomplete information by means of two exact sets, the lower approximation and the upper approximation. In this paper, rough clustering algorithms are improved by adopting similarity, dissimilarity-similarity, and entropy based initial centroid selection methods. Three clustering algorithms, namely Entropy based Rough K-Means (ERKM), Similarity based Rough K-Means (SRKM) and Dissimilarity-Similarity based Rough K-Means (DSRKM), were developed and run on the yeast dataset. The rough clustering algorithms are validated with the Rand and Adjusted Rand cluster validity indexes. The experimental results show that the ERKM clustering algorithm performs effectively and delivers better results than the other clustering methods. Outlier detection is an important task in data mining; outliers are objects that differ markedly from the rest of the objects in the clusters. The Entropy based Rough Outlier Factor (EROF) method appears to detect outliers effectively for the yeast dataset. In the rough K-Means (RKM) method, tuning the epsilon (ε) value from 0.8 to 1.08 makes it possible to detect outliers in the boundary region, and the RKM algorithm delivers better results when epsilon (ε) is chosen in this range. The experimental results show that the EROF method performs very well with the clustering algorithm and is suitable for detecting outliers effectively in all datasets. Further, the experimental readings show that the ERKM clustering method outperforms the other methods.
Abstract: Wireless Sensor Networks consist of inexpensive, low-power sensor nodes deployed to monitor the environment and collect data. Gathering information in an energy-efficient manner is critical to prolonging the network lifetime, and clustering algorithms have the advantage of enhancing it. Current clustering algorithms usually treat global re-clustering and local re-clustering separately. This paper proposes a combination of these two re-clustering methods to reduce the energy consumption of the network. Furthermore, the proposed algorithm can be applied to homogeneous as well as heterogeneous wireless sensor networks. In addition, cluster head rotation happens only when the cluster head's energy drops below a dynamic threshold value computed by the algorithm. The simulation results show that the proposed algorithm prolongs the network lifetime compared to existing algorithms.
Abstract: Much research has been carried out on the analysis of
traces in digital learning environments. These studies produce large
volumes of usage traces from the various actions performed by a
user. However, exploiting these data and comparing and improving
performance raise several issues, which a number of recent works
have addressed. This research studies a series of questions about the
format and description of the data to be shared. Our goal is to share
thoughts on these issues by presenting our experience in the analysis
of trace-based log files and by comparing several automatic
classification approaches applied to e-learning platforms.
Finally, the obtained results are discussed.
Abstract: Understanding customer behavior in a grocery store has
been a long-standing issue in the retailing industry. The advent of
RFID has made it easier to collect movement data on an individual
shopper's behavior. Most previous studies used traditional
statistical clustering techniques to find the major characteristics of
customer behavior, especially the shopping path. However, because of
various spatial constraints in the store, standard clustering methods
are not feasible: movement data such as the shopping path must be
adjusted before the analysis, which is time-consuming and causes data
distortion. To alleviate this problem, we propose a new approach to
spatial pattern clustering based on the longest common subsequence.
Experimental results using real data obtained from a grocery store
confirm the good performance of the proposed method in finding the
hot spots, dead spots and major path patterns of customer movements.
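As an illustration of the building block named above, the following sketch computes the longest common subsequence (LCS) of two shopping paths and turns it into a similarity score. Representing a path as a sequence of zone labels, and normalizing by the longer path, are assumptions made here for illustration; the abstract does not specify either choice.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b,
    computed by dynamic programming with one row of state."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def path_similarity(a, b):
    """Assumed normalization: LCS length divided by the longer path length."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

# Hypothetical zone labels for two shoppers.
path_a = ["entrance", "produce", "dairy", "bakery", "checkout"]
path_b = ["entrance", "dairy", "snacks", "bakery", "checkout"]

print(lcs_length(path_a, path_b))       # 4
print(path_similarity(path_a, path_b))  # 0.8
```

Because the LCS tolerates skipped zones while preserving visiting order, two paths of different lengths can be compared without resampling them first, which is the kind of pre-adjustment the abstract describes as a source of distortion.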
Abstract: Microarrays have become effective, broadly used tools in biological and medical research for addressing a wide range of problems, including the classification of disease subtypes and tumors. Many statistical methods are available for analyzing these complex data and organizing them into meaningful information, and one of the main goals in analyzing gene expression data is the detection of samples or genes with similar expression patterns. In this paper, we present and compare the performance of several clustering methods under data preprocessing strategies that include normalization and noise cleaning. We also evaluate each of these clustering methods with validation measures on both simulated data and real gene expression data. The results indicate that clustering methods commonly used in microarray data analysis are affected by normalization and by the degree of noise in the datasets.
Abstract: Image clustering is the process of grouping images
based on their similarity. Image clustering usually uses color,
texture, edge, or shape features, or a mixture of these components.
This research aims to explore image clustering using color
composition. To perform this image clustering, three main
components should be considered: the color space, the image
representation (feature extraction), and the clustering method itself.
We aim to find which combination of these factors produces the
best clustering results by combining various techniques from the
three components. The color spaces used are RGB, HSV, and L*a*b*.
The image representations are Histogram and Gaussian
Mixture Model (GMM), whereas the clustering methods are K-Means
and the Agglomerative Hierarchical Clustering algorithm. The
results of the experiment show that the GMM representation works
better with the RGB and L*a*b* color spaces, whereas Histogram works
better with HSV. The experiments also show that K-Means
is better than Agglomerative Hierarchical clustering for image clustering.
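One of the combinations mentioned above, a histogram representation in the HSV color space, can be sketched as follows. This is a minimal illustration, assuming an image is given as a list of 8-bit RGB pixel tuples and that only the hue channel is binned. The bin count, the `hue_histogram` and `l1_distance` helpers, and the toy pixels are hypothetical, not the paper's configuration.

```python
import colorsys

def hue_histogram(pixels, bins=6):
    """Normalized histogram of hue values for a list of (r, g, b) pixels
    with 8-bit channels."""
    hist = [0] * bins
    for r, g, b in pixels:
        h, _s, _v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        # h lies in [0, 1); clamp the index as a guard against rounding.
        hist[min(int(h * bins), bins - 1)] += 1
    return [count / len(pixels) for count in hist]

def l1_distance(p, q):
    """Sum of absolute differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Toy "images": mostly orange, mostly orange again, and all blue.
img_a = [(255, 128, 0)] * 3 + [(0, 128, 255)]
img_b = [(255, 0, 0), (255, 128, 0), (200, 100, 0), (0, 128, 255)]
img_c = [(0, 128, 255)] * 4

hist_a, hist_b, hist_c = (hue_histogram(img) for img in (img_a, img_b, img_c))
print(hist_a)                       # [0.75, 0.0, 0.0, 0.25, 0.0, 0.0]
print(l1_distance(hist_a, hist_b))  # 0.0
print(l1_distance(hist_a, hist_c))  # 1.5
```

The resulting feature vectors could then be fed to K-Means or an agglomerative method; the two warm-toned images end up much closer to each other than to the blue one.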
Abstract: Methods for organizing web data into groups in order
to analyze web-based hypertext data and facilitate data availability
are very important given the number of documents available
online. The task of clustering web-based document structures
therefore has many applications, e.g., improving information
retrieval on the web, better understanding user navigation behavior,
improving the servicing of web users' requests, and increasing web
information accessibility. In this paper we investigate a new approach
for clustering web-based hypertexts on the basis of their graph
structures. The hypertexts are represented as so-called generalized
trees, which are more general than the usual directed rooted trees,
e.g., DOM trees. As an important preprocessing step, we measure the
structural similarity between the generalized trees on the basis of a
similarity measure d. Then we apply agglomerative clustering to the
obtained similarity matrix in order to create clusters of hypertext
graph patterns representing navigation structures. In the present
paper we run our approach on a data set of hypertext structures and
obtain good results in Web Structure Mining. Furthermore, we outline
the application of our approach in Web Usage Mining as future work.
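The clustering step applied to the similarity matrix can be sketched as below. The abstract does not state the linkage criterion or the stopping rule, so average linkage and a similarity threshold are assumptions here. The similarity measure d for generalized trees is not reproduced, and the matrix values are made up for illustration.

```python
def agglomerative(sim, threshold):
    """Average-linkage agglomerative clustering on a symmetric
    similarity matrix. Merging stops when the best average similarity
    between two clusters falls below `threshold` (an assumed rule)."""
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best, (x, y) = max(
            (avg_sim(clusters[x], clusters[y]), (x, y))
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical pairwise similarities between four hypertext graphs.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

print(agglomerative(sim, threshold=0.5))  # [[0, 1], [2, 3]]
```

Setting the threshold to 0 would merge everything into a single cluster, so in practice the cut level decides how many navigation-structure patterns are reported.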
Abstract: Understanding the cell's large-scale organization is an interesting task in computational biology, and protein-protein interactions can reveal important aspects of the organization and function of the cell. Here, we investigated the correspondence between protein interactions and function in yeast. We obtained the correlations among the set of proteins and then clustered these correlations using both hierarchical and biclustering methods. Detailed analyses of the proteins in each cluster were carried out using their functional annotations. As a result, we found that some functional classes appear together in almost all biclusters. In hierarchical clustering, on the other hand, the dominance of one functional class is observed. In the light of the clustering data, we have verified some interactions which were not identified as core interactions in DIP, and we have also characterized some functionally unknown proteins according to the interaction data and functional correlation. In brief, going from interaction data to function, we observe correlated results on the relationship between interaction and function, which might give clues about the organization of the proteins and might also help to predict new interactions and to characterize the functions of unknown proteins.
Abstract: This work presents a neural network model for the
clustering analysis of data based on Self Organizing Maps (SOM).
The model evolves during the training stage towards a hierarchical
structure according to the input requirements. The hierarchical structure
acts as a specialization tool that provides refinements of the
classification process. The structure behaves like a single map with
different resolutions depending on the region to analyze. The benefits
and performance of the algorithm are discussed in application to the
Iris dataset, a classical example for pattern recognition.
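For orientation, a single training step of a basic one-dimensional SOM is sketched below. It covers only the standard best-matching-unit search and neighborhood update; the hierarchical growth described in the abstract is not reproduced. The Gaussian neighborhood, the learning rate, and the toy weights are assumptions.

```python
import math

def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to input x."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))

def som_step(weights, x, lr=0.5, radius=1.0):
    """One update of a 1-D SOM: every unit moves toward x, scaled by a
    Gaussian neighborhood centered on the best matching unit (BMU)."""
    bmu = best_matching_unit(weights, x)
    updated = []
    for i, w in enumerate(weights):
        h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
        updated.append([wi + lr * h * (xi - wi) for wi, xi in zip(w, x)])
    return bmu, updated

# Three units on a line, and one 2-D input sample.
weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
sample = [1.2, 0.8]

bmu, new_weights = som_step(weights, sample)
print(bmu)  # 1
```

After the step, the BMU has moved halfway toward the sample and its neighbors have moved by a smaller fraction, which is what lets nearby units specialize on nearby regions of the input space.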
Abstract: With the deepening development of software reuse,
component-related technologies have been widely applied in the
development of large-scale complex applications. Component
identification (CI) is one of the primary research problems in
software reuse: it analyzes domain business models to obtain a set of
business components with high reuse value and good reuse performance
in order to support effective reuse. Based on the concept and
classification of CI, its technical stack is briefly discussed from
four views, i.e., the form of the input business models,
identification goals, identification strategies, and the
identification process. Then the various CI methods presented in the
literature are classified into four types, i.e., domain analysis based
methods, cohesion-coupling based clustering methods, CRUD matrix based
methods, and other methods, and their advantages and disadvantages
are compared. Additionally, some shortcomings of current research on
CI are discussed and their causes are explained. Finally, the paper
concludes with some promising research trends on this problem.
Abstract: This paper develops a quality estimation method based on
fuzzy hierarchical clustering. Quality estimation is essential to
quality control and quality improvement, as a precise estimate
supports correct decision-making and thus better quality control.
Normally, the quality of finished products in a manufacturing system
can be differentiated by quality standards. In real-life situations,
the collected data may be vague and hard to classify, and they are
usually represented as fuzzy numbers. Estimating the quality of a
product represented by fuzzy numbers is not easy. In this research,
trapezoidal fuzzy numbers are collected in the manufacturing process,
and the collected data are classified into different clusters to
obtain the estimate. Since normal hierarchical clustering methods can
only be applied to real numbers, fuzzy hierarchical clustering is
selected to handle this problem based on quality standards.
Abstract: Clustering techniques have been used by many intelligent software agents to group similar access patterns of Web users into high-level themes that express users' intentions and interests. However, such techniques have mostly focused on one salient feature of the Web documents visited by the user, namely the extracted keywords. The major aim of these techniques is to find an optimal threshold for the number of keywords needed to produce more focused themes. In this paper we consider both keyword and similarity thresholds to generate more concentrated themes, and hence build a sounder model of user behavior. The purpose of this paper is twofold: to use distance-based clustering methods to recognize overall themes from the proxy log file, and to suggest efficient cut-off levels for the keyword and similarity thresholds that tend to produce better clusters with sharper focus and efficient size.
Abstract: In this paper, we propose an energy-efficient cluster-based
communication protocol for wireless sensor networks. Our
protocol considers both the residual energy of sensor nodes and the
distance of each node from the base station (BS) when selecting a
cluster head. This protocol can successfully prolong the network's
lifetime by 1) reducing the total energy dissipation in the network
and 2) evenly distributing energy consumption over all sensor nodes.
In this protocol, nodes with more energy and a shorter distance from
the BS are more likely to be selected as cluster heads. Simulation
results with MATLAB show that the proposed protocol can increase the
network lifetime by more than 94% for the first node dies (FND)
metric and by more than 6% for the half of the nodes alive (HNA)
metric, as compared with conventional protocols.
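The selection criterion described above can be illustrated with a simple scoring rule. The abstract gives no formula, so the weighted sum below, the `alpha` parameter, and the deterministic choice of the highest-scoring node are all assumptions; the actual protocol may well select heads probabilistically.

```python
import math

def head_score(node, bs, e_max, d_max, alpha=0.5):
    """Assumed score: higher residual energy and shorter distance to the
    base station both raise the score. alpha balances the two terms."""
    d = math.dist(node["pos"], bs)
    energy_term = node["energy"] / e_max if e_max > 0 else 0.0
    distance_term = 1.0 - d / d_max if d_max > 0 else 1.0
    return alpha * energy_term + (1.0 - alpha) * distance_term

def select_cluster_head(nodes, bs, alpha=0.5):
    """Return the node with the highest score within one cluster."""
    e_max = max(n["energy"] for n in nodes)
    d_max = max(math.dist(n["pos"], bs) for n in nodes)
    return max(nodes, key=lambda n: head_score(n, bs, e_max, d_max, alpha))

# Hypothetical cluster: positions in meters, residual energy in joules.
base_station = (0.0, 0.0)
nodes = [
    {"id": "a", "pos": (3.0, 4.0), "energy": 0.9},
    {"id": "b", "pos": (6.0, 8.0), "energy": 1.0},
    {"id": "c", "pos": (3.0, 4.0), "energy": 0.4},
]

print(select_cluster_head(nodes, base_station)["id"])  # a
```

Node `b` has the most energy but is the farthest from the base station, and node `c` is close but nearly depleted, so node `a` wins under equal weighting. Raising `alpha` toward 1 would favor `b` instead.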
Abstract: Understanding the cell's large-scale organization is an
interesting task in computational biology. Thus, protein-protein
interactions can reveal important organization and function of the
cell. Here, we investigated the correspondence between protein
interactions and function in yeast. We obtained the correlations
among the set of proteins. Then these correlations are clustered using
both the hierarchical and biclustering methods. The detailed analyses
of proteins in each cluster were carried out by making use of their
functional annotations. As a result, we found that some functional
classes appear together in almost all biclusters. On the other hand, in
hierarchical clustering, the dominance of one functional class is
observed. In brief, from interaction data to function, some correlated
results are noticed about the relationship between interaction and
function which might give clues about the organization of the
proteins.
Abstract: Partitioning is a critical area of VLSI CAD. In order to build complex digital logic circuits, it is often essential to subdivide a multi-million transistor design into manageable pieces. This paper looks at various partitioning techniques in VLSI CAD, targeted at various applications. We propose an evolutionary time-series model and a statistical glitch prediction system that uses a neural network with global feature selection based on a clustering model, for partitioning a circuit. For the evolutionary time-series model, we made use of genetic, memetic and neuro-memetic techniques. Our work focuses on the use of clustering methods, namely K-means and EM. A comparative study of all techniques is provided for the problem of circuit partitioning in VLSI design. The performance of all approaches is compared using the MCNC standard cell placement benchmark netlists. Analysis of the experimental results shows that the neuro-memetic model achieves better performance than the other models in recognizing sub-circuits with a minimum number of interconnections between them.
Abstract: Increasing detection rates and reducing false positive
rates are important problems in Intrusion Detection Systems (IDS).
Although preventative techniques such as access control and
authentication attempt to keep intruders out, these can fail, and
intrusion detection has been introduced as a second line of defence.
Rare events are events that occur very infrequently; the detection of
rare events is a common problem in many domains. In this paper we
propose an intrusion detection method that combines rough sets and
fuzzy clustering. Rough sets are used to decrease the amount of data
and remove redundancy. Fuzzy c-means clustering allows objects to
belong to several clusters simultaneously, with different degrees of
membership. Our approach allows us not only to recognize known
attacks but also to detect suspicious activity that may be the result
of a new, unknown attack. The experimental results on the Knowledge
Discovery and Data Mining (KDD Cup 1999) dataset show that the
method is efficient and practical for intrusion detection systems.
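The "degrees of membership" mentioned above can be made concrete with the standard fuzzy c-means membership update, sketched here for a single object and fixed cluster centers. The fuzzifier m = 2, the Euclidean distance, and the toy coordinates are assumptions for illustration; the rough-set reduction step and the paper's actual feature set are not reproduced.

```python
import math

def fcm_memberships(x, centers, m=2.0):
    """Standard fuzzy c-means membership of point x in each cluster:
    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), with m > 1."""
    d = [math.dist(x, c) for c in centers]
    if 0.0 in d:
        # x coincides with one or more centers: share membership among them.
        hits = [1.0 if di == 0.0 else 0.0 for di in d]
        return [h / sum(hits) for h in hits]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** p for dk in d) for di in d]

# One object and two hypothetical cluster centers.
u = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)])
print(u)  # [0.8, 0.2]
```

The object belongs mostly to the nearer cluster but keeps a nonzero membership in the other one, and the memberships always sum to 1. An object with no dominant membership is the kind of case that could be flagged as suspicious rather than forced into a known class.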