Abstract: Rough set theory is used to handle uncertainty and incomplete information by applying two accurate sets, Lower approximation and Upper approximation. In this paper, the rough clustering algorithms are improved by adopting the Similarity, Dissimilarity–Similarity and Entropy based initial centroids selection method on three different clustering algorithms namely Entropy based Rough K-Means (ERKM), Similarity based Rough K-Means (SRKM) and Dissimilarity-Similarity based Rough K-Means (DSRKM) were developed and executed by yeast dataset. The rough clustering algorithms are validated by cluster validity indexes namely Rand and Adjusted Rand indexes. An experimental result shows that the ERKM clustering algorithm perform effectively and delivers better results than other clustering methods. Outlier detection is an important task in data mining and very much different from the rest of the objects in the clusters. Entropy based Rough Outlier Factor (EROF) method is seemly to detect outlier effectively for yeast dataset. In rough K-Means method, by tuning the epsilon (ᶓ) value from 0.8 to 1.08 can detect outliers on boundary region and the RKM algorithm delivers better results, when choosing the value of epsilon (ᶓ) in the specified range. An experimental result shows that the EROF method on clustering algorithm performed very well and suitable for detecting outlier effectively for all datasets. Further, experimental readings show that the ERKM clustering method outperformed the other methods.
Abstract: Clustering is a process of grouping objects and data
into groups of clusters to ensure that data objects from the same
cluster are identical to each other. Clustering algorithms in one of the
area in data mining and it can be classified into partition, hierarchical,
density based and grid based. Therefore, in this paper we do survey
and review four major hierarchical clustering algorithms called
CURE, ROCK, CHAMELEON and BIRCH. The obtained state of
the art of these algorithms will help in eliminating the current
problems as well as deriving more robust and scalable algorithms for
clustering.
Abstract: Web mining is to discover and extract useful
Information. Different users may have different search goals when
they search by giving queries and submitting it to a search engine.
The inference and analysis of user search goals can be very useful for
providing an experience result for a user search query. In this project,
we propose a novel approach to infer user search goals by analyzing
search web logs. First, we propose a novel approach to infer user
search goals by analyzing search engine query logs, the feedback
sessions are constructed from user click-through logs and it
efficiently reflect the information needed for users. Second we
propose a preprocessing technique to clean the unnecessary data’s
from web log file (feedback session). Third we propose a technique
to generate pseudo-documents to representation of feedback sessions
for clustering. Finally we implement k-medoids clustering algorithm
to discover different user search goals and to provide a more optimal
result for a search query based on feedback sessions for the user.
Abstract: Speaker Identification (SI) is the task of establishing
identity of an individual based on his/her voice characteristics. The SI
task is typically achieved by two-stage signal processing: training and
testing. The training process calculates speaker specific feature
parameters from the speech and generates speaker models
accordingly. In the testing phase, speech samples from unknown
speakers are compared with the models and classified. Even though
performance of speaker identification systems has improved due to
recent advances in speech processing techniques, there is still need of
improvement. In this paper, a Closed-Set Tex-Independent Speaker
Identification System (CISI) based on a Multiple Classifier System
(MCS) is proposed, using Mel Frequency Cepstrum Coefficient
(MFCC) as feature extraction and suitable combination of vector
quantization (VQ) and Gaussian Mixture Model (GMM) together
with Expectation Maximization algorithm (EM) for speaker
modeling. The use of Voice Activity Detector (VAD) with a hybrid
approach based on Short Time Energy (STE) and Statistical
Modeling of Background Noise in the pre-processing step of the
feature extraction yields a better and more robust automatic speaker
identification system. Also investigation of Linde-Buzo-Gray (LBG)
clustering algorithm for initialization of GMM, for estimating the
underlying parameters, in the EM step improved the convergence rate
and systems performance. It also uses relative index as confidence
measures in case of contradiction in identification process by GMM
and VQ as well. Simulation results carried out on voxforge.org
speech database using MATLAB highlight the efficacy of the
proposed method compared to earlier work.
Abstract: This paper is concerned with knowledge representation
and extraction of fuzzy if-then rules using Interval Type-2
Context-based Fuzzy C-Means clustering (IT2-CFCM) with the aid of
fuzzy granulation. This proposed clustering algorithm is based on
information granulation in the form of IT2 based Fuzzy C-Means
(IT2-FCM) clustering and estimates the cluster centers by preserving
the homogeneity between the clustered patterns from the IT2 contexts
produced in the output space. Furthermore, we can obtain the
automatic knowledge representation in the design of Radial Basis
Function Networks (RBFN), Linguistic Model (LM), and Adaptive
Neuro-Fuzzy Networks (ANFN) from the numerical input-output data
pairs. We shall focus on a design of ANFN in this paper. The
experimental results on an estimation problem of energy performance
reveal that the proposed method showed a good knowledge
representation and performance in comparison with the previous
works.
Abstract: Clustering involves the partitioning of n objects into k
clusters. Many clustering algorithms use hard-partitioning techniques
where each object is assigned to one cluster. In this paper we propose
an overlapping algorithm MCOKE which allows objects to belong to
one or more clusters. The algorithm is different from fuzzy clustering
techniques because objects that overlap are assigned a membership
value of 1 (one) as opposed to a fuzzy membership degree. The
algorithm is also different from other overlapping algorithms that
require a similarity threshold be defined a priori which can be
difficult to determine by novice users.
Abstract: Leukaemia is a blood cancer disease that contributes
to the increment of mortality rate in Malaysia each year. There are
two main categories for leukaemia, which are acute and chronic
leukaemia. The production and development of acute leukaemia cells
occurs rapidly and uncontrollable. Therefore, if the identification of
acute leukaemia cells could be done fast and effectively, proper
treatment and medicine could be delivered. Due to the requirement of
prompt and accurate diagnosis of leukaemia, the current study has
proposed unsupervised pixel segmentation based on clustering
algorithm in order to obtain a fully segmented abnormal white blood
cell (blast) in acute leukaemia image. In order to obtain the
segmented blast, the current study proposed three clustering
algorithms which are k-means, fuzzy c-means and moving k-means
algorithms have been applied on the saturation component image.
Then, median filter and seeded region growing area extraction
algorithms have been applied, to smooth the region of segmented
blast and to remove the large unwanted regions from the image,
respectively. Comparisons among the three clustering algorithms are
made in order to measure the performance of each clustering
algorithm on segmenting the blast area. Based on the good sensitivity
value that has been obtained, the results indicate that moving kmeans
clustering algorithm has successfully produced the fully
segmented blast region in acute leukaemia image. Hence, indicating
that the resultant images could be helpful to haematologists for
further analysis of acute leukaemia.
Abstract: An extensive amount of work has been done in data
clustering research under the unsupervised learning technique in Data
Mining during the past two decades. Moreover, several approaches
and methods have been emerged focusing on clustering diverse data
types, features of cluster models and similarity rates of clusters.
However, none of the single clustering algorithm exemplifies its best
nature in extracting efficient clusters. Consequently, in order to
rectify this issue, a new challenging technique called Cluster
Ensemble method was bloomed. This new approach tends to be the
alternative method for the cluster analysis problem. The main
objective of the Cluster Ensemble is to aggregate the diverse
clustering solutions in such a way to attain accuracy and also to
improve the eminence the individual clustering algorithms. Due to
the massive and rapid development of new methods in the globe of
data mining, it is highly mandatory to scrutinize a vital analysis of
existing techniques and the future novelty. This paper shows the
comparative analysis of different cluster ensemble methods along
with their methodologies and salient features. Henceforth this
unambiguous analysis will be very useful for the society of clustering
experts and also helps in deciding the most appropriate one to resolve
the problem in hand.
Abstract: Search is the most obvious application of information
retrieval. The variety of widely obtainable biomedical data is
enormous and is expanding fast. This expansion makes the existing
techniques are not enough to extract the most interesting patterns
from the collection as per the user requirement. Recent researches are
concentrating more on semantic based searching than the traditional
term based searches. Algorithms for semantic searches are
implemented based on the relations exist between the words of the
documents. Ontologies are used as domain knowledge for identifying
the semantic relations as well as to structure the data for effective
information retrieval. Annotation of data with concepts of ontology is
one of the wide-ranging practices for clustering the documents. In
this paper, indexing based on concept and annotation are proposed
for clustering the biomedical documents. Fuzzy c-means (FCM)
clustering algorithm is used to cluster the documents. The
performances of the proposed methods are analyzed with traditional
term based clustering for PubMed articles in five different diseases
communities. The experimental results show that the proposed
methods outperform the term based fuzzy clustering.
Abstract: Human face has a fundamental role in the appearance
of individuals. So the importance of facial surgeries is undeniable.
Thus, there is a need for the appropriate and accurate facial skin
segmentation in order to extract different features. Since Fuzzy CMeans
(FCM) clustering algorithm doesn’t work appropriately for
noisy images and outliers, in this paper we exploit Possibilistic CMeans
(PCM) algorithm in order to segment the facial skin. For this
purpose, first, we convert facial images from RGB to YCbCr color
space. To evaluate performance of the proposed algorithm, the
database of Sahand University of Technology, Tabriz, Iran was used.
In order to have a better understanding from the proposed algorithm;
FCM and Expectation-Maximization (EM) algorithms are also used
for facial skin segmentation. The proposed method shows better
results than the other segmentation methods. Results include
misclassification error (0.032) and the region’s area error (0.045) for
the proposed algorithm.
Abstract: The aim of this paper is to assess the influence of several indicators determining innovativeness of countries' economies by applying selected soft computing methods. Such methods enable us to identify correlations between indicators for period 2006-2010. The main attention in the paper is focused on selecting proper computer tools for solving this problem. As a tool supporting identification, the X-means clustering algorithm, the Apriori rules generation algorithm as well as Self-Organizing Feature Maps (SOMs) have been selected. The paper has rather a rudimentary character. We briefly describe usefulness of the selected approaches and indicate some challenges for further research.
Abstract: Clustering in data mining is an unsupervised learning technique of aggregating the data objects into meaningful groups such that the intra cluster similarity of objects are maximized and inter cluster similarity of objects are minimized. Over the past decades several clustering tools were emerged in which clustering algorithms are inbuilt and are easier to use and extract the expected results. Data mining mainly deals with the huge databases that inflicts on cluster analysis and additional rigorous computational constraints. These challenges pave the way for the emergence of powerful expansive data mining clustering softwares. In this survey, a variety of clustering tools used in data mining are elucidated along with the pros and cons of each software.
Abstract: Over the past epoch a rampant amount of work has been done in the data clustering research under the unsupervised learning technique in Data mining. Furthermore several algorithms and methods have been proposed focusing on clustering different data types, representation of cluster models, and accuracy rates of the clusters. However no single clustering algorithm proves to be the most efficient in providing best results. Accordingly in order to find the solution to this issue a new technique, called Cluster ensemble method was bloomed. This cluster ensemble is a good alternative approach for facing the cluster analysis problem. The main hope of the cluster ensemble is to merge different clustering solutions in such a way to achieve accuracy and to improve the quality of individual data clustering. Due to the substantial and unremitting development of new methods in the sphere of data mining and also the incessant interest in inventing new algorithms, makes obligatory to scrutinize a critical analysis of the existing techniques and the future novelty. This paper exposes the comparative study of different cluster ensemble methods along with their features, systematic working process and the average accuracy and error rates of each ensemble methods. Consequently this speculative and comprehensive analysis will be very useful for the community of clustering practitioners and also helps in deciding the most suitable one to rectify the problem in hand.
Abstract: Considering the complexities involved in Cloud computing, there are still plenty of issues that affect the privacy of data in cloud environment. Unless these problems get solved, we think that the problem of preserving privacy in cloud databases is still open. In tokenization and homomorphic cryptography based solutions for privacy preserving cloud database querying, there is possibility that by colluding with service provider adversary may run brute force attacks that will reveal the attribute values.
In this paper we propose a solution by defining the variant of K –means clustering algorithm that effectively detects such brute force attacks and enhances privacy of cloud database querying by preventing this attacks.
Abstract: Support vector clustering (SVC) is an important kernelbased clustering algorithm in multi applications. It has got two main bottle necks, the high computation price and labeling piece. In this paper, we presented a modified SVC method, named Grid–SVC, to improve the original algorithm computationally. First we normalized and then we parted the interval, where the SVC is processing, using a novel Grid–based clustering algorithm. The algorithm parts the intervals, based on the density function of the data set and then applying the cartesian multiply makes multi-dimensional grids. Eliminating many outliers and noise in the preprocess, we apply an improved SVC method to each parted grid in a parallel way. The experimental results show both improvement in time complexity order and the accuracy.
Abstract: The goal of a network-based intrusion detection
system is to classify activities of network traffics into two major
categories: normal and attack (intrusive) activities. Nowadays, data
mining and machine learning plays an important role in many
sciences; including intrusion detection system (IDS) using both
supervised and unsupervised techniques. However, one of the
essential steps of data mining is feature selection that helps in
improving the efficiency, performance and prediction rate of
proposed approach. This paper applies unsupervised K-means
clustering algorithm with information gain (IG) for feature selection
and reduction to build a network intrusion detection system. For our
experimental analysis, we have used the new NSL-KDD dataset,
which is a modified dataset for KDDCup 1999 intrusion detection
benchmark dataset. With a split of 60.0% for the training set and the
remainder for the testing set, a 2 class classifications have been
implemented (Normal, Attack). Weka framework which is a java
based open source software consists of a collection of machine
learning algorithms for data mining tasks has been used in the testing
process. The experimental results show that the proposed approach is
very accurate with low false positive rate and high true positive rate
and it takes less learning time in comparison with using the full
features of the dataset with the same algorithm.
Abstract: This paper presents an optimal and unsupervised satellite image segmentation approach based on Pearson system and k-Means Clustering Algorithm Initialization. Such method could be considered as original by the fact that it utilised K-Means clustering algorithm for an optimal initialisation of image class number on one hand and it exploited Pearson system for an optimal statistical distributions- affectation of each considered class on the other hand. Satellite image exploitation requires the use of different approaches, especially those founded on the unsupervised statistical segmentation principle. Such approaches necessitate definition of several parameters like image class number, class variables- estimation and generalised mixture distributions. Use of statistical images- attributes assured convincing and promoting results under the condition of having an optimal initialisation step with appropriated statistical distributions- affectation. Pearson system associated with a k-means clustering algorithm and Stochastic Expectation-Maximization 'SEM' algorithm could be adapted to such problem. For each image-s class, Pearson system attributes one distribution type according to different parameters and especially the Skewness 'β1' and the kurtosis 'β2'. The different adapted algorithms, K-Means clustering algorithm, SEM algorithm and Pearson system algorithm, are then applied to satellite image segmentation problem. Efficiency of those combined algorithms was firstly validated with the Mean Quadratic Error 'MQE' evaluation, and secondly with visual inspection along several comparisons of these unsupervised images- segmentation.
Abstract: This paper presents an algorithm based on the
wavelet decomposition, for feature extraction from the ECG signal
and recognition of three types of Ventricular Arrhythmias using
neural networks. A set of Discrete Wavelet Transform (DWT)
coefficients, which contain the maximum information about the
arrhythmias, is selected from the wavelet decomposition. After that a
novel clustering algorithm based on nature inspired algorithm (Ant
Colony Optimization) is developed for classifying arrhythmia types.
The algorithm is applied on the ECG registrations from the MIT-BIH
arrhythmia and malignant ventricular arrhythmia databases. We
applied Daubechies 4 wavelet in our algorithm. The wavelet
decomposition enabled us to perform the task efficiently and
produced reliable results.
Abstract: Wireless Sensor Networks consist of inexpensive, low power sensor nodes deployed to monitor the environment and collect
data. Gathering information in an energy efficient manner is a critical aspect to prolong the network lifetime. Clustering algorithms have an advantage of enhancing the network lifetime. Current clustering algorithms usually focus on global re-clustering and local re-clustering separately. This paper, proposed a combination of those two reclustering methods to reduce the energy consumption of the network. Furthermore, the proposed algorithm can apply to homogeneous as well as heterogeneous wireless sensor networks. In addition, the cluster head rotation happens, only when its energy drops below a dynamic threshold value computed by the algorithm. The simulation result shows that the proposed algorithm prolong the network lifetime compared to existing algorithms.
Abstract: We compare three categorical data clustering
algorithms with respect to the problem of classifying cultural data
related to the aesthetic judgment of comics artists. Such a
classification is very important in Comics Art theory since the
determination of any classes of similarities in such kind of data will
provide to art-historians very fruitful information of Comics Art-s
evolution. To establish this, we use a categorical data set and we
study it by employing three categorical data clustering algorithms.
The performances of these algorithms are compared each other,
while interpretations of the clustering results are also given.