Abstract: Search is the most obvious application of information
retrieval. The variety of widely available biomedical data is
enormous and is expanding fast. Because of this expansion, existing
techniques are no longer sufficient to extract the most interesting patterns
from the collection according to user requirements. Recent research
concentrates more on semantic-based searching than on traditional
term-based searching. Algorithms for semantic search are
implemented based on the relations that exist between the words of the
documents. Ontologies are used as domain knowledge for identifying
the semantic relations as well as to structure the data for effective
information retrieval. Annotating data with ontology concepts is
a widely used practice for clustering documents. In
this paper, concept-based and annotation-based indexing are proposed
for clustering biomedical documents. The fuzzy c-means (FCM)
clustering algorithm is used to cluster the documents. The
performance of the proposed methods is compared with traditional
term-based clustering on PubMed articles from five different disease
communities. The experimental results show that the proposed
methods outperform term-based fuzzy clustering.
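As an illustration of the fuzzy c-means step mentioned in this abstract, the following is a minimal sketch of the standard FCM updates in plain Python. It is not the authors' implementation: the function name `fuzzy_c_means`, the fuzzifier `m = 2`, the fixed iteration count, the toy 2-D points and the caller-supplied starting centers are all assumptions made for the example. In the paper's setting the points would be document vectors built from concepts or annotations.

```python
import math

def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Run standard FCM updates and return (centers, memberships).

    points  : list of equal-length tuples of floats
    centers : initial cluster centers, chosen by the caller
    m       : fuzzifier (> 1); m = 2 is a common default
    """
    centers = [tuple(c) for c in centers]
    dim = len(points[0])
    memberships = []
    for _ in range(iterations):
        # Membership update: row[j] is the degree to which a point belongs to cluster j.
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if 0.0 in dists:
                # A point lying exactly on a center belongs fully to that cluster.
                hit = dists.index(0.0)
                row = [1.0 if j == hit else 0.0 for j in range(len(centers))]
            else:
                power = 2.0 / (m - 1.0)
                row = [1.0 / sum((dj / dk) ** power for dk in dists) for dj in dists]
            memberships.append(row)
        # Center update: mean of the points weighted by membership ** m.
        new_centers = []
        for j in range(len(centers)):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            if total == 0.0:
                new_centers.append(centers[j])  # no support: keep the old center
                continue
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)))
        centers = new_centers
    return centers, memberships

# Toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, memberships = fuzzy_c_means(points, [(0.0, 0.0), (10.0, 10.0)])
```

Unlike crisp clustering, each row of `memberships` sums to 1 across clusters, so a document can belong partly to several disease communities.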
Abstract: Textual data plays an important role in the modern
world. The potential for applying data mining techniques to
uncover hidden information in large text
collections is immense. The Growing Self Organizing Map (GSOM)
is a highly successful member of the Self Organising Map family
and has been used as a clustering and visualisation tool across a wide
range of disciplines to discover hidden patterns in data.
A comprehensive analysis of the GSOM’s capabilities as a text
clustering and visualisation tool has so far not been published. These
capabilities, namely map visualisation, automatic
cluster identification and hierarchical clustering, are
presented in this paper and demonstrated with experiments
on a benchmark text corpus.
Abstract: As big data analysis grows in importance, yield prediction using data from the semiconductor process has become essential. In general, yield prediction and analysis of the causes of failure are closely related. The purpose of this study is to analyze the patterns that affect the final test results using die-map-based clustering. Many studies have been conducted using die data from the semiconductor test process. However, such analysis is limited because the test data is less directly related to the final test results. Therefore, this study proposes a framework for analysis through clustering using more detailed data than existing die data. The framework consists of three phases. In the first phase, a die map is created from the fail bit data in each sub-area of the die. In the second phase, clustering is performed on the map data. In the third phase, patterns that affect the final test result are identified. Finally, the three phases are applied to actual industrial data, and the experimental results show the potential for field application.
Abstract: In recent years, a large amount of work has been done on data clustering, an unsupervised learning technique in data mining. Several algorithms and methods have been proposed, focusing on clustering different data types, the representation of cluster models, and the accuracy of the clusters. However, no single clustering algorithm has proved to be the most efficient in providing the best results. To address this issue, a new technique called the cluster ensemble method emerged. The cluster ensemble is a good alternative approach to the cluster analysis problem. Its main aim is to merge different clustering solutions so as to achieve accuracy and improve on the quality of individual data clusterings. The substantial and continuing development of new methods in data mining, together with the persistent interest in inventing new algorithms, makes it necessary to critically analyze the existing techniques and future directions. This paper presents a comparative study of different cluster ensemble methods, covering their features, their systematic working process, and the average accuracy and error rates of each method. This theoretical and comprehensive analysis should be useful to clustering practitioners and help in choosing the most suitable method for the problem at hand.
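One common way of merging clustering solutions is a co-association (evidence accumulation) matrix, sketched below in plain Python. The abstract does not say which ensemble methods the paper compares, so this is only an illustrative assumption. The function names, the 0.5 threshold and the three toy base clusterings are hypothetical.

```python
def co_association(labelings):
    """Fraction of base clusterings in which each pair of items shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]

def consensus_clusters(matrix, threshold=0.5):
    """Link items whose co-association exceeds the threshold and
    return one label per item (connected components of that graph)."""
    n = len(matrix)
    assigned = [-1] * n
    next_label = 0
    for start in range(n):
        if assigned[start] != -1:
            continue
        assigned[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if assigned[j] == -1 and matrix[i][j] > threshold:
                    assigned[j] = next_label
                    stack.append(j)
        next_label += 1
    return assigned

# Three base clusterings of five items; label values are arbitrary per clustering.
base = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
matrix = co_association(base)
consensus = consensus_clusters(matrix)
```

Item 2 is grouped with items 0 and 1 by only one base clustering, but with items 3 and 4 by two, so the consensus places it with the latter.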
Abstract: Psoriasis is a chronic inflammatory skin condition
that affects 2-3% of the population around the world. The Psoriasis Area
and Severity Index (PASI) is the gold standard for assessing psoriasis
severity as well as treatment efficacy. Although it is a gold standard,
PASI is rarely used because it is tedious and complex. In practice,
the PASI score is determined subjectively by dermatologists, so
inter- and intra-rater variations in assessment can occur even
among expert dermatologists. This research develops an algorithm to
assess psoriasis lesions objectively for PASI scoring. The focus of this
research is thickness assessment, one of the four PASI parameters
alongside area, erythema and scaliness. Psoriasis lesion thickness is
measured by averaging the total elevation from lesion base to lesion
surface. Thickness values of 122 3D images taken from 39 patients
are grouped into 4 PASI thickness scores using K-means clustering.
Validation of the lesion base construction is performed using twelve
body curvature models and shows good results, with a coefficient of
determination (R2) equal to 1.
Abstract: In the world of Peer-to-Peer (P2P) networking
different protocols have been developed to make the resource sharing
or information retrieval more efficient. The SemPeer protocol is a
new layer on Gnutella that transforms the connections of the nodes
based on semantic information to make information retrieval more
efficient. However, this transformation causes high clustering in the
network, which decreases the number of nodes reached and therefore the
probability of finding a document. In this paper we
describe a mathematical model for the Gnutella and SemPeer
protocols that captures clustering-related issues, followed by a
proposition to modify the SemPeer protocol to achieve moderate
clustering. This modification is a sort of link management for the
individual nodes that allows the SemPeer protocol to be more
efficient, because the probability of a successful query in the P2P
network is reasonably increased. For the validation of the models, we
ran a series of simulations that supported our results.
Abstract: Clustering is the process of subdividing an input data set into a desired number of subgroups so that members of the same subgroup are similar and members of different subgroups have diverse properties. Many heuristic algorithms have been applied to the clustering problem, which is known to be NP-hard. Genetic algorithms have been used in a wide variety of fields to perform clustering; however, the technique normally has a long running time in terms of input set size. This paper proposes an efficient genetic algorithm for clustering very large data sets, especially image data sets. The genetic algorithm uses the most time-efficient techniques along with preprocessing of the input data set. We test our algorithm on both artificial and real image data sets, both of which are large. The experimental results show that our algorithm outperforms the k-means algorithm in terms of running time as well as the quality of the clustering.
Abstract: K-Means (KM) is considered one of the major
algorithms widely used in clustering. However, it still has some
problems. One of them lies in its initialization step, which is
normally done randomly. Another problem is that KM
converges to local minima. Genetic algorithms are a class of
evolutionary algorithms inspired by nature and used in the field
of clustering. In this paper, we propose two algorithms to solve the
initialization problem, Genetic Algorithm Initializes KM (GAIK) and
KM Initializes Genetic Algorithm (KIGA). To show the effectiveness
and efficiency of our algorithms, a comparative study was done
among GAIK, KIGA, Genetic-based Clustering Algorithm (GCA),
and FCM [19].
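The initialization and local-minimum problems described in this abstract can be reproduced on a tiny example. The sketch below is plain Lloyd-style K-Means on 1-D data, not GAIK or KIGA. The function name `k_means_1d` and the toy values are assumptions made for illustration. It runs the same algorithm from two different sets of starting centers and compares the resulting sum of squared errors (SSE).

```python
def k_means_1d(values, centers, max_iter=100):
    """Lloyd's algorithm on 1-D data from the given starting centers.

    Returns (final_centers, sse), where sse is the sum of squared
    distances from each value to its nearest final center.
    """
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: attach each value to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break  # converged: no center moved
        centers = updated
    sse = sum(min((v - c) ** 2 for c in centers) for v in values)
    return centers, sse

values = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]
good_centers, good_sse = k_means_1d(values, [0.0, 10.0, 20.0])  # one seed per group
poor_centers, poor_sse = k_means_1d(values, [0.0, 1.0, 10.0])   # two seeds in one group
```

With well-spread seeds the run ends at centers 0.5, 10.5 and 20.5 (SSE 1.5). With two seeds in the first group it stops at 0, 1 and 15.5 (SSE 101), a local minimum that further iterations cannot escape. This is the kind of outcome a better initialization strategy aims to avoid.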
Abstract: In terms of total online audience, newspapers are the most successful form of online content to date. The online audience for newspapers continues to demand higher-quality services, including personalized news services. News providers should be able to offer appropriate content to the right users. In this paper, a news article recommender system is proposed that is based on a user's preferences when he or she visits an Internet news site and reads the published articles. This system helps raise user satisfaction and increase customer loyalty toward the content provider.
Abstract: Clustering unstructured text documents is an
important issue in the data mining community and has a number of
applications such as document archive filtering, document
organization, topic detection and subject tracing. In the real
world, some of the already clustered documents may lose
importance while new documents of more significance may emerge.
Most of the work done so far on clustering unstructured text
documents overlooks this aspect of clustering. This paper addresses
this issue by using a Fading Function. The unstructured text
documents are clustered, and for each cluster a statistics structure
called Cluster Profile (CP) is implemented. The cluster profile
incorporates the Fading Function, which keeps
track of the time-dependent importance of the cluster. The work
proposes a novel algorithm, Clustering n-ary Merge Algorithm
(CnMA), for unstructured text documents that uses the Cluster Profile
and the Fading Function. Experimental results illustrating the
effectiveness of the proposed technique are also included.
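The abstract does not give the form of the Fading Function or the contents of the Cluster Profile. The sketch below assumes an exponential decay, 2 ** (-lambda * elapsed), and a profile that stores only a faded document weight. The class name, its fields and the decay rate are hypothetical and are not taken from CnMA.

```python
class ClusterProfile:
    """Hypothetical cluster profile holding a time-faded document weight."""

    def __init__(self, decay_rate):
        self.decay_rate = decay_rate  # assumed lambda > 0; larger values fade faster
        self.weight = 0.0             # faded count of documents in the cluster
        self.last_update = 0.0        # time of the most recent update

    def fade_to(self, now):
        """Decay the stored weight to time `now` by 2 ** (-lambda * elapsed)."""
        elapsed = now - self.last_update
        self.weight *= 2.0 ** (-self.decay_rate * elapsed)
        self.last_update = now

    def add_document(self, now):
        """Fade the existing weight, then count the new document at full weight."""
        self.fade_to(now)
        self.weight += 1.0

profile = ClusterProfile(decay_rate=1.0)
profile.add_document(now=0.0)  # weight becomes 1.0
profile.add_document(now=1.0)  # the older document fades to 0.5, the new one adds 1.0
profile.fade_to(now=2.0)       # one more time unit halves the weight to 0.75
```

Under this assumption, a cluster that stops receiving documents loses weight steadily, so it can be ranked below clusters that are still growing.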
Abstract: Color image segmentation can be considered a
clustering procedure in feature space. k-means and its adaptive
version, the competitive learning approach, are powerful tools
for data clustering. However, k-means and competitive learning suffer
from several drawbacks, such as the dead-unit problem and the need to
pre-specify the number of clusters. In this paper, we explore the
use of a competitive and cooperative learning (CCL) approach to perform
color image segmentation. In the competitive and cooperative
learning approach, seed points not only compete with each other, but
the winner also dynamically selects several of its nearest
competitors to form a cooperative team that adapts to the input
together. As a result, it can automatically select the correct number
of clusters and avoid the dead-unit problem. Experimental
results show that CCL can obtain better segmentation results.
Abstract: In this paper, we present a novel approach to accurately
detect text regions, including the shop name, in signboard images with
complex backgrounds for mobile system applications. The proposed
method is based on the combination of text detection using edge
profiles and region segmentation using the fuzzy c-means method. In the
first step, we apply a Canny edge operator to extract all
possible object edges. Then, edge profile analysis in the vertical and
horizontal directions is performed on these edge pixels to detect
potential text regions containing the shop name in a signboard. The edge
profile and geometrical characteristics of each object contour are
carefully examined to construct candidate text regions and separate the
main text region from the background. Finally, the fuzzy c-means
algorithm is applied to segment and binarize the detected text region.
Experimental results show that the proposed method is robust in text
detection with respect to different character sizes and colors and can
provide reliable text binarization results.
Abstract: Image clustering is the process of grouping images
based on their similarity. Image clustering usually uses color
components, texture, edges, shape, or a mixture of these.
This research aims to explore image clustering using color
composition. To perform this image clustering, three main
components should be considered: the color space, the image
representation (feature extraction), and the clustering method itself. We
aim to find which combination of these factors produces the
best clustering results by combining various techniques from the
three components. The color spaces considered are RGB, HSV, and L*a*b*.
The image representations are the histogram and the Gaussian
Mixture Model (GMM), and the clustering methods are K-Means
and Agglomerative Hierarchical Clustering. The
results of the experiments show that the GMM representation works better
with the RGB and L*a*b* color spaces, whereas the histogram works
better with HSV. The experiments also show that K-Means
performs better than Agglomerative Hierarchical Clustering for image clustering.
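To make the "color space plus histogram representation" combination concrete, the sketch below converts RGB pixels to HSV with Python's standard `colorsys` module and builds a normalized hue histogram. This is a simplified assumption: the paper's actual features are not specified in the abstract. The bin count, the use of hue only, the L1 distance and the toy pixel lists are all hypothetical. A clustering method would then operate on such histogram vectors.

```python
import colorsys

def hue_histogram(pixels, bins=6):
    """Normalized histogram of hue for a list of (r, g, b) pixels in 0..255."""
    counts = [0] * bins
    for r, g, b in pixels:
        h, _s, _v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Hue lies in [0, 1]; clamp so that a value of 1.0 falls in the last bin.
        counts[min(int(h * bins), bins - 1)] += 1
    return [c / len(pixels) for c in counts]

def histogram_distance(a, b):
    """L1 distance between two normalized histograms (0 = identical, 2 = disjoint)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two tiny "images", each given as a flat list of pixels.
reddish = [(255, 0, 0), (250, 10, 10)]
bluish = [(0, 0, 255), (10, 10, 250)]

red_hist = hue_histogram(reddish)
blue_hist = hue_histogram(bluish)
distance = histogram_distance(red_hist, blue_hist)
```

The two pixel lists fall into different hue bins, so their histograms are disjoint and the distance reaches its maximum of 2.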
Abstract: In data mining, fuzzy clustering algorithms have
demonstrated advantages over crisp clustering algorithms in dealing
with the challenges posed by large collections of vague and uncertain
natural data. This paper reviews the concepts of fuzzy logic and fuzzy
clustering. The classical fuzzy c-means algorithm is presented and its
limitations are highlighted. Based on a study of the fuzzy c-means
algorithm and its extensions, we propose a modification to the c-means
algorithm that overcomes its limitations in calculating the
new cluster centers and in finding the membership values with
natural data. The efficiency of the modified method is
demonstrated on real data collected for Bhutan's Gross National
Happiness (GNH) program.
Abstract: Young patients suffering from Cerebral Palsy
face difficult choices concerning major surgery. The diagnosis made
by surgeons can be complex, and the patient's decision
whether or not to undergo such surgery requires considerable
reflection. The proposed software, which combines prediction of
surgeries and post-surgery kinematic values with a 3D model
representing the patient, is an innovative tool helpful to both patients
and medical professionals. Starting from the analysis and
classification of kinematic values, extracted from a gait analysis
database, into 3 separate clusters, it is possible to determine close
similarity between patients. The surgery best suited to
improving a patient's gait is then predicted by a suitably
preconditioned neural network. Finally, the patient's 3D model, based
on the analysis of kinematic values, is animated using the post-surgery
kinematic vectors of the closest patient selected from the
patient clustering.
Abstract: Duplicated region detection is a technique to
expose copy-paste forgeries in digital images. Copy-paste is one
of the common types of forgery, in which a portion of an image is cloned
in order to conceal or duplicate a particular object. In this type of
forgery detection, extracting robust block features and the high
time complexity of the matching step are two main open problems.
This paper concentrates on computational time and proposes a local
block matching algorithm based on block clustering to reduce time
complexity. The time complexity of the proposed algorithm is formulated,
and the effects of two parameters, block size and number of clusters, on
the efficiency of the algorithm are considered. The experimental results
and mathematical analysis demonstrate that this algorithm is more cost-effective
than lexicographical algorithms in terms of time complexity
when the image is complex.
Abstract: Clustering techniques have received attention in many areas including engineering, medicine, biology and data mining. The purpose of clustering is to group together data points that are close to one another. The K-means algorithm is one of the most widely used techniques for clustering. However, K-means has two shortcomings: it depends on the initial state and may converge to local optima, and global solutions of large problems cannot be found with a reasonable amount of computational effort. Many studies on clustering have been carried out to overcome the local optima problem. This paper presents an efficient hybrid evolutionary optimization algorithm based on combining Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO), called PSO-ACO, for optimally clustering N objects into K clusters. The new PSO-ACO algorithm is tested on several data sets, and its performance is compared with those of ACO, PSO and K-means clustering. The simulation results show that the proposed evolutionary optimization algorithm is robust and suitable for handling data clustering.
Abstract: Cluster analysis divides data into groups that are
meaningful, useful, or both. Analysis of biological data is creating a
new generation of epidemiologic, prognostic, diagnostic and
treatment modalities. Clustering of protein sequences is one of the
current research topics in the field of computer science. Linear
relations are valuable in rule discovery for given data, such as "if value
X goes up by 1, value Y will go down by 3". Classical linear
regression models the linear relation between two sequences perfectly.
However, if we need to cluster a large repository of protein sequences
into groups whose sequences have a strong linear relationship with
each other, it is prohibitively expensive to compare sequences one by
one. In this paper, we propose a new technique named General
Regression Model Technique Clustering Algorithm (GRMTCA) to
handle the problem of clustering linear sequences efficiently. GRMT
gives a measure, GR*, that indicates the degree of linearity of multiple
sequences without having to compare each pair of them.
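The pairwise baseline that this abstract contrasts against, classical linear regression between two sequences, can be sketched as follows. This is ordinary least squares in plain Python and does not reproduce GR* or GRMTCA, whose definitions are not given here. The function name `fit_line` and the toy sequences are assumptions chosen to match the "X up by 1, Y down by 3" rule.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = slope * xs + intercept.

    Returns (slope, intercept, r_squared). Both sequences must vary
    (non-zero variance), otherwise a division by zero occurs.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, R^2 equals the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 7.0, 4.0, 1.0]  # Y drops by 3 whenever X rises by 1
slope, intercept, r_squared = fit_line(xs, ys)
```

Here the fit recovers a slope of -3 with R^2 = 1. Repeating such a fit for every pair in a repository of N sequences needs on the order of N^2 comparisons, which is the cost the abstract describes as prohibitive.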
Abstract: To accelerate similarity search in high-dimensional databases, we propose a new hierarchical indexing method. It is composed of offline and online phases, and our contribution concerns both. In the offline phase, after gathering all of the data into clusters and constructing a hierarchical index, the main originality of our contribution is a method for constructing bounding forms of clusters that avoids overlap. In the online phase, our idea considerably improves the performance of similarity search; for this second phase, we have also developed an adapted search algorithm. Our method, named NOHIS (Non-Overlapping Hierarchical Index Structure), uses Principal Direction Divisive Partitioning (PDDP) as its clustering algorithm. The principle of PDDP is to divide data recursively into two sub-clusters; the division is done using the hyperplane that is orthogonal to the principal direction derived from the covariance matrix and passes through the centroid of the cluster to be divided. The data of each of the two resulting sub-clusters are enclosed by a minimum bounding rectangle (MBR). The two MBRs are oriented along the principal direction. Consequently, the two forms are guaranteed not to overlap. Experiments use databases containing image descriptors. Results show that the proposed method outperforms sequential scan and the SR-tree in processing k-nearest neighbors.
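A single PDDP division, as described in this abstract, can be sketched in plain Python. The principal direction is approximated here by power iteration on the covariance matrix, which is an implementation choice for the example rather than something the abstract specifies. The function names, the iteration count, the start vector and the toy points are assumptions. The sketch covers only the split, not the MBR construction or the NOHIS index.

```python
def principal_direction(points, iterations=200):
    """Return (centroid, direction), where direction approximates the leading
    eigenvector of the covariance matrix, found by power iteration."""
    n, dim = len(points), len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - centroid[d] for d in range(dim)] for p in points]
    cov = [[sum(row[a] * row[b] for row in centered) / n for b in range(dim)]
           for a in range(dim)]
    # Arbitrary start vector; power iteration fails if it happens to be
    # exactly orthogonal to the principal direction.
    vec = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0.0:
            break  # degenerate data: no direction can be found
        vec = [x / norm for x in nxt]
    return centroid, vec

def pddp_split(points):
    """Split points with the hyperplane through the centroid that is
    orthogonal to the principal direction."""
    centroid, direction = principal_direction(points)
    left, right = [], []
    for p in points:
        # Signed distance of the point from the hyperplane.
        projection = sum((p[d] - centroid[d]) * direction[d] for d in range(len(p)))
        (left if projection <= 0.0 else right).append(p)
    return left, right

points = [(0.0, 0.0), (1.0, 0.2), (2.0, -0.1), (10.0, 0.1), (11.0, -0.2), (12.0, 0.0)]
left, right = pddp_split(points)
```

The data spread mostly along the first axis, so the hyperplane separates the three points near 0-2 from the three near 10-12. Applying `pddp_split` recursively to each side yields the hierarchy of sub-clusters.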
Abstract: Understanding the cell's large-scale organization is an interesting task in computational biology. Protein-protein interactions can reveal important aspects of the organization and function of the cell. Here, we investigated the correspondence between protein interactions and function in yeast. We obtained the correlations among the set of proteins and then clustered these correlations using both hierarchical and biclustering methods. Detailed analyses of the proteins in each cluster were carried out using their functional annotations. As a result, we found that some functional classes appear together in almost all biclusters. In hierarchical clustering, on the other hand, the dominance of one functional class is observed. In light of the clustering data, we have verified some interactions that were not identified as core interactions in DIP, and we have characterized some functionally unknown proteins according to the interaction data and functional correlation. In brief, going from interaction data to function, we observed correlations between interaction and function that might give clues about the organization of the proteins, help predict new interactions, and help characterize the functions of unknown proteins.