Abstract: An automated wood recognition system is designed to
classify tropical wood species. The wood features are extracted using
two feature extractors: the Basic Grey Level Aura Matrix (BGLAM)
technique and the statistical properties of pores distribution (SPPD)
technique. Due to the nonlinearity of the tropical wood species
separation boundaries, a pre-classification stage is proposed,
consisting of K-means clustering and kernel discriminant analysis
(KDA). Finally, a Linear Discriminant Analysis (LDA) classifier and a
K-Nearest Neighbour (KNN) classifier are implemented for comparison
purposes. The study compares the system with and without the
pre-classification stage using the KNN and LDA classifiers. The
results show that the inclusion of the pre-classification stage
improves the accuracy of both the LDA and KNN classifiers by more
than 12%.
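The pre-classification idea can be sketched as follows. This is a toy illustration using scikit-learn, with synthetic features standing in for the BGLAM/SPPD descriptors and a per-cluster LDA standing in for the paper's K-means + KDA stage:

```python
# Toy sketch: K-means pre-classification followed by a per-cluster LDA
# classifier. Synthetic features stand in for the BGLAM/SPPD descriptors,
# and per-cluster LDA stands in for the paper's KDA stage.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Four hypothetical "species", 50 feature vectors each.
X = np.vstack([rng.normal(3.0 * i, 1.0, size=(50, 10)) for i in range(4)])
y = np.repeat(np.arange(4), 50)

# Pre-classification: coarse partition of the feature space.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# One LDA model per coarse group eases nonlinear class boundaries.
models = {c: LinearDiscriminantAnalysis().fit(X[km.labels_ == c],
                                              y[km.labels_ == c])
          for c in np.unique(km.labels_)}

def classify(x):
    c = km.predict(x.reshape(1, -1))[0]
    return models[c].predict(x.reshape(1, -1))[0]

acc = np.mean([classify(x) == t for x, t in zip(X, y)])
print(acc)
```

Each coarse cluster only needs to separate the species that fall inside it, which is why a linear classifier can suffice after the pre-classification step.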
Abstract: This study proposes a new recommender system based on collaborative folksonomy. The purpose of the proposed system is to recommend Internet resources (such as books, articles, documents, pictures, audio, and video) to users. The proposed method includes four steps: creating the user profile based on tags, grouping similar users into clusters using agglomerative hierarchical clustering, finding similar resources based on the user's past collections by using content-based filtering, and recommending similar items to the target user. This study examines the system's performance on a dataset collected from "del.icio.us", a famous social bookmarking website. Experimental results show that the proposed tag-based hybrid of collaborative and content-based filtering is a promising and effective recommender system for folksonomy-based bookmarking websites.
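The four steps might be sketched roughly as below, with a toy tag matrix in place of real del.icio.us profiles and cosine similarity as the content-based measure (an assumption, not necessarily the paper's exact formulation):

```python
# Sketch of the four-step tag-based hybrid recommendation, with a toy
# tag matrix standing in for real del.icio.us user profiles.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, cosine

# Step 1: user profiles as tag-frequency vectors (rows: users, cols: tags).
profiles = np.array([
    [3, 0, 1, 0],   # user 0
    [2, 1, 0, 0],   # user 1
    [0, 4, 0, 2],   # user 2
    [0, 3, 1, 2],   # user 3
])

# Step 2: agglomerative hierarchical clustering of similar users.
Z = linkage(pdist(profiles, metric="cosine"), method="average")
clusters = fcluster(Z, t=2, criterion="maxclust")

# Step 3: score candidate resources by cosine similarity between their
# tags and the target user's profile (content-based filtering).
target = 0
items = {"bookA": [1, 0, 0, 0], "docB": [0, 1, 0, 1], "vidC": [2, 0, 1, 0]}
def score(item_tags):
    return 1 - cosine(profiles[target], item_tags)

# Step 4: recommend the best-scoring item to the target user.
best = max(items, key=lambda k: score(items[k]))
print(best)
```

In a full system, step 3 would restrict the candidate items to those collected by users in the target user's cluster; the toy example scores a fixed candidate set.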
Abstract: A new technique of topological multi-scale analysis is
introduced. By performing a clustering recursively to build a
hierarchy, and analyzing the co-scale and intra-scale similarities, an
Iterated Function System can be extracted from any data set. The study
of fractals shows that this method is efficient at extracting
self-similarities, and can find elegant solutions to the inverse problem of
building fractals. The theoretical aspects and practical
implementations are discussed, together with examples of analyses of
simple fractals.
Abstract: This paper presents a comparative analysis of a new
unsupervised PCA-based technique for steel plates texture segmentation
towards defect detection. The proposed scheme, called Variance-Based
Component Analysis (VBCA), employs PCA for feature extraction, applies
a feature reduction algorithm based on the variance of eigenpictures,
and classifies the pixels as defective or normal. While the classic
PCA uses a clusterer such as K-means for pixel clustering, VBCA
employs thresholding and some post-processing operations to label
pixels as defective or normal. The experimental results show that the
proposed VBCA algorithm is 12.46% more accurate and 78.85% faster than
the classic PCA.
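A rough sketch of the VBCA pipeline as described (PCA feature extraction, variance-based component selection, thresholding instead of a clusterer), on a synthetic plate with one injected defect. The patch size, the injected defect, and the 3-sigma threshold rule are illustrative assumptions:

```python
# Synthetic sketch of the VBCA idea: PCA feature extraction over patches,
# variance-based component selection, then thresholding instead of a
# clusterer such as K-means.
import numpy as np

rng = np.random.default_rng(1)
img = rng.normal(100, 2, size=(32, 32))      # smooth "steel plate"
img[8:12, 8:12] += 40                        # one injected defect patch

# Non-overlapping 4x4 patches as feature vectors (64 patches, 16 dims).
patches = np.array([img[i:i + 4, j:j + 4].ravel()
                    for i in range(0, 32, 4) for j in range(0, 32, 4)])

# PCA via eigendecomposition of the patch covariance matrix.
centered = patches - patches.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))

# Variance-based reduction: keep only the top "eigenpicture".
proj = centered @ vecs[:, -1]

# Threshold the projection to label patches defective vs. normal.
defective = np.abs(proj) > 3 * proj.std()
print(int(defective.sum()))
```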
Abstract: Granular computing deals with the representation of information in the form of aggregates and with related methods for their transformation and analysis in problem solving. A granulation scheme based on clustering and Rough Set Theory, with a focus on the structured conceptualization of information, is presented in this paper. Experiments with the proposed method on four labeled data sets exhibit good results with reference to the classification problem. The proposed granulation technique is semi-supervised, incorporating global as well as local information in the granulation. To represent the results of the attribute-oriented granulation, a tree structure is proposed.
Abstract: Segmentation in ultrasound images is challenging due to interference from speckle noise and the fuzziness of boundaries. In this paper, a segmentation scheme using fuzzy c-means (FCM) clustering that incorporates both the intensity and texture information of images is proposed to extract breast lesions in ultrasound images. First, the nonlinear structure tensor, which helps refine the edges detected from intensity, is used to extract speckle texture. Then, spatial FCM clustering is applied in the image feature space for segmentation. In experiments with simulated and clinical ultrasound images, spatial FCM clustering with both intensity and texture information yields more accurate results than conventional FCM or spatial FCM without texture information.
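A minimal plain-FCM sketch on a joint intensity + texture feature space; the spatial neighborhood term and the structure-tensor texture extraction of the actual scheme are omitted, and the toy two-column features are an assumption:

```python
# Plain fuzzy c-means on a joint (intensity, texture) feature space.
import numpy as np

def fcm(X, c=2, m=2.0, n_iter=50, seed=0):
    """Plain fuzzy c-means returning memberships U and cluster centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)              # initial fuzzy memberships
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)   # standard FCM update
    return U, centers

# Toy pixels: columns are (intensity, texture strength) for two regions.
rng = np.random.default_rng(0)
lesion = rng.normal([0.2, 0.8], 0.05, size=(100, 2))
tissue = rng.normal([0.7, 0.2], 0.05, size=(100, 2))
X = np.vstack([lesion, tissue])

U, centers = fcm(X)
labels = U.argmax(axis=1)
```

The spatial variant would additionally smooth each pixel's membership using the memberships of its image neighbors before defuzzification.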
Abstract: Clustering is the process of identifying homogeneous
groups of objects, called clusters; it is an interesting topic in
data mining, and the objects within a group or class share similar
characteristics. This paper discusses a robust clustering process for
image data with two dimension-reduction approaches: two-dimensional
principal component analysis (2DPCA) and principal component analysis
(PCA). A standard approach to the high dimensionality of image data
is dimension reduction, which transforms high-dimensional data into a
lower-dimensional space with limited loss of information; one of the
most common forms of dimensionality reduction is PCA. The 2DPCA is
often called a variant of PCA in which the image matrices are treated
directly as 2D matrices; they do not need to be transformed into
vectors, so the image covariance matrix can be constructed directly
from the original image matrices. The decomposed classical covariance
matrix, however, is very sensitive to outlying observations. The
objective of this paper is to compare the performance of the robust
minimum vector variance (MVV) estimator in the two-dimensional
projection (2DPCA) and in PCA for clustering arbitrary image data
when outliers are hidden in the data set. The simulation of
robustness aspects and an illustration of image clustering are
discussed at the end of the paper.
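The 2DPCA construction described above, in which the image covariance is formed directly from the image matrices, can be sketched as follows (the robust MVV estimator itself is beyond this toy example; toy random "images" are used):

```python
# Sketch of 2DPCA: the image covariance is built directly from the
# image matrices, with no flattening into vectors as in classic PCA.
import numpy as np

rng = np.random.default_rng(0)
images = rng.normal(size=(20, 8, 6))         # 20 toy 8x6 "images"

mean_img = images.mean(axis=0)
# G = (1/n) * sum_i (A_i - Abar)^T (A_i - Abar), a 6x6 matrix here.
G = sum((A - mean_img).T @ (A - mean_img) for A in images) / len(images)

vals, vecs = np.linalg.eigh(G)
proj_axes = vecs[:, -2:]                     # top-2 projection axes

# Each image is projected column-wise: Y_i = A_i @ X  (8x2 features).
features = images @ proj_axes
print(features.shape)
```

Note that the covariance G is only 6x6 regardless of image height, whereas classic PCA on flattened 48-dimensional vectors would require a 48x48 covariance matrix; this is the computational advantage the abstract alludes to.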
Abstract: The need for an appropriate system of evaluating students'
educational development is a key problem in achieving predefined
educational goals. The volume of related papers in recent years that
try to prove or disprove the necessity and adequacy of student
assessment corroborates this. Some of these studies have tried to
increase the precision of determining question weights in scientific
examinations, but in all of them the attempt has been to adjust the
initial question weights while the accuracy and precision of those
initial weights remain in question. Thus, in order to increase the
precision of assessing students' educational development, the present
study proposes a new method for determining the initial question
weights by considering factors of the questions such as difficulty,
importance, and complexity, and by implementing a combined method of
PROMETHEE and fuzzy analytic network process with a data mining
approach to improve the model's inputs. The results of the
implemented case study demonstrate the improved performance and
precision of the proposed model.
Abstract: Over the past decade, biclustering has become a popular data mining technique, not only in the field of biological data analysis but also in other applications such as text mining and market data analysis with high-dimensional two-way datasets. Biclustering clusters both rows and columns of a dataset simultaneously, as opposed to traditional clustering, which clusters either rows or columns. It retrieves subgroups of objects that are similar in one subgroup of variables and different in the remaining variables. The Firefly Algorithm (FA) is a recently proposed metaheuristic inspired by the collective behavior of fireflies. This paper provides a preliminary assessment of a discrete version of FA (DFA) on the task of mining coherent, large-volume biclusters from web usage data. The experiments were conducted on two web usage datasets from a public dataset repository, and the performance of DFA was compared with that of another population-based metaheuristic, binary Particle Swarm Optimization (PSO). The results achieved demonstrate the usefulness of DFA in tackling the biclustering problem.
Abstract: This paper focuses on the data-driven generation
of fuzzy IF...THEN rules. The resulting fuzzy rule base can be
applied to build a classifier, a model used for prediction, or
it can be applied to form a decision support system. Among
the wide range of possible approaches, the decision tree and
the association-rule-based algorithms are overviewed, and two
new approaches are presented based on an a priori
fuzzy-clustering-based partitioning of the continuous input variables.
An application study is also presented, where the developed
methods are tested on the well-known Wisconsin Breast Cancer
classification problem.
Abstract: System MEMORI automatically detects and recognizes
rotated and/or rescaled versions of the objects of a database within
digital color images with cluttered backgrounds. This task is accomplished
by means of a region grouping algorithm guided by heuristic
rules, whose parameters concern some geometrical properties and the
recognition score of the database objects. This paper focuses on the
strategies implemented in MEMORI for the estimation of the heuristic
rule parameters. This estimation, being automatic, makes the system
a self-configuring and highly user-friendly tool.
Abstract: This paper presents a new growing neural network for
cluster analysis and market segmentation, which optimizes the size
and structure of clusters by iteratively checking them for multivariate
normality. We combine the recently published SGNN approach [8]
with the basic principle underlying the Gaussian-means algorithm
[13] and the Mardia test for multivariate normality [18, 19]. The new
approach is distinguished from existing ones by its holistic design and
its great autonomy regarding the clustering process as a whole. Its
performance is demonstrated by means of synthetic 2D data and by
real lifestyle survey data usable for market segmentation.
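The underlying "split clusters until they look Gaussian" principle might be sketched as below; scipy's D'Agostino normality test on a 1D projection stands in for the Mardia test, and plain k-means stands in for the SGNN growing network, so this is an illustrative analogue rather than the paper's method:

```python
# Sketch of the split-until-Gaussian principle behind G-means-style
# clustering: recursively bisect any cluster that fails a normality test.
import numpy as np
from scipy.stats import normaltest
from sklearn.cluster import KMeans

def looks_gaussian(X, alpha=1e-3):
    # Project onto the direction of largest variance, test 1D normality.
    X = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return normaltest(X @ vt[0]).pvalue > alpha

def split_until_gaussian(X, max_clusters=8):
    pending, accepted = [X], []
    while pending and len(accepted) + len(pending) <= max_clusters:
        c = pending.pop()
        if len(c) < 25 or looks_gaussian(c):
            accepted.append(c)
        else:
            lab = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(c)
            pending += [c[lab == 0], c[lab == 1]]
    return accepted + pending

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])
parts = split_until_gaussian(data)
print(len(parts))
```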
Abstract: We develop a three-step fuzzy logic-based algorithm for clustering categorical attributes, and we apply it to analyze cultural data. In the first step the algorithm employs an entropy-based clustering scheme, which initializes the cluster centers. In the second step we apply the fuzzy c-modes algorithm to obtain a fuzzy partition of the data set, and the third step introduces a novel cluster validity index, which decides the final number of clusters.
Abstract: In this paper, three basic approaches for extracting
regions of interest (ROI) from stationary images, and different
methods under each of them, are explored. The results obtained for each of
the proposed methods are shown, and it is demonstrated where each
method outperforms the others. Two main problems in ROI
extraction, the channel selection problem and the saliency reversal
problem, are discussed, and it is shown how the various methods
address them. The basic approaches are 1) a saliency-based approach,
2) a wavelet-based approach, and 3) a clustering-based approach. The
saliency-based approach performs well on images containing objects of
high saturation and brightness. The wavelet-based approach performs
well on natural scene images that contain regions of distinct
textures. The mean shift clustering approach partitions the
image into regions according to the density distribution of pixel
intensities. The experimental results of various methodologies show
that each technique performs at different acceptable levels for
various types of images.
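The mean shift clustering approach can be sketched with scikit-learn; the bandwidth, the intensity-only feature space, and the minority-cluster heuristic for picking the ROI are illustrative assumptions:

```python
# Sketch of the clustering-based ROI idea: mean shift partitions pixels
# by the density of their intensities; the minority cluster is a crude ROI.
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(0)
img = rng.normal(50, 3, size=(24, 24))              # dark background
img[8:16, 8:16] = rng.normal(200, 3, size=(8, 8))   # bright object

X = img.reshape(-1, 1)
ms = MeanShift(bandwidth=30).fit(X)
labels = ms.labels_.reshape(img.shape)

# The smaller cluster is taken as the region of interest.
roi_label = np.argmin(np.bincount(labels.ravel()))
roi = labels == roi_label
print(int(roi.sum()))
```

A real pipeline would cluster in a richer feature space (e.g. color and position), but the density-seeking behavior is the same.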
Abstract: Outlier detection in streaming data is very challenging because streaming data cannot be scanned multiple times and new concepts may keep evolving. Irrelevant attributes can be termed noisy attributes, and such attributes further magnify the challenge of working with data streams. In this paper, we propose an unsupervised outlier detection scheme for streaming data. The scheme is based on clustering, an unsupervised data mining task that does not require labeled data; both density-based and partitioning clustering are combined for outlier detection. In this scheme, partitioning clustering is also used to assign adaptive weights to attributes according to their respective relevance; the weighted attributes help reduce or remove the effect of noisy attributes. Keeping in view the challenges of streaming data, the proposed scheme is incremental and adaptive to concept evolution. Experimental results on synthetic and real-world data sets show that our approach outperforms an existing approach (CORM) in terms of outlier detection rate and false alarm rate across increasing percentages of outliers.
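The combination of partitioning and density-based clustering might be sketched as below for a single stream window; the specific weighting rule is an illustrative stand-in, and the incremental, concept-drift-aware parts of the scheme are omitted:

```python
# Sketch of combining partitioning (K-means) and density-based (DBSCAN)
# clustering for outlier detection on one window of a stream.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])
window = np.vstack([inliers, outliers])          # one chunk of the stream

# Partitioning step: derive crude attribute weights from the average
# within-cluster spread (lower spread -> higher weight).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(window)
within = np.zeros(window.shape[1])
for c in range(km.n_clusters):
    pts = window[km.labels_ == c]
    within += pts.std(axis=0) * len(pts) / len(window)
weights = within.max() / within

# Density step: DBSCAN on the weighted window; label -1 marks outliers.
db = DBSCAN(eps=1.0, min_samples=5).fit(window * weights)
flagged = np.where(db.labels_ == -1)[0]
print(flagged.tolist())
```

In a streaming setting, this window-level step would be re-run per chunk with the weights updated incrementally.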
Abstract: In this paper we focus on event extraction from Tamil
news articles. This system utilizes a scoring scheme for extracting
and grouping event-specific sentences. Using this scoring scheme,
event-specific clustering is performed for multiple documents. Events
are extracted from each document using a scoring scheme based on a
feature score and a condition score. Similarly, event-specific
sentences are clustered from multiple documents using this scoring
scheme.
The proposed system builds the Event Template based on a
user-specified query. The templates are filled with event-specific details
like person, location and timeline extracted from the formed clusters.
The proposed system applies these methodologies to Tamil news
articles that have been enconverted into UNL graphs using a
Tamil-to-UNL enconverter. The main aim of this work is to generate an
event-based template.
Abstract: The proliferation of user-generated content (UGC) results in huge opportunities to explore event patterns. However, existing event recommendation systems primarily focus on advanced information technology users. Little work has been done to address novice and low-literacy users. The next billion users providing and consuming UGC are likely to include communities from developing countries who are ready to use affordable technologies for subsistence goals. Therefore, we propose a design framework for providing event recommendations to address the needs of such users. Grounded in information integration theory (IIT), our framework advocates that effective event recommendation is supported by systems capable of (1) reliable information gathering through structured user input, (2) accurate sense making through spatial-temporal analytics, and (3) intuitive information dissemination through interactive visualization techniques. A mobile pest management application is developed as an instantiation of the design framework. Our preliminary study suggests a set of design principles for novice and low-literacy users.
Abstract: Biclustering is a very useful data mining technique for
identifying patterns in which different genes are correlated based on a
subset of conditions in gene expression analysis. Association rule
mining is an efficient way to achieve biclustering, as in the
BIMODULE algorithm, but it is sensitive to the values of its input
parameters and to the discretization procedure used in the
preprocessing step; moreover, when noise is present, classical
association rule miners discover multiple small fragments of the true
bicluster but miss the true bicluster itself. This paper formally
presents a generalized noise-tolerant bicluster model, termed
μBicluster. An iterative algorithm based on the proposed model,
termed BIDENS, is introduced that can discover a set of k possibly
overlapping biclusters simultaneously. Our model uses a more flexible
method to
partition the dimensions to preserve meaningful and significant
biclusters. The proposed algorithm can discover biclusters that are
hard to discover with BIMODULE. An experimental study on yeast and
human gene expression data and several artificial datasets shows that
our algorithm offers substantial improvements over several
previously proposed biclustering algorithms.
Abstract: In the world of Peer-to-Peer (P2P) networking,
different protocols have been developed to make resource sharing
and information retrieval more efficient. The SemPeer protocol is a
new layer on Gnutella that transforms the connections of the nodes
based on semantic information to make information retrieval more
efficient. However, this transformation causes high clustering in the
network, which decreases the number of nodes reached; therefore, the
probability of finding a document is also decreased. In this paper we
describe a mathematical model for the Gnutella and SemPeer
protocols that captures clustering-related issues, followed by a
proposition to modify the SemPeer protocol to achieve moderate
clustering. This modification is a sort of link management for the
individual nodes that allows the SemPeer protocol to be more
efficient, because the probability of a successful query in the P2P
network is reasonably increased. To validate the models, we ran a
series of simulations, which supported our results.
Abstract: Clustering is the process of subdividing an input data set into a desired number of subgroups so that members of the same subgroup are similar and members of different subgroups have diverse properties. Many heuristic algorithms have been applied to the clustering problem, which is known to be NP-hard. Genetic algorithms have been used in a wide variety of fields to perform clustering; however, the technique normally has a long running time in terms of input set size. This paper proposes an efficient genetic algorithm for clustering very large data sets, especially image data sets. The genetic algorithm uses time-efficient techniques along with preprocessing of the input data set. We test our algorithm on both artificial and real image data sets, both of large size. The experimental results show that our algorithm outperforms the k-means algorithm in terms of running time as well as the quality of the clustering.
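A minimal center-encoding genetic clustering sketch (not the paper's algorithm; the population size, operators, and toy data are illustrative assumptions):

```python
# Toy genetic algorithm for clustering: each chromosome encodes K centers,
# fitness is the within-cluster sum of squared errors (lower is better).
import numpy as np

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(m, 0.3, size=(60, 2))
                  for m in ((0, 0), (4, 4), (0, 4))])
K, POP, GEN = 3, 30, 60

def sse(centers):
    # Within-cluster sum of squared distances for one chromosome.
    d = np.linalg.norm(data[:, None, :] - centers[None], axis=2)
    return float((d.min(axis=1) ** 2).sum())

# Each chromosome encodes K candidate centers, seeded from data points.
pop = [data[rng.choice(len(data), K, replace=False)] for _ in range(POP)]
for _ in range(GEN):
    pop.sort(key=sse)                      # elitist selection
    survivors = pop[:POP // 2]
    children = []
    while len(survivors) + len(children) < POP:
        a, b = rng.choice(POP // 2, size=2, replace=False)
        mask = rng.random((K, 1)) < 0.5    # uniform crossover on centers
        child = np.where(mask, survivors[a], survivors[b])
        child = child + rng.normal(0.0, 0.1, child.shape)  # mutation
        children.append(child)
    pop = survivors + children

best = min(pop, key=sse)
labels = np.linalg.norm(data[:, None, :] - best[None], axis=2).argmin(axis=1)
print(round(sse(best), 1))
```

Seeding chromosomes from data points and keeping elite survivors intact are two of the simple time-saving choices such algorithms typically rely on.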