Abstract: Data clustering is an important data exploration
technique with many applications in data mining. The k-means
algorithm is well known for its efficiency in clustering large data
sets. However, this algorithm is best suited to spherical clusters
of similar sizes and densities. The quality of the resulting clusters
decreases when the data set contains spherical clusters with large
variance in size. In this paper, we introduce an effective procedure
to overcome this problem. The proposed method is based on shifting
the center of the large cluster toward the small cluster and recomputing
the membership of the small cluster's points. The experimental
results show that the proposed algorithm produces satisfactory
results.
Abstract: The aim of this paper is to understand how peers can
influence adolescent girls' dieting behaviour and their body image.
Departing from imitation and social learning theories, we study
whether adolescent girls tend to model their peer group dieting
behaviours, thus influencing their body image construction. Our
study was conducted through a survey administered to a cluster sample
of 466 adolescent high school girls in Lisbon city public schools. Our
main findings point to an association between girls' and peers'
dieting behaviours, thus reinforcing the modelling hypothesis.
Abstract: In terms of total online audience, newspapers are the most successful form of online content to date. The online audience for newspapers continues to demand higher-quality services, including personalized news services. News providers should be able to offer appropriate content to the right users. In this paper, a news article recommender system is proposed that is based on a user's preferences, observed when he or she visits an Internet news site and reads the published articles. This system helps raise user satisfaction and increase customer loyalty toward the content provider.
Abstract: Searching for similar documents and document
management are important topics in text mining. One of the
most important parts of similar-document research is the
process of classifying or clustering the documents. In this study, a
similar-document search approach is presented that also addresses
the case of documents belonging to multiple categories (the multiple
categories problem). The proposed method, which is based on Fuzzy
Similarity Classification (FSC), is compared with the Rocchio
algorithm and the naive Bayes method, both of which are widely used
in text mining. Empirical results show that the proposed method is quite
successful and can be applied effectively. For the second stage, a
multiple categories vector method is used, based on how frequently
categories are seen together.
Empirical results show that performance almost doubles
when the proposed method is compared with the classical approach.
Abstract: Microarrays have become effective, broadly used tools in biological and medical research for addressing a wide range of problems, including the classification of disease subtypes and tumors. Many statistical methods are available for analyzing and organizing these complex data into meaningful information, and one of the main goals in analyzing gene expression data is the detection of samples or genes with similar expression patterns. In this paper, we examine and compare the performance of several clustering methods under different data preprocessing strategies, including normalization and noise removal. We also evaluate each of these clustering methods with validation measures on both simulated data and real gene expression data. The results indicate that clustering methods commonly used in microarray data analysis are affected by normalization and by the degree of noise in the datasets.
Abstract: Microarray techniques allow the simultaneous measurement of the expression levels of thousands of mRNAs. By mining these data one can identify the dynamics of the gene expression time series. Using principal component analysis (PCA), we uncover the circadian rhythmic patterns underlying the gene expression profiles of the cyanobacterium Synechocystis. We applied PCA to reduce the dimensionality of the data set. Examination of the components also provides insight into the underlying factors measured in the experiments. Our results suggest that all rhythmic content of the data can be reduced to three main components.
Abstract: To study seed yield and seed yield
components in bean under reduced irrigation and to
assess the drought tolerance of genotypes, 15 lines of white bean
were evaluated in two separate RCB designs with 3 replications under
stress and non-stress conditions. Analysis of variance showed
significant differences among varieties for the traits
under study, indicating the existence of genetic variation among
varieties. The results indicate that drought stress reduced seed yield,
number of seeds per plant, biological yield and number of pods in
white bean. Under non-stress conditions, yield was highly correlated with
biological yield, whereas under stress conditions it was highly
correlated with harvest index. Stepwise regression showed
that selection can be based on biological yield, harvest index,
number of seeds per pod, seed length and 100-seed weight. Path
analysis showed that the highest direct effect, which was positive, was
related to biological yield under non-stress conditions and to harvest
index under stress conditions. Factor analysis was carried out under
stress and non-stress conditions; 4 factors explained more than 76
percent of the total variation. We used several selection indices,
Stress Susceptibility Index (SSI), Geometric Mean Productivity
(GMP), Mean Productivity (MP), Stress Tolerance Index (STI) and
Tolerance Index (TOL), to study the drought tolerance of genotypes. We
found that the best indices for selecting tolerant genotypes
were STI, GMP and MP, which had the greatest correlations with
seed yield under stress and non-stress conditions. In the
classification of genotypes based on phenotypic characteristics using
cluster analysis (UPGMA), all genotypes were classified into 5 separate
groups under stress and non-stress conditions.
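The five selection indices named above have widely used textbook definitions, and a minimal sketch of them is given here. It assumes those standard formulas (the abstract itself does not state them); the function name `drought_indices` and the yield values are hypothetical and are not taken from the study.

```python
import math

def drought_indices(yp, ys, mean_yp, mean_ys):
    """Return common drought tolerance indices for one genotype.

    yp, ys: yield of the genotype under non-stress and stress conditions.
    mean_yp, mean_ys: mean yield of all genotypes under each condition.
    """
    stress_intensity = 1.0 - mean_ys / mean_yp
    return {
        "SSI": (1.0 - ys / yp) / stress_intensity,  # Stress Susceptibility Index
        "MP": (yp + ys) / 2.0,                      # Mean Productivity
        "GMP": math.sqrt(yp * ys),                  # Geometric Mean Productivity
        "STI": (yp * ys) / mean_yp ** 2,            # Stress Tolerance Index
        "TOL": yp - ys,                             # Tolerance Index
    }

# Hypothetical yields (non-stress, stress); illustrative values only.
yields = {"line_A": (3000.0, 1800.0), "line_B": (2500.0, 2000.0)}
mean_yp = sum(yp for yp, _ in yields.values()) / len(yields)
mean_ys = sum(ys for _, ys in yields.values()) / len(yields)
table = {
    name: drought_indices(yp, ys, mean_yp, mean_ys)
    for name, (yp, ys) in yields.items()
}
```

In this toy data, line_A loses more yield under stress than line_B, so it gets the larger SSI and TOL values, that is, it is the more susceptible of the two.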
Abstract: Clustering unstructured text documents is an
important issue in the data mining community and has a number of
applications, such as document archive filtering, document
organization, topic detection and subject tracing. In the real
world, some of the already clustered documents may not be of
importance while new documents of more significance may evolve.
Most of the work done so far in clustering unstructured text
documents overlooks this aspect of clustering. This paper addresses
this issue by using the Fading Function. The unstructured text
documents are clustered, and for each cluster a statistics structure
called Cluster Profile (CP) is implemented. The cluster profile
incorporates the Fading Function. This Fading Function keeps an
account of the time-dependent importance of the cluster. The work
proposes a novel algorithm, Clustering n-ary Merge Algorithm
(CnMA), for unstructured text documents that uses the Cluster Profile
and the Fading Function. Experimental results illustrating the
effectiveness of the proposed technique are also included.
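The abstract does not give the form of the Fading Function or the contents of the Cluster Profile. As a rough illustration only, the sketch below assumes an exponential fading weight of the form 2^(-decay_rate * elapsed), a common choice in stream clustering, and a hypothetical `ClusterProfile` that tracks a faded document count. None of these names or formulas come from the paper.

```python
from dataclasses import dataclass

def fading(elapsed, decay_rate):
    """Assumed exponential fading weight: 2^(-decay_rate * elapsed)."""
    return 2.0 ** (-decay_rate * elapsed)

@dataclass
class ClusterProfile:
    """Hypothetical cluster statistics with time-dependent importance."""
    weight: float = 0.0       # faded count of documents in the cluster
    last_update: float = 0.0  # time of the most recent update

    def add_document(self, now, decay_rate):
        # Fade the existing weight to the current time, then count the new document.
        self.weight = self.weight * fading(now - self.last_update, decay_rate) + 1.0
        self.last_update = now

    def importance(self, now, decay_rate):
        # Importance at time `now` without modifying the profile.
        return self.weight * fading(now - self.last_update, decay_rate)

profile = ClusterProfile()
profile.add_document(now=0.0, decay_rate=1.0)
profile.add_document(now=1.0, decay_rate=1.0)
```

With this form, a cluster that stops receiving documents halves in importance every 1/decay_rate time units, so stale clusters can be identified by a threshold on `importance`.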
Abstract: Market segmentation is one of the most
fundamental strategic marketing concepts. The better the
segment chosen for targeting by a particular
organisation, the more successful the organisation is assumed to
be in the marketplace. Higher education institutions also have to
improve their marketing tools for attracting foreign students,
particularly when charging tuition fees. This contribution
aims to demonstrate the proper use of cluster analysis
for segmentation (represented by students' willingness to study
abroad) and, based on a large international survey, offers
some practical marketing implications.
Abstract: To collect data from all sensor nodes, some
changes to the Dynamic Source Routing (DSR) protocol are proposed. At
each hop level, a route-ranking technique is used to distribute
packets dynamically across different selected routes. The rank
of a route is calculated from parameters such as delay, residual
energy and probability of packet loss. A hybrid topology of
DMPR (Disjoint Multi Path Routing) and MMPR (Meshed Multi Path
Routing) is formed, where a braided topology is used in different
faulty zones of the network. To reduce energy consumption, variable
transmission ranges are used instead of a fixed transmission range. To
reduce the number of dropped packets, a fuzzy logic inference scheme is
used to insert different types of delays dynamically. A rule-based
system infers the membership function strength, which is used to
calculate the final delay to be inserted at each node
in the different clusters.
In the braided path, a 'Dual Line ACK Link' scheme is
proposed for sending an ACK signal from a damaged node or link to a
parent node, to ensure that no link-error or node-failure
message is lost. This paper develops the
theoretical aspects of a model that may be applied to collecting
data from any large hanging iron structure with the help of a wireless
sensor network. Analyzing these data is the subject of materials
science and civil structural construction technology and is outside
the scope of this paper.
Abstract: In this paper a one-dimensional Self Organizing Map
(SOM) algorithm to perform feature selection is presented. The
algorithm is based on a first classification of the input dataset on a
similarity space. From this classification for each class a set of
positive and negative features is computed. This set of features is
selected as result of the procedure. The procedure is evaluated on an
in-house dataset from a Knowledge Discovery from Text (KDT)
application and on a set of publicly available datasets used in
international feature selection competitions. These datasets come
from KDT applications, drug discovery as well as other applications.
The knowledge of the correct classification available for the training
and validation datasets is used to optimize the parameters for positive
and negative feature extractions. The process becomes feasible for
large and sparse datasets, such as those obtained in KDT applications,
by using both compression techniques to store the similarity matrix
and speed-up techniques for the Kohonen algorithm that take
advantage of the sparsity of the input matrix. These improvements
make it feasible to apply the methodology to massive datasets by
using the grid.
Abstract: In this study, a network quality of service (QoS)
evaluation system was proposed. The system used a combination of
fuzzy C-means (FCM) and regression model to analyse and assess the
QoS in a simulated network. Network QoS parameters of multimedia
applications were intelligently analysed by FCM clustering
algorithm. The QoS parameters for each FCM cluster centre were
then input to a regression model in order to quantify the overall
QoS. The proposed QoS evaluation system provided valuable
information about the network's QoS patterns and, based on this
information, the network's overall QoS was effectively quantified.
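For readers unfamiliar with the clustering step, the following is a minimal sketch of the standard fuzzy C-means iteration on one-dimensional data. It is not the authors' implementation: the delay values, the initial centres, the fuzzifier m=2 and the fixed iteration count are all assumptions made for illustration, and the regression stage is not shown.

```python
def memberships_for(points, centres, m):
    """Membership of every point in every cluster (each row sums to 1)."""
    eps = 1e-12  # avoids division by zero when a point sits on a centre
    exponent = 2.0 / (m - 1.0)
    rows = []
    for x in points:
        dists = [abs(x - c) + eps for c in centres]
        rows.append(
            [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]
        )
    return rows

def fuzzy_c_means(points, centres, m=2.0, iterations=50):
    """Alternate membership and centre updates for a fixed number of iterations."""
    for _ in range(iterations):
        u = memberships_for(points, centres, m)
        # Each centre is the mean of the points weighted by membership^m.
        centres = [
            sum(row[i] ** m * x for row, x in zip(u, points))
            / sum(row[i] ** m for row in u)
            for i in range(len(centres))
        ]
    return centres, memberships_for(points, centres, m)

# Hypothetical packet delays in milliseconds; not data from the study.
delays_ms = [10.0, 12.0, 11.0, 100.0, 105.0, 98.0]
centres, u = fuzzy_c_means(delays_ms, centres=[0.0, 50.0])
```

Here the two centres settle near the low-delay and high-delay groups. In a system like the one described, such centres would be the values passed on to the regression model.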
Abstract: Fuzzy Cognitive Maps (FCMs) have successfully
been applied in numerous domains to show relations between
essential components. Some FCMs contain many nodes that are
related to each other, and more nodes mean more complex system
behaviors and analysis. In this paper, a novel learning method is used to
construct FCMs from historical data, and, by using data mining
and the DEMATEL method, a new method is defined to reduce the number
of nodes. This method clusters the nodes of an FCM based on their cause and
effect behaviors.
Abstract: Most biclustering/projected clustering algorithms are based either on the Euclidean distance or on the correlation coefficient, which capture only linear relationships. However, in many applications, such as gene expression data and word-document data, nonlinear relationships may exist between the objects. Mutual information between two variables provides a more general criterion for investigating dependencies among variables. In this paper, we improve upon our previous algorithm that uses mutual information for biclustering, in terms of both computation time and the type of clusters identified. The algorithm is able to find biclusters with mixed relationships and is faster than the previous one. To the best of our knowledge, no other existing biclustering algorithm has used mutual information as a similarity measure. We present experimental results on synthetic data as well as on yeast expression data. Biclusters on the yeast data were found to be biologically and statistically significant using GO Tool Box and FuncAssociate.
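The contrast between correlation and mutual information can be seen in a small example. The sketch below uses the standard plug-in estimate of mutual information for discrete variables and the ordinary Pearson coefficient. The data are a hypothetical toy case (y = x squared), and this is not the biclustering algorithm itself.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) written with raw counts
        mi += p_xy * math.log2(count * n / (px[x] * py[y]))
    return mi

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# A purely nonlinear dependence: y is fully determined by x, yet uncorrelated.
xs = [-1, 0, 1, -1, 0, 1]
ys = [x * x for x in xs]
r = pearson(xs, ys)
mi = mutual_information(xs, ys)
```

For this data the correlation is zero while the mutual information is about 0.92 bits, which illustrates why a correlation-based similarity can miss relationships that mutual information detects.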
Abstract: There are two major variants of the Simplex
Algorithm: the revised method and the standard, or tableau method.
Today, all serious implementations are based on the revised method
because it is more efficient for sparse linear programming problems.
However, there are a number of applications that lead to dense linear
problems, so our aim in this paper is to present some computational
results on a parallel implementation of the dense Simplex Method. Our
implementation runs on an SMP cluster and uses the C
programming language and the Message Passing Interface (MPI).
Preliminary computational results on randomly generated dense
linear programs support this approach.
Abstract: Mobile ad-hoc networks (MANETs) are a form of
wireless network that does not require a base station to provide
network connectivity. Mobile ad-hoc networks have many
characteristics that distinguish them from other wireless networks
and make routing in such networks a challenging task. Cluster
based routing is one of the routing schemes for MANETs in which
various clusters of mobile nodes are formed with each cluster having
its own clusterhead which is responsible for routing among clusters.
In this paper we propose and implement a distributed
weighted clustering algorithm for MANETs. This approach is based
on a combined weight metric that takes into account several system
parameters, such as the node degree, transmission range, energy and
mobility of the nodes. We evaluate the performance of the
proposed scheme through simulation in various network situations.
Simulation results show that the proposed scheme outperforms the
original distributed weighted clustering algorithm (DWCA).
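The abstract does not state how the parameters are combined. The sketch below shows one plausible shape for such a metric: a weighted sum in which the node with the lowest score becomes clusterhead. The field names, the ideal degree, and the weighting factors are all hypothetical. Transmission range enters only indirectly here, through the number of neighbours within range.

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    degree: int         # number of neighbours within transmission range
    mobility: float     # average relative speed, arbitrary units
    energy_used: float  # fraction of battery already consumed, 0..1

def combined_weight(node, ideal_degree=3, w_degree=0.5, w_mobility=0.3, w_energy=0.2):
    """Assumed weighted-sum metric; a lower value means a better clusterhead."""
    return (
        w_degree * abs(node.degree - ideal_degree)
        + w_mobility * node.mobility
        + w_energy * node.energy_used
    )

def elect_clusterhead(nodes):
    """Pick the node with the smallest combined weight in a neighbourhood."""
    return min(nodes, key=combined_weight)

# Hypothetical neighbourhood of three nodes.
neighbourhood = [
    Node("a", degree=3, mobility=0.1, energy_used=0.2),
    Node("b", degree=6, mobility=0.5, energy_used=0.1),
    Node("c", degree=2, mobility=0.9, energy_used=0.7),
]
head = elect_clusterhead(neighbourhood)
```

Penalising the distance from an ideal degree, rather than rewarding a high degree, keeps a clusterhead from being overloaded with members. Whether the paper makes the same choice is not stated.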
Abstract: Today, money laundering (ML) poses a serious threat
not only to financial institutions but also to the nation. This criminal
activity is becoming more and more sophisticated and seems to have
moved from the cliché of drug trafficking to financing terrorism and
surely not forgetting personal gain. Most international financial
institutions have been implementing anti-money laundering (AML)
solutions to fight investment fraud. However, traditional investigative
techniques consume numerous man-hours. Recently, data mining
approaches have been developed and are considered well suited
to detecting ML activities. Within the scope of a
collaboration project to develop a new solution for
the AML Units in an international investment bank, we proposed a
data mining-based solution for AML. In this paper, we present a
heuristic approach to improve the performance of this solution. We
also show some preliminary results obtained with this method on
transaction datasets.
Abstract: This paper presents a new technique for detection of
human faces within color images. The approach relies on image
segmentation based on skin color, features extracted from the two-dimensional
discrete cosine transform (DCT), and self-organizing
maps (SOM). After candidate skin regions are extracted, feature
vectors are constructed using DCT coefficients computed from those
regions. A supervised SOM training session is used to cluster feature
vectors into groups, and to assign "face" or "non-face" labels to those
clusters. Evaluation was performed using a new image database of
286 images, containing 1027 faces. After training, our detection
technique achieved a detection rate of 77.94% during subsequent
tests, with a false positive rate of 5.14%. To our knowledge, the
proposed technique is the first to combine DCT-based feature
extraction with a SOM for detecting human faces within color
images. It is also one of a few attempts to combine a feature-invariant
approach, such as color-based skin segmentation, together with
appearance-based face detection. The main advantage of the new
technique is its low computational requirements, in terms of both
processing speed and memory utilization.
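To make the feature-extraction step concrete, the sketch below computes an orthonormal two-dimensional DCT-II of a small block and keeps the low-frequency corner as a feature vector. The block size, the choice of which coefficients to keep, and the function names are assumptions for illustration. The abstract does not specify them.

```python
import math

def dct_1d(values):
    """Orthonormal DCT-II of a sequence."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(
            v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out

def dct_2d(block):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # cols is indexed [column][vertical frequency]; transpose back to row-major.
    return [list(row) for row in zip(*cols)]

def low_frequency_features(block, size):
    """Top-left size x size coefficients, flattened (an assumed selection rule)."""
    coeffs = dct_2d(block)
    return [coeffs[r][c] for r in range(size) for c in range(size)]

# A flat 4x4 block: all of its energy ends up in the DC coefficient.
flat_block = [[10.0] * 4 for _ in range(4)]
coeffs = dct_2d(flat_block)
features = low_frequency_features(flat_block, size=2)
```

Keeping only a few low-frequency coefficients gives short feature vectors, which is in line with the low computational cost the abstract claims for the technique.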
Abstract: Color image segmentation can be considered a
clustering procedure in feature space. k-means and its adaptive
version, the competitive learning approach, are powerful tools
for data clustering. However, k-means and competitive learning suffer
from several drawbacks, such as the dead-unit problem and the need to
pre-specify the number of clusters. In this paper, we explore the
use of a competitive and cooperative learning (CCL) approach to perform
color image segmentation. In the competitive and cooperative
learning approach, seed points not only compete with each other, but
the winner also dynamically selects several nearest
competitors to form a cooperative team that adapts to the input
together. As a result, the approach can automatically select the correct
number of clusters and avoid the dead-unit problem. Experimental
results show that CCL can obtain better segmentation results.
Abstract: The Neuro-Fuzzy hybridization scheme has attracted
research interest in pattern classification over the past decade. The
present paper proposes a novel Modified Adaptive Fuzzy Inference
Engine (MAFIE) for pattern classification. A modified Apriori
algorithm technique is used to derive a minimal set of decision
rules from input-output data sets. A TSK-type fuzzy inference
system is constructed by the automatic generation of membership
functions and rules by the fuzzy c-means clustering and Apriori
algorithm technique, respectively. The generated adaptive fuzzy
inference engine is adjusted by the least-squares fit and a conjugate
gradient descent algorithm towards better performance with a
minimal set of rules. The proposed MAFIE is able to reduce the
number of rules, which otherwise increases exponentially as more input
variables are involved. The performance of the proposed MAFIE is
compared with other existing pattern classification
schemes using Fisher's Iris and Wisconsin breast cancer data sets and
is shown to be very competitive.