K-Means for Spherical Clusters with Large Variance in Sizes

Data clustering is an important data exploration technique with many applications in data mining. The k-means algorithm is well known for its efficiency in clustering large data sets. However, this algorithm is best suited to spherical clusters of similar sizes and densities. The quality of the resulting clusters decreases when the data set contains spherical clusters with a large variance in sizes. In this paper, we introduce a competent procedure to overcome this problem. The proposed method is based on shifting the center of the large cluster toward the small cluster and recomputing the membership of the small cluster's points. The experimental results reveal that the proposed algorithm produces satisfactory results.

Modelling Peer Group Dieting Behaviour

The aim of this paper is to understand how peers can influence adolescent girls' dieting behaviour and their body image. Drawing on imitation and social learning theories, we study whether adolescent girls tend to model their peer group's dieting behaviours, thus influencing the construction of their body image. Our study was conducted through a survey administered to a cluster sample of 466 adolescent high school girls in Lisbon city public schools. Our main findings point to an association between girls' and peers' dieting behaviours, thus reinforcing the modelling hypothesis.

Searching for Similar Informational Articles in the Internet Channel

In terms of total online audience, newspapers are the most successful form of online content to date. The online audience for newspapers continues to demand higher-quality services, including personalized news services. News providers should be able to offer appropriate content to the right users. In this paper, a news article recommender system is suggested that is based on a user's preferences when he or she visits an Internet news site and reads the published articles. This system helps raise the user's satisfaction and increase customer loyalty toward the content provider.

A Study on Finding Similar Document with Multiple Categories

Similar document search and document management have an important place in text mining. One of the most important parts of similar document research is the process of classifying or clustering the documents. In this study, a similar document search approach is presented that also addresses the case of documents belonging to multiple categories (the multiple categories problem). The proposed method, which is based on Fuzzy Similarity Classification (FSC), has been compared with the Rocchio algorithm and the naive Bayes method, which are widely used in text mining. Empirical results show that the proposed method is quite successful and can be applied effectively. For the second stage, a multiple categories vector method based on how frequently categories are seen together has been used. Empirical results show that performance almost doubles when the proposed method is compared with the classical approach.

Evaluation of Clustering Based on Preprocessing in Gene Expression Data

Microarrays have become effective, broadly used tools in biological and medical research for addressing a wide range of problems, including the classification of disease subtypes and tumors. Many statistical methods are available for analyzing and systematizing these complex data into meaningful information, and one of the main goals in analyzing gene expression data is the detection of samples or genes with similar expression patterns. In this paper, we present and compare the performance of several clustering methods based on data preprocessing, including strategies for normalization and noise cleaning. We also evaluate each of these clustering methods with validation measures for both simulated data and real gene expression data. The results indicate that clustering methods which are commonly used in microarray data analysis are affected by normalization and by the degree of noise and clearness of the datasets.

Dynamical Analysis of Circadian Gene Expression

The microarray technique allows the simultaneous measurement of the expression levels of thousands of mRNAs. By mining these data one can identify the dynamics of the gene expression time series. Using principal component analysis (PCA), we uncover the circadian rhythmic patterns underlying the gene expression profiles from the cyanobacterium Synechocystis. We applied PCA to reduce the dimensionality of the data set. Examination of the components also provides insight into the underlying factors measured in the experiments. Our results suggest that all rhythmic content of the data can be reduced to three main components.
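As a rough illustration of the dimensionality-reduction step described above, the following sketch applies PCA (computed via SVD) to a synthetic expression matrix. The data, the 24-hour period, the gene-by-time-point layout and the function name `pca_reduce` are assumptions made for illustration only; they are not the authors' data or code.

```python
import numpy as np

def pca_reduce(X, n_components=3):
    """Project the rows of X onto the leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # centre each column (time point)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)              # fraction of variance per component
    scores = Xc @ Vt[:n_components].T            # coordinates in the reduced space
    return scores, Vt[:n_components], explained

# Synthetic stand-in for expression profiles: 200 genes sampled every 4 h for 48 h,
# each oscillating with an assumed 24 h period, random phase and amplitude, plus noise.
rng = np.random.default_rng(0)
t = np.arange(0, 48, 4)
phases = rng.uniform(0, 2 * np.pi, size=200)
amps = rng.uniform(0.5, 2.0, size=200)
X = amps[:, None] * np.cos(2 * np.pi * t[None, :] / 24 + phases[:, None])
X += 0.05 * rng.normal(size=X.shape)

scores, components, explained = pca_reduce(X, n_components=3)
print(np.round(explained[:3], 3))
```

In this toy setting nearly all of the variance falls on the first two components, because a single-period oscillation with arbitrary phase spans a cosine and a sine direction; real expression data would generally need inspection of the `explained` vector to decide how many components to keep.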

Influence of Drought on Yield and Yield Components in White Bean

In order to study seed yield and seed yield components in bean under reduced irrigation and to assess the drought tolerance of genotypes, 15 lines of white bean were evaluated in two separate randomized complete block (RCB) designs with 3 replications under stress and non-stress conditions. Analysis of variance showed that there were significant differences among varieties in terms of the traits under study, indicating the existence of genetic variation among varieties. The results indicate that drought stress reduced seed yield, number of seeds per plant, biological yield and number of pods in white bean. Under non-stress conditions, yield was highly correlated with biological yield, whereas under stress conditions it was highly correlated with harvest index. Results of stepwise regression showed that selection can be done based on biological yield, harvest index, number of seeds per pod, seed length and 100-seed weight. Results of path analysis showed that the highest direct effect, which was positive, was related to biological yield under non-stress conditions and to harvest index under stress conditions. Factor analysis was carried out under stress and non-stress conditions; 4 factors explained more than 76 percent of the total variation. We used several selection indices, namely Stress Susceptibility Index (SSI), Geometric Mean Productivity (GMP), Mean Productivity (MP), Stress Tolerance Index (STI) and Tolerance Index (TOL), to study the drought tolerance of genotypes. We found that the best indices for selecting tolerant genotypes were STI, GMP and MP, which showed the greatest correlations with seed yield under stress and non-stress conditions. In the classification of genotypes based on phenotypic characteristics using cluster analysis (UPGMA), all were classified into 5 separate groups under stress and non-stress conditions.
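The abstract names the selection indices but does not give their formulas. The sketch below uses the definitions that are commonly cited for these indices and assumes they match the ones used in the study; the yield values and the function name `drought_indices` are hypothetical.

```python
import numpy as np

def drought_indices(yp, ys):
    """Commonly used drought indices per genotype.

    yp: yield under non-stress conditions, ys: yield under stress conditions.
    The formulas are the widely cited definitions, assumed here for illustration.
    """
    yp = np.asarray(yp, dtype=float)
    ys = np.asarray(ys, dtype=float)
    yp_mean, ys_mean = yp.mean(), ys.mean()
    stress_intensity = 1.0 - ys_mean / yp_mean        # severity of stress in the trial
    return {
        "SSI": (1.0 - ys / yp) / stress_intensity,    # Stress Susceptibility Index
        "GMP": np.sqrt(yp * ys),                      # Geometric Mean Productivity
        "MP": (yp + ys) / 2.0,                        # Mean Productivity
        "STI": (yp * ys) / yp_mean**2,                # Stress Tolerance Index
        "TOL": yp - ys,                               # Tolerance Index
    }

# Hypothetical yields for three genotypes (arbitrary units)
yp = [30.0, 25.0, 20.0]
ys = [15.0, 20.0, 10.0]
idx = drought_indices(yp, ys)
for name, values in idx.items():
    print(name, np.round(values, 3))
```

With these made-up numbers the second genotype has the highest STI and GMP and the lowest SSI and TOL, which is the pattern one would look for when selecting a tolerant genotype.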

Clustering Unstructured Text Documents Using Fading Function

Clustering unstructured text documents is an important issue in the data mining community and has a number of applications, such as document archive filtering, document organization, topic detection and subject tracing. In the real world, some of the already clustered documents may lose importance while new documents of greater significance may appear. Most of the work done so far on clustering unstructured text documents overlooks this aspect of clustering. This paper addresses this issue by using a Fading Function. The unstructured text documents are clustered, and for each cluster a statistics structure called a Cluster Profile (CP) is implemented. The Cluster Profile incorporates the Fading Function, which keeps an account of the time-dependent importance of the cluster. The work proposes a novel algorithm, the Clustering n-ary Merge Algorithm (CnMA), for unstructured text documents that uses the Cluster Profile and the Fading Function. Experimental results illustrating the effectiveness of the proposed technique are also included.
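The abstract does not specify the form of the Fading Function or the contents of the Cluster Profile. As a minimal sketch, the code below assumes an exponential fading function of the form 2^(-decay * elapsed time), which is a common choice in stream clustering, and a profile that stores a faded weight and a faded sum of document vectors. The class layout, field names and `decay` parameter are all hypothetical.

```python
import numpy as np

class ClusterProfile:
    """Hypothetical cluster profile with exponentially faded statistics."""

    def __init__(self, dim, decay=0.1):
        self.decay = decay                  # assumed fading rate
        self.weight = 0.0                   # faded number of documents
        self.linear_sum = np.zeros(dim)     # faded sum of document vectors
        self.last_update = 0.0

    def fade(self, now):
        """Down-weight the stored statistics by 2^(-decay * elapsed time)."""
        factor = 2.0 ** (-self.decay * (now - self.last_update))
        self.weight *= factor
        self.linear_sum *= factor
        self.last_update = now

    def add(self, vector, now):
        """Fade old content, then absorb a new document vector with weight 1."""
        self.fade(now)
        self.weight += 1.0
        self.linear_sum += np.asarray(vector, dtype=float)

    def centroid(self):
        if self.weight == 0.0:
            raise ValueError("empty cluster profile")
        return self.linear_sum / self.weight

cp = ClusterProfile(dim=2, decay=1.0)
cp.add([1.0, 0.0], now=0.0)   # older document
cp.add([0.0, 1.0], now=1.0)   # newer document counts twice as much at decay=1.0
print(cp.weight, cp.centroid())
```

Under this assumption a cluster that stops receiving documents sees its weight shrink over time, so a threshold on `weight` could be used to retire clusters that are no longer important.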

Marketing Segmentation of Students Willing to Study Abroad based on Cluster Analysis

Market segmentation is one of the most fundamental strategic marketing concepts. The better the segment chosen for targeting by a particular organisation, the more successful the organisation is assumed to be in the marketplace. Higher education institutions also have to improve their marketing tools for attracting foreign students, particularly when charging tuition fees. This contribution aims to demonstrate the proper use of cluster analysis for segmentation (represented by students' willingness to study abroad) and, based on a large international survey, offers some practical marketing implications.

Multipath Routing Sensor Network for Finding Crack in Metallic Structure Using Fuzzy Logic

To collect data from all sensor nodes, some changes to the Dynamic Source Routing (DSR) protocol are proposed. At each hop level, a route-ranking technique is used to distribute packets dynamically across the selected routes. The rank of a route is calculated from parameters such as delay, residual energy and probability of packet loss. A hybrid topology of DMPR (Disjoint Multi Path Routing) and MMPR (Meshed Multi Path Routing) is formed, where a braided topology is used in the different faulty zones of the network. To reduce energy consumption, variable transmission ranges are used instead of a fixed transmission range. To reduce the number of dropped packets, a fuzzy logic inference scheme is used to insert different types of delays dynamically. A rule-based system infers the membership function strength, which is used to calculate the final delay to be inserted at each node in the different clusters. In the braided path, a 'Dual Line ACK Link' scheme is proposed for sending an ACK signal from a damaged node or link to a parent node, to ensure that no link-error or node-failure message is lost. This paper develops the theoretical aspects of a model which may be applied to collecting data from any large hanging iron structure with the help of a wireless sensor network. Analyzing these data is the subject of material science and civil structural construction technology and is outside the scope of this paper.

Feature Selection with Kohonen Self Organizing Classification Algorithm

In this paper, a one-dimensional Self Organizing Map (SOM) algorithm for performing feature selection is presented. The algorithm is based on a first classification of the input dataset on a similarity space. From this classification, a set of positive and negative features is computed for each class. This set of features is selected as the result of the procedure. The procedure is evaluated on an in-house dataset from a Knowledge Discovery from Text (KDT) application and on a set of publicly available datasets used in international feature selection competitions. These datasets come from KDT applications, drug discovery and other applications. The knowledge of the correct classification available for the training and validation datasets is used to optimize the parameters for positive and negative feature extraction. The process becomes feasible for large and sparse datasets, such as those obtained in KDT applications, by using both compression techniques to store the similarity matrix and speed-up techniques for the Kohonen algorithm that take advantage of the sparsity of the input matrix. These improvements make it feasible to apply the methodology to massive datasets by using the grid.

Quality of Service Evaluation using a Combination of Fuzzy C-Means and Regression Model

In this study, a network quality of service (QoS) evaluation system was proposed. The system used a combination of fuzzy C-means (FCM) and a regression model to analyse and assess the QoS in a simulated network. Network QoS parameters of multimedia applications were intelligently analysed by the FCM clustering algorithm. The QoS parameters for each FCM cluster centre were then input to a regression model in order to quantify the overall QoS. The proposed QoS evaluation system provided valuable information about the network's QoS patterns, and based on this information the network's overall QoS was effectively quantified.
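To make the clustering step concrete, the following is a minimal sketch of the standard fuzzy C-means iteration (alternating centre and membership updates). It is not the authors' system: the three QoS parameters (delay, jitter, loss), the synthetic values and the function name `fuzzy_c_means` are assumptions, and the regression stage is omitted because the abstract does not describe it.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy C-means; returns cluster centres and the membership matrix."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)                  # each row of memberships sums to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # membership-weighted means
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)                    # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)   # membership update
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

# Synthetic QoS samples: columns are assumed to be delay (ms), jitter (ms), loss (%)
rng = np.random.default_rng(1)
good = rng.normal(loc=[20.0, 2.0, 0.5], scale=0.5, size=(50, 3))
poor = rng.normal(loc=[150.0, 30.0, 5.0], scale=0.5, size=(50, 3))
X = np.vstack([good, poor])

centers, U = fuzzy_c_means(X, n_clusters=2)
print(np.round(centers[np.argsort(centers[:, 0])], 2))
```

The rows of `centers` are the per-cluster QoS parameter vectors; in a pipeline like the one described, these would be the inputs to the subsequent regression model.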

Using Data Mining for Learning and Clustering FCM

Fuzzy Cognitive Maps (FCMs) have been successfully applied in numerous domains to show the relations between essential components. Some FCMs contain many interrelated nodes, and more nodes mean more complex system behavior and analysis. In this paper, a novel learning method is used to construct FCMs based on historical data, and a new method that uses data mining and the DEMATEL method is defined to reduce the number of nodes. This method clusters the nodes of an FCM based on their cause and effect behaviors.

MIBiClus: Mutual Information based Biclustering Algorithm

Most biclustering/projected clustering algorithms are based on either the Euclidean distance or the correlation coefficient, which capture only linear relationships. However, in many applications, such as gene expression data and word-document data, nonlinear relationships may exist between the objects. Mutual information between two variables provides a more general criterion for investigating dependencies amongst variables. In this paper, we improve upon our previous algorithm that uses mutual information for biclustering, in terms of computation time and also the type of clusters identified. The algorithm is able to find biclusters with mixed relationships and is faster than the previous one. To the best of our knowledge, none of the other existing algorithms for biclustering has used mutual information as a similarity measure. We present experimental results on synthetic data as well as on yeast expression data. Biclusters on the yeast data were found to be biologically and statistically significant using GO Tool Box and FuncAssociate.
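The point that mutual information captures dependencies that the correlation coefficient misses can be illustrated with a small example. The sketch below uses a simple histogram-based estimate of mutual information; the estimator, the bin count and the function name `mutual_information` are illustrative assumptions and are not taken from the MIBiClus algorithm itself.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram-based estimate of the mutual information (in bits) between x and y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                    # joint distribution over bins
    px = pxy.sum(axis=1, keepdims=True)          # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)          # marginal of y, shape (1, bins)
    nz = pxy > 0                                 # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

# A purely nonlinear relationship: y is fully determined by x, yet uncorrelated with it
x = np.linspace(-1.0, 1.0, 2001)
y = x ** 2

corr = np.corrcoef(x, y)[0, 1]
mi = mutual_information(x, y)
print(f"correlation = {corr:.3f}, mutual information = {mi:.3f} bits")
```

Here the correlation is essentially zero while the estimated mutual information is clearly positive, which is the kind of dependency a correlation-based biclustering criterion would overlook.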

Some Computational Results on MPI Parallel Implementation of Dense Simplex Method

There are two major variants of the Simplex Algorithm: the revised method and the standard, or tableau, method. Today, all serious implementations are based on the revised method because it is more efficient for sparse linear programming problems. However, there are a number of applications that lead to dense linear problems, so our aim in this paper is to present some computational results on a parallel implementation of the dense Simplex Method. Our implementation runs on an SMP cluster and uses the C programming language and the Message Passing Interface (MPI). Preliminary computational results on randomly generated dense linear programs support our approach.

A Distributed Weighted Cluster Based Routing Protocol for Manets

Mobile ad-hoc networks (MANETs) are a form of wireless network that does not require a base station to provide network connectivity. Mobile ad-hoc networks have many characteristics that distinguish them from other wireless networks and make routing in such networks a challenging task. Cluster-based routing is one of the routing schemes for MANETs, in which clusters of mobile nodes are formed, each with its own clusterhead that is responsible for routing among clusters. In this paper, we propose and implement a distributed weighted clustering algorithm for MANETs. This approach is based on a combined weight metric that takes into account several system parameters, such as the node degree, transmission range, energy and mobility of the nodes. We have evaluated the performance of the proposed scheme through simulation in various network situations. Simulation results show that the proposed scheme outperforms the original distributed weighted clustering algorithm (DWCA).
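The abstract does not give the form of the combined weight metric. As a rough sketch of the general idea, the code below assumes a weighted sum of four per-node costs (lower is better) and a simple greedy election in which the lowest-weight uncovered node becomes a clusterhead and absorbs its uncovered neighbours. The field names, the coefficients and the centralized election loop are hypothetical simplifications; the scheme described in the paper is distributed.

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int
    degree_diff: float   # |degree - ideal degree| (assumed cost term)
    range_cost: float    # cost derived from transmission range (assumed)
    mobility: float      # average speed (assumed cost term)
    energy_used: float   # consumed battery energy (assumed cost term)

def combined_weight(n, w=(0.4, 0.2, 0.2, 0.2)):
    """Hypothetical combined weight: a weighted sum of the four cost terms."""
    return (w[0] * n.degree_diff + w[1] * n.range_cost
            + w[2] * n.mobility + w[3] * n.energy_used)

def elect_clusterheads(nodes, neighbours):
    """Greedy election: the lowest-weight uncovered node heads a new cluster."""
    uncovered = {n.node_id for n in nodes}
    clusters = {}
    for n in sorted(nodes, key=combined_weight):
        if n.node_id not in uncovered:
            continue                                   # already joined a cluster
        members = ({n.node_id} | set(neighbours[n.node_id])) & uncovered
        clusters[n.node_id] = members
        uncovered -= members
    return clusters

# Five nodes in a line topology: 0 - 1 - 2 - 3 - 4
nodes = [
    Node(0, 2.0, 1.0, 1.0, 1.0),
    Node(1, 0.0, 0.5, 0.5, 0.5),
    Node(2, 1.0, 1.0, 1.0, 1.0),
    Node(3, 0.0, 1.0, 0.5, 0.5),
    Node(4, 2.0, 1.0, 1.0, 1.0),
]
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
clusters = elect_clusterheads(nodes, neighbours)
print(clusters)
```

In this toy topology nodes 1 and 3 have the lowest weights and become clusterheads, with node 2 joining the cluster of node 1 because that head is elected first.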

A Heuristics Approach for Fast Detecting Suspicious Money Laundering Cases in an Investment Bank

Today, money laundering (ML) poses a serious threat not only to financial institutions but also to the nation. This criminal activity is becoming more and more sophisticated and seems to have moved from the cliché of drug trafficking to financing terrorism and, not least, personal gain. Most international financial institutions have been implementing anti-money laundering (AML) solutions to fight investment fraud. However, traditional investigative techniques consume numerous man-hours. Recently, data mining approaches have been developed and are considered well suited to detecting ML activities. Within the scope of a collaboration project to develop a new solution for the AML units of an international investment bank, we proposed a data mining-based solution for AML. In this paper, we present a heuristic approach to improve the performance of this solution. We also show some preliminary results obtained with this method when analysing transaction datasets.

A New Face Detection Technique using 2D DCT and Self Organizing Feature Map

This paper presents a new technique for detection of human faces within color images. The approach relies on image segmentation based on skin color, features extracted from the two-dimensional discrete cosine transform (DCT), and self-organizing maps (SOM). After candidate skin regions are extracted, feature vectors are constructed using DCT coefficients computed from those regions. A supervised SOM training session is used to cluster feature vectors into groups, and to assign “face” or “non-face” labels to those clusters. Evaluation was performed using a new image database of 286 images, containing 1027 faces. After training, our detection technique achieved a detection rate of 77.94% during subsequent tests, with a false positive rate of 5.14%. To our knowledge, the proposed technique is the first to combine DCT-based feature extraction with a SOM for detecting human faces within color images. It is also one of a few attempts to combine a feature-invariant approach, such as color-based skin segmentation, together with appearance-based face detection. The main advantage of the new technique is its low computational requirements, in terms of both processing speed and memory utilization.

Color Image Segmentation Using Competitive and Cooperative Learning Approach

Color image segmentation can be considered as a clustering procedure in feature space. k-means and its adaptive version, the competitive learning approach, are powerful tools for data clustering. However, k-means and competitive learning suffer from several drawbacks, such as the dead-unit problem and the need to pre-specify the number of clusters. In this paper, we explore the use of the competitive and cooperative learning (CCL) approach to perform color image segmentation. In the CCL approach, seed points not only compete with each other, but the winner also dynamically selects several of its nearest competitors to form a cooperative team that adapts to the input together. As a result, the approach can automatically select the correct number of clusters and avoid the dead-unit problem. Experimental results show that CCL can obtain better segmentation results.

A Novel Modified Adaptive Fuzzy Inference Engine and Its Application to Pattern Classification

The neuro-fuzzy hybridization scheme has been of research interest in pattern classification over the past decade. The present paper proposes a novel Modified Adaptive Fuzzy Inference Engine (MAFIE) for pattern classification. A modified Apriori algorithm technique is utilized to reduce the decision rules to a minimal set based on input-output data sets. A TSK-type fuzzy inference system is constructed by the automatic generation of membership functions and rules by fuzzy c-means clustering and the Apriori algorithm technique, respectively. The generated adaptive fuzzy inference engine is adjusted by a least-squares fit and a conjugate gradient descent algorithm towards better performance with a minimal set of rules. The proposed MAFIE is able to reduce the number of rules, which increases exponentially when more input variables are involved. The performance of the proposed MAFIE is compared with other existing pattern classification schemes using Fisher's Iris and Wisconsin breast cancer data sets and is shown to be very competitive.