A Text Clustering System based on k-means Type Subspace Clustering and Ontology

This paper presents a text clustering system built on a k-means type subspace clustering algorithm for clustering large, high-dimensional, and sparse text data. The algorithm adds a new step to the k-means clustering process that automatically calculates the weights of keywords in each cluster, so that the important words of a cluster can be identified by their weight values. To aid understanding and interpretation of the clustering results, a few keywords that best represent the semantic topic are extracted from each cluster. Two methods are used to extract the representative words. The candidate words are first selected according to the weights calculated by our new algorithm. The candidates are then fed to WordNet to identify the set of nouns and to consolidate synonyms and hyponyms. Experimental results show that the clustering algorithm is superior to other subspace clustering algorithms, such as PROCLUS and HARP, and to k-means type algorithms, e.g., Bisecting-KMeans. Furthermore, the word extraction method is effective in selecting words to represent the topics of the clusters.
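The extra weighting step described above can be illustrated with a minimal sketch. The abstract does not give the weight formula, so the exponential (entropy-style) update used here, the parameter `gamma`, and the function name `weighted_kmeans` are assumptions made purely for illustration, not the authors' exact algorithm. The idea shown is that dimensions (keywords) with low within-cluster dispersion receive high weights.

```python
import math


def weighted_kmeans(points, centers, gamma=1.0, iters=10):
    """k-means with one extra step: per-cluster dimension (keyword) weights.

    The weight update is an ASSUMED entropy-style rule, not taken from the paper.
    """
    k, dim = len(centers), len(points[0])
    centers = [list(c) for c in centers]
    weights = [[1.0 / dim] * dim for _ in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Step 1: assign each point to the cluster with the smallest weighted distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda l: sum(
                    weights[l][j] * (p[j] - centers[l][j]) ** 2 for j in range(dim)
                ),
            )
        # Step 2: recompute each center as the mean of its members.
        for l in range(k):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:
                centers[l] = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        # Step 3 (the added step): low dispersion in a dimension gives a high weight.
        for l in range(k):
            members = [p for p, lab in zip(points, labels) if lab == l]
            disp = [sum((p[j] - centers[l][j]) ** 2 for p in members) for j in range(dim)]
            base = min(disp)  # shift by the minimum for numerical stability
            expd = [math.exp(-(d - base) / gamma) for d in disp]
            total = sum(expd)
            weights[l] = [e / total for e in expd]
    return labels, centers, weights


points = [
    (0.0, 0.0, 0.0), (0.0, 0.1, 5.0), (0.0, -0.1, -5.0),     # tight in dims 0 and 1
    (10.0, 3.0, 0.1), (10.0, -3.0, 0.0), (10.0, 0.0, -0.1),  # tight in dims 0 and 2
]
labels, centers, weights = weighted_kmeans(points, [points[0], points[3]])
```

In this toy data each cluster is compact in a different subset of dimensions, so the two clusters end up with different weight vectors, which is what lets the weights be read as cluster-specific keyword importance.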

Accelerating Sparse Matrix Vector Multiplication on Many-Core GPUs

Many-core GPUs provide high computing ability and substantial bandwidth; however, optimizing irregular applications such as sparse matrix-vector multiplication (SpMV) on GPUs is a difficult but meaningful task. In this paper, we propose a novel method to improve the performance of SpMV on GPUs. A new storage format called HYB-R is proposed to exploit the GPU architecture more efficiently. When the HYB-R format is created, the COO portion of the matrix is partitioned recursively into an ELL portion and a COO portion, so that as many non-zeros as possible are stored in ELL format. Because the way the matrix is partitioned strongly affects the HYB-R kernel, we also tune the partitioning parameters for higher performance. Experimental results show that our method outperforms the fastest kernel (HYB) in NVIDIA's SpMV library, with speedups of up to 17%.
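The recursive partitioning described above can be sketched on the CPU as follows. This is only an illustration under stated assumptions: the abstract does not specify the data layout or how the ELL widths are chosen, so the per-level `rows` index list, the `widths` parameter list, and the function names `split_ell_coo`, `build_hyb_r`, and `spmv` are all hypothetical. The first split corresponds to an ordinary HYB split, and each further width re-splits the remaining COO part.

```python
def split_ell_coo(coo, width):
    """Split (row, col, val) triples into an ELL block of `width` and a COO remainder."""
    by_row = {}
    for r, c, v in coo:
        by_row.setdefault(r, []).append((c, v))
    ell = {"rows": [], "cols": [], "vals": []}
    rest = []
    for r in sorted(by_row):
        head, tail = by_row[r][:width], by_row[r][width:]
        pad = width - len(head)
        ell["rows"].append(r)
        ell["cols"].append([c for c, _ in head] + [-1] * pad)   # -1 marks padding
        ell["vals"].append([v for _, v in head] + [0.0] * pad)
        rest.extend((r, c, v) for c, v in tail)                 # overflow stays in COO
    return ell, rest


def build_hyb_r(coo, widths):
    """Repeatedly re-split the COO remainder; `widths` are assumed tuning parameters."""
    ells, rest = [], list(coo)
    for width in widths:
        if not rest or width <= 0:
            break
        ell, rest = split_ell_coo(rest, width)
        ells.append(ell)
    return ells, rest


def spmv(ells, coo, x, n_rows):
    """y = A x, accumulated over every ELL level and the final COO remainder."""
    y = [0.0] * n_rows
    for ell in ells:
        for r, cols, vals in zip(ell["rows"], ell["cols"], ell["vals"]):
            for c, v in zip(cols, vals):
                if c >= 0:
                    y[r] += v * x[c]
    for r, c, v in coo:
        y[r] += v * x[c]
    return y


coo = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (0, 3, 4.0), (1, 1, 5.0),
       (2, 0, 6.0), (2, 2, 7.0), (2, 3, 8.0), (3, 3, 9.0)]
ells, rest = build_hyb_r(coo, widths=[1, 1])
y = spmv(ells, rest, [1.0, 2.0, 3.0, 4.0], n_rows=4)
```

On a GPU each ELL level would map to a regular, coalesced kernel while the final COO remainder needs the slower irregular kernel, which is why moving more non-zeros into ELL levels can pay off.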

A Hybrid CamShift and l1-Minimization Video Tracking Algorithm

The Continuously Adaptive Mean-Shift (CamShift) algorithm, incorporating scene depth information, is combined with the l1-minimization sparse representation-based method to form a hybrid kernel and state space-based tracking algorithm. We combine the efficiency of the former with the robustness to occlusion of the latter. A simple interchange scheme transfers control between the algorithms based on drift and occlusion likelihood. It is quantified by the projection of target candidates onto a depth map of the 2D scene obtained with a low-cost stereo vision webcam. The results show improved tracking in terms of drift compared with each algorithm individually, in a challenging practical outdoor test case with multiple occlusions.

Sparse Frequencies Extracting from Partial Phase-Only Measurements

This paper considers the robust recovery of sparse frequencies from partial phase-only measurements. The proposed method reconstructs the sparse frequencies by making full use of the sparse distribution in the Fourier representation of the complex-valued time signal. Simulation experiments illustrate the proposed method's advantages over conventional methods in both noiseless and additive white Gaussian noise cases.

Performance Analysis of Learning Automata-Based Routing Algorithms in Sparse Graphs

A number of routing algorithms based on the learning automata technique have been proposed for communication networks. However, there has been little work on how variation in graph scarcity affects the performance of these algorithms. In this paper, a comprehensive study is conducted to investigate the performance of LASPA, the first learning automata-based solution to dynamic shortest path routing, across different graph structures with varying scarcities. The sensitivity of the algorithm's three main performance parameters, namely the average number of processed nodes, scanned edges and average time per update, to variation in graph scarcity is reported. Simulation results indicate that the LASPA algorithm adapts well to scarcity variation in the graph structure and performs much better than the existing dynamic and fixed algorithms on these performance criteria.

Multidimensional Data Mining by Means of Randomly Travelling Hyper-Ellipsoids

This study presents a new approach to automatic data clustering and classification problems in large and complex databases and, at the same time, derives specific types of explicit rules describing each cluster. The method works well in both sparse and dense multidimensional data spaces. The members of the data space can be of the same nature or represent different classes. A number of N-dimensional ellipsoids are used to enclose the data clouds. Owing to the geometry of an ellipsoid and its free rotation in space, the detection of clusters becomes very efficient. The method is based on genetic algorithms, which are used to optimize the location, orientation and geometric characteristics of the hyper-ellipsoids. The proposed approach can serve as a basis for the development of general knowledge systems for discovering hidden knowledge and unexpected patterns and rules in various large databases.

Issues in Spectral Source Separation Techniques for Plant-wide Oscillation Detection and Diagnosis

In the last few years, three multivariate spectral analysis techniques, namely Principal Component Analysis (PCA), Independent Component Analysis (ICA) and Non-negative Matrix Factorization (NMF), have emerged as effective tools for oscillation detection and isolation. While the first method is used to determine the number of oscillatory sources, the latter two are used to identify source signatures by formulating the detection problem as a source identification problem in the spectral domain. In this paper, we present a critical drawback of the underlying linear (mixing) model, which strongly limits the ability of the associated source separation methods to determine the number of sources and/or identify the physical source signatures. It is shown that the assumed mixing model is valid only if each unit of the process gives equal weighting (all-pass filter) to all oscillatory components in its inputs. This is in contrast to the fact that each unit, in general, acts as a filter with a non-uniform frequency response. Thus, the model can only facilitate correct identification of a source with a single frequency component, which is again unrealistic. To overcome this deficiency, an iterative post-processing algorithm that correctly identifies the physical source(s) is developed. An additional issue with the existing methods is that they lack a procedure to pre-screen non-oscillatory/noisy measurements, which obscure the identification of oscillatory sources. To address this, a pre-screening procedure based on the notion of a sparseness index is prescribed to eliminate the noisy and non-oscillatory measurements from the data set used for analysis.
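One way such a pre-screening step could look is sketched below. The abstract does not define its sparseness index, so this example assumes Hoyer's sparseness measure applied to the power spectrum; the function names `power_spectrum` and `sparseness` are hypothetical. Under this assumption, a measurement whose power is concentrated in a few frequency bins (oscillatory) scores near 1, while a flat spectrum scores near 0 and would be screened out.

```python
import cmath
import math


def power_spectrum(signal):
    """Power in DFT bins 1..n/2 (naive O(n^2) DFT; the DC bin is skipped)."""
    n = len(signal)
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def sparseness(values):
    """Hoyer's measure (an ASSUMED choice): 1 for a single spike, 0 for a flat vector."""
    n = len(values)
    l1 = sum(abs(v) for v in values)
    l2 = math.sqrt(sum(v * v for v in values))
    if l2 == 0.0:
        return 0.0
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)


n = 64
sine = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]  # one clean oscillation
impulse = [1.0] + [0.0] * (n - 1)                             # flat spectrum, no oscillation

s_sine = sparseness(power_spectrum(sine))
s_impulse = sparseness(power_spectrum(impulse))
```

A screening rule would then keep only measurements whose index exceeds a chosen threshold; the threshold itself would have to be tuned on plant data and is not specified here.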