Efficient Implementation of Serial and Parallel Support Vector Machine Training with a Multi-Parameter Kernel for Large-Scale Data Mining

This work deals with aspects of support vector learning for large-scale data mining tasks. Based on a decomposition algorithm that can be run in serial and parallel mode we introduce a data transformation that allows for the usage of an expensive generalized kernel without additional costs. In order to speed up the decomposition algorithm we analyze the problem of working set selection for large data sets and analyze the influence of the working set sizes onto the scalability of the parallel decomposition scheme. Our modifications and settings lead to improvement of support vector learning performance and thus allow using extensive parameter search methods to optimize classification accuracy.

A Genetic Algorithm for Clustering on Image Data

Clustering is the process of subdividing an input data set into a desired number of subgroups so that members of the same subgroup are similar and members of different subgroups have diverse properties. Many heuristic algorithms have been applied to the clustering problem, which is known to be NP Hard. Genetic algorithms have been used in a wide variety of fields to perform clustering, however, the technique normally has a long running time in terms of input set size. This paper proposes an efficient genetic algorithm for clustering on very large data sets, especially on image data sets. The genetic algorithm uses the most time efficient techniques along with preprocessing of the input data set. We test our algorithm on both artificial and real image data sets, both of which are of large size. The experimental results show that our algorithm outperforms the k-means algorithm in terms of running time as well as the quality of the clustering.

K-Means for Spherical Clusters with Large Variance in Sizes

Data clustering is an important data exploration technique with many applications in data mining. The k-means algorithm is well known for its efficiency in clustering large data sets. However, this algorithm is suitable for spherical shaped clusters of similar sizes and densities. The quality of the resulting clusters decreases when the data set contains spherical shaped with large variance in sizes. In this paper, we introduce a competent procedure to overcome this problem. The proposed method is based on shifting the center of the large cluster toward the small cluster, and recomputing the membership of small cluster points, the experimental results reveal that the proposed algorithm produces satisfactory results.

Addressing Scalability Issues of Named Entity Recognition Using Multi-Class Support Vector Machines

This paper explores the scalability issues associated with solving the Named Entity Recognition (NER) problem using Support Vector Machines (SVM) and high-dimensional features. The performance results of a set of experiments conducted using binary and multi-class SVM with increasing training data sizes are examined. The NER domain chosen for these experiments is the biomedical publications domain, especially selected due to its importance and inherent challenges. A simple machine learning approach is used that eliminates prior language knowledge such as part-of-speech or noun phrase tagging thereby allowing for its applicability across languages. No domain-specific knowledge is included. The accuracy measures achieved are comparable to those obtained using more complex approaches, which constitutes a motivation to investigate ways to improve the scalability of multiclass SVM in order to make the solution more practical and useable. Improving training time of multi-class SVM would make support vector machines a more viable and practical machine learning solution for real-world problems with large datasets. An initial prototype results in great improvement of the training time at the expense of memory requirements.

Class Outliers Mining: Distance-Based Approach

In large datasets, identifying exceptional or rare cases with respect to a group of similar cases is considered very significant problem. The traditional problem (Outlier Mining) is to find exception or rare cases in a dataset irrespective of the class label of these cases, they are considered rare events with respect to the whole dataset. In this research, we pose the problem that is Class Outliers Mining and a method to find out those outliers. The general definition of this problem is “given a set of observations with class labels, find those that arouse suspicions, taking into account the class labels". We introduce a novel definition of Outlier that is Class Outlier, and propose the Class Outlier Factor (COF) which measures the degree of being a Class Outlier for a data object. Our work includes a proposal of a new algorithm towards mining of the Class Outliers, presenting experimental results applied on various domains of real world datasets and finally a comparison study with other related methods is performed.

Generating Frequent Patterns through Intersection between Transactions

The problem of frequent itemset mining is considered in this paper. One new technique proposed to generate frequent patterns in large databases without time-consuming candidate generation. This technique is based on focusing on transaction instead of concentrating on itemset. This algorithm based on take intersection between one transaction and others transaction and the maximum shared items between transactions computed instead of creating itemset and computing their frequency. With applying real life transactions and some consumption is taken from real life data, the significant efficiency acquire from databases in generation association rules mining.

On the Efficient Implementation of a Serial and Parallel Decomposition Algorithm for Fast Support Vector Machine Training Including a Multi-Parameter Kernel

This work deals with aspects of support vector machine learning for large-scale data mining tasks. Based on a decomposition algorithm for support vector machine training that can be run in serial as well as shared memory parallel mode we introduce a transformation of the training data that allows for the usage of an expensive generalized kernel without additional costs. We present experiments for the Gaussian kernel, but usage of other kernel functions is possible, too. In order to further speed up the decomposition algorithm we analyze the critical problem of working set selection for large training data sets. In addition, we analyze the influence of the working set sizes onto the scalability of the parallel decomposition scheme. Our tests and conclusions led to several modifications of the algorithm and the improvement of overall support vector machine learning performance. Our method allows for using extensive parameter search methods to optimize classification accuracy.

A Novel In-Place Sorting Algorithm with O(n log z) Comparisons and O(n log z) Moves

In-place sorting algorithms play an important role in many fields such as very large database systems, data warehouses, data mining, etc. Such algorithms maximize the size of data that can be processed in main memory without input/output operations. In this paper, a novel in-place sorting algorithm is presented. The algorithm comprises two phases; rearranging the input unsorted array in place, resulting segments that are ordered relative to each other but whose elements are yet to be sorted. The first phase requires linear time, while, in the second phase, elements of each segment are sorted inplace in the order of z log (z), where z is the size of the segment, and O(1) auxiliary storage. The algorithm performs, in the worst case, for an array of size n, an O(n log z) element comparisons and O(n log z) element moves. Further, no auxiliary arithmetic operations with indices are required. Besides these theoretical achievements of this algorithm, it is of practical interest, because of its simplicity. Experimental results also show that it outperforms other in-place sorting algorithms. Finally, the analysis of time and space complexity, and required number of moves are presented, along with the auxiliary storage requirements of the proposed algorithm.

Genetic Programming Approach to Hierarchical Production Rule Discovery

Automated discovery of hierarchical structures in large data sets has been an active research area in the recent past. This paper focuses on the issue of mining generalized rules with crisp hierarchical structure using Genetic Programming (GP) approach to knowledge discovery. The post-processing scheme presented in this work uses flat rules as initial individuals of GP and discovers hierarchical structure. Suitable genetic operators are proposed for the suggested encoding. Based on the Subsumption Matrix(SM), an appropriate fitness function is suggested. Finally, Hierarchical Production Rules (HPRs) are generated from the discovered hierarchy. Experimental results are presented to demonstrate the performance of the proposed algorithm.

An Efficient Graph Query Algorithm Based on Important Vertices and Decision Features

Graph has become increasingly important in modeling complicated structures and schemaless data such as proteins, chemical compounds, and XML documents. Given a graph query, it is desirable to retrieve graphs quickly from a large database via graph-based indices. Different from the existing methods, our approach, called VFM (Vertex to Frequent Feature Mapping), makes use of vertices and decision features as the basic indexing feature. VFM constructs two mappings between vertices and frequent features to answer graph queries. The VFM approach not only provides an elegant solution to the graph indexing problem, but also demonstrates how database indexing and query processing can benefit from data mining, especially frequent pattern mining. The results show that the proposed method not only avoids the enumeration method of getting subgraphs of query graph, but also effectively reduces the subgraph isomorphism tests between the query graph and graphs in candidate answer set in verification stage.

Computational Model for Predicting Effective siRNA Sequences Using Whole Stacking Energy (% G) for Gene Silencing

The small interfering RNA (siRNA) alters the regulatory role of mRNA during gene expression by translational inhibition. Recent studies show that upregulation of mRNA because serious diseases like cancer. So designing effective siRNA with good knockdown effects plays an important role in gene silencing. Various siRNA design tools had been developed earlier. In this work, we are trying to analyze the existing good scoring second generation siRNA predicting tools and to optimize the efficiency of siRNA prediction by designing a computational model using Artificial Neural Network and whole stacking energy (%G), which may help in gene silencing and drug design in cancer therapy. Our model is trained and tested against a large data set of siRNA sequences. Validation of our results is done by finding correlation coefficient of experimental versus observed inhibition efficacy of siRNA. We achieved a correlation coefficient of 0.727 in our previous computational model and we could improve the correlation coefficient up to 0.753 when the threshold of whole tacking energy is greater than or equal to -32.5 kcal/mol.

Clustering in WSN Based on Minimum Spanning Tree Using Divide and Conquer Approach

Due to heavy energy constraints in WSNs clustering is an efficient way to manage the energy in sensors. There are many methods already proposed in the area of clustering and research is still going on to make clustering more energy efficient. In our paper we are proposing a minimum spanning tree based clustering using divide and conquer approach. The MST based clustering was first proposed in 1970’s for large databases. Here we are taking divide and conquer approach and implementing it for wireless sensor networks with the constraints attached to the sensor networks. This Divide and conquer approach is implemented in a way that we don’t have to construct the whole MST before clustering but we just find the edge which will be the part of the MST to a corresponding graph and divide the graph in clusters there itself if that edge from the graph can be removed judging on certain constraints and hence saving lot of computation.

Parallel Algorithm for Numerical Solution of Three-Dimensional Poisson Equation

In this paper developed and realized absolutely new algorithm for solving three-dimensional Poisson equation. This equation used in research of turbulent mixing, computational fluid dynamics, atmospheric front, and ocean flows and so on. Moreover in the view of rising productivity of difficult calculation there was applied the most up-to-date and the most effective parallel programming technology - MPI in combination with OpenMP direction, that allows to realize problems with very large data content. Resulted products can be used in solving of important applications and fundamental problems in mathematics and physics.

Multiclass Support Vector Machines for Environmental Sounds Classification Using log-Gabor Filters

In this paper we propose a robust environmental sound classification approach, based on spectrograms features driven from log-Gabor filters. This approach includes two methods. In the first methods, the spectrograms are passed through an appropriate log-Gabor filter banks and the outputs are averaged and underwent an optimal feature selection procedure based on a mutual information criteria. The second method uses the same steps but applied only to three patches extracted from each spectrogram. To investigate the accuracy of the proposed methods, we conduct experiments using a large database containing 10 environmental sound classes. The classification results based on Multiclass Support Vector Machines show that the second method is the most efficient with an average classification accuracy of 89.62 %.

Generating Concept Trees from Dynamic Self-organizing Map

Self-organizing map (SOM) provides both clustering and visualization capabilities in mining data. Dynamic self-organizing maps such as Growing Self-organizing Map (GSOM) has been developed to overcome the problem of fixed structure in SOM to enable better representation of the discovered patterns. However, in mining large datasets or historical data the hierarchical structure of the data is also useful to view the cluster formation at different levels of abstraction. In this paper, we present a technique to generate concept trees from the GSOM. The formation of tree from different spread factor values of GSOM is also investigated and the quality of the trees analyzed. The results show that concept trees can be generated from GSOM, thus, eliminating the need for re-clustering of the data from scratch to obtain a hierarchical view of the data under study.

IMDC: An Image-Mapped Data Clustering Technique for Large Datasets

In this paper, we present a new algorithm for clustering data in large datasets using image processing approaches. First the dataset is mapped into a binary image plane. The synthesized image is then processed utilizing efficient image processing techniques to cluster the data in the dataset. Henceforth, the algorithm avoids exhaustive search to identify clusters. The algorithm considers only a small set of the data that contains critical boundary information sufficient to identify contained clusters. Compared to available data clustering techniques, the proposed algorithm produces similar quality results and outperforms them in execution time and storage requirements.

A Hybrid Approach for Quantification of Novelty in Rule Discovery

Rule Discovery is an important technique for mining knowledge from large databases. Use of objective measures for discovering interesting rules lead to another data mining problem, although of reduced complexity. Data mining researchers have studied subjective measures of interestingness to reduce the volume of discovered rules to ultimately improve the overall efficiency of KDD process. In this paper we study novelty of the discovered rules as a subjective measure of interestingness. We propose a hybrid approach that uses objective and subjective measures to quantify novelty of the discovered rules in terms of their deviations from the known rules. We analyze the types of deviation that can arise between two rules and categorize the discovered rules according to the user specified threshold. We implement the proposed framework and experiment with some public datasets. The experimental results are quite promising.

Strategic Software Development: Productivity Comparisons of General Development Programs

Productivity has been one of the major concerns with the increasingly high cost of software development. Choosing the right development language with high productivity is one approach to reduce development costs. Working on the large database with 4106 projects ever developed, we found the factors significant to productivity. After the removal of the effects of other factors on productivity, we compare the productivity differences of the ten general development programs. The study supports the fact that fourth-generation languages are more productive than thirdgeneration languages.

Density Clustering Based On Radius of Data (DCBRD)

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases rises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, a density based clustering algorithm (DCBRD) is presented, relying on a knowledge acquired from the data by dividing the data space into overlapped regions. The proposed algorithm discovers arbitrary shaped clusters, requires no input parameters and uses the same definitions of DBSCAN algorithm. We performed an experimental evaluation of the effectiveness and efficiency of it, and compared this results with that of DBSCAN. The results of our experiments demonstrate that the proposed algorithm is significantly efficient in discovering clusters of arbitrary shape and size.

On Speeding Up Support Vector Machines: Proximity Graphs Versus Random Sampling for Pre-Selection Condensation

Support vector machines (SVMs) are considered to be the best machine learning algorithms for minimizing the predictive probability of misclassification. However, their drawback is that for large data sets the computation of the optimal decision boundary is a time consuming function of the size of the training set. Hence several methods have been proposed to speed up the SVM algorithm. Here three methods used to speed up the computation of the SVM classifiers are compared experimentally using a musical genre classification problem. The simplest method pre-selects a random sample of the data before the application of the SVM algorithm. Two additional methods use proximity graphs to pre-select data that are near the decision boundary. One uses k-Nearest Neighbor graphs and the other Relative Neighborhood Graphs to accomplish the task.