FCNN-MR: A Parallel Instance Selection Method Based on Fast Condensed Nearest Neighbor Rule

Instance selection (IS) techniques reduce the size of a data set to improve the performance of data mining methods. Recently, to process very large data sets, several methods have been proposed that divide the training set into disjoint subsets and apply IS algorithms independently to each subset. In this paper, we analyze the limitations of these methods and give our view of how to divide and conquer in the IS procedure. Then, based on the fast condensed nearest neighbor (FCNN) rule, we propose an instance selection method for large data sets built on the MapReduce framework. Besides preserving prediction accuracy and reduction rate, it has two desirable properties: first, it reduces the workload on the aggregation node; second, and most important, it produces the same result as the sequential version, which other parallel methods cannot achieve. We evaluate the performance of FCNN-MR on one small data set and two large data sets. The experimental results show that it is effective and practical.
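
For orientation, the sequential FCNN rule that the abstract builds on can be sketched as follows. This is a minimal single-machine illustration of the rule as it is commonly described (seed with the point nearest each class centroid, then repeatedly add, for each selected point, the nearest misclassified point in its Voronoi cell). It is not the MapReduce method of the paper, and the function name `fcnn_select` and the use of Euclidean distance are assumptions made for the example.

```python
import numpy as np

def fcnn_select(X, y):
    """Return sorted indices of a subset that classifies the training set
    correctly under the 1-NN rule (sequential FCNN-style sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    selected = []

    # Seed: for each class, the training point closest to the class centroid.
    delta = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        centroid = X[members].mean(axis=0)
        dists = np.linalg.norm(X[members] - centroid, axis=1)
        delta.append(int(members[np.argmin(dists)]))

    while delta:
        selected.extend(delta)
        sel = np.array(selected)
        # Distance from every training point to every selected point.
        d = np.linalg.norm(X[:, None, :] - X[sel][None, :, :], axis=2)
        nearest = np.argmin(d, axis=1)            # position in sel of nearest selected point
        nearest_dist = d[np.arange(len(X)), nearest]
        wrong = y != y[sel[nearest]]              # misclassified by the current subset
        wrong[sel] = False                        # never re-add an already selected point
        # For each selected point, pick the closest misclassified point in its cell.
        delta = []
        for pos in range(len(sel)):
            enemies = np.flatnonzero(wrong & (nearest == pos))
            if enemies.size:
                delta.append(int(enemies[np.argmin(nearest_dist[enemies])]))
    return sorted(selected)
```

Because each round depends only on the current subset and on the nearest selected point of every training instance, the per-instance work is a natural candidate for distribution, which is presumably what makes an exact parallel version possible; the details of how FCNN-MR does this are not given in the abstract.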

Parallel Direct Integration Variable Step Block Method for Solving Large System of Higher Order Ordinary Differential Equations

The aim of this paper is to investigate the performance of a two-point block method, designed for two processors, for directly solving large nonstiff systems of higher-order ordinary differential equations (ODEs). The method computes the numerical solution at two points simultaneously, producing two new equally spaced solution values within a block, and it is possible to assign the computational tasks at each time step to a single processor. The algorithm was implemented in C, and the parallel computation was run in a shared-memory parallel environment. Numerical results compare the running time of the parallel method with its sequential timing. For large problems, the parallel implementation achieved a speed-up of 1.95 and 98% efficiency on two processors.
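
The reported figures are consistent under the standard definitions, where speed-up is sequential time divided by parallel time and efficiency is speed-up divided by the number of processors. The short sketch below checks the arithmetic; the function names are illustrative, and no timings from the paper are assumed.

```python
def speedup(sequential_time, parallel_time):
    """Speed-up: how many times faster the parallel run is."""
    return sequential_time / parallel_time

def efficiency(speedup_value, processors):
    """Efficiency: fraction of ideal (linear) speed-up achieved."""
    return speedup_value / processors

# Speed-up 1.95 on two processors gives about 0.975, reported as 98%.
eff = efficiency(1.95, 2)
```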

Fast Database Indexing for Large Protein Sequence Collections Using Parallel N-Gram Transformation Algorithm

With the rapid development of the life sciences and the flood of genomic information, the need for faster and more scalable search methods has become urgent. One approach that has been investigated is indexing. Indexing methods fall into three categories: length-based index algorithms, transformation-based algorithms, and mixed-technique algorithms. In this research, we focus on the transformation-based methods. We embedded the N-gram method into the transformation-based method to build an inverted index table. We then applied parallel methods to speed up index building and to reduce the overall retrieval time when querying the genomic database. Our experiments show that the N-gram transformation algorithm is an economical solution that saves both time and space. The results show that the index is smaller than the dataset when the N-gram size is 5 or 6. The results of the parallel N-gram transformation algorithm indicate that using parallel programming with large datasets is promising and can be improved further.
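
To make the idea of an N-gram inverted index concrete, here is a minimal sketch that maps each character N-gram to the set of sequences containing it and uses the table to shortlist candidates for a query. It is a plain inverted index, not the transformation described in the paper (which the abstract does not specify); the function names, the sequence identifiers, and the toy sequences are hypothetical.

```python
from collections import defaultdict

def ngrams(sequence, n):
    """All overlapping character n-grams of a sequence, in order."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

def build_index(sequences, n):
    """Inverted index: n-gram -> set of ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for gram in ngrams(seq, n):
            index[gram].add(seq_id)
    return index

def candidates(index, query, n):
    """Ids of sequences that contain every n-gram of the query.
    This is a superset of the true matches and still needs verification."""
    grams = ngrams(query, n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result

# Toy data, invented for illustration only.
proteins = {"p1": "MKVLAAGIVG", "p2": "MKVLSAGIVG", "p3": "AAGIVGMKVL"}
index = build_index(proteins, 3)
hits = candidates(index, "LAAGI", 3)
```

Since each sequence contributes its N-grams independently, one plausible way to parallelize index building is to index disjoint shards of the collection separately and merge the resulting tables; whether the paper does this is not stated in the abstract.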