Abstract: Breadth-First Search (BFS) is a core graph algorithm that is widely used for graph analysis. Because it appears in so many graph applications, improving BFS performance is essential. In this paper, we present a graph ordering method that reorders the graph nodes to achieve better data locality, thereby improving BFS performance. Our method is based on the observation that sibling relationships dominate the cache access pattern during BFS traversal. We therefore propose a frequency-based model to construct the graph order. First, we optimize the graph order according to the nodes’ visit frequency: nodes with high visit frequency are processed first. Second, we try to maximize the overlap of child nodes layer by layer. As this problem is proved to be NP-hard, we propose a heuristic method that greatly reduces the preprocessing overhead. We conduct extensive experiments on 16 real-world datasets. The results show that our method achieves performance comparable to the state-of-the-art methods while its graph ordering overhead is only about 1/15 of theirs.
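The idea of relabeling nodes by visit frequency can be illustrated with a minimal sketch. It is not the authors' ordering method: it assumes that a node's in-degree is a reasonable proxy for how often BFS touches it (BFS inspects a node once per incoming edge), and the function names `reorder_by_visit_frequency` and `bfs_levels` are hypothetical. The layer-by-layer child-overlap step is not modeled.

```python
from collections import deque


def reorder_by_visit_frequency(adj):
    """Relabel nodes so that frequently visited nodes receive the smallest IDs.

    Assumption (not from the paper): in-degree approximates visit frequency.
    Returns the relabeled adjacency lists and the old-to-new ID mapping.
    """
    n = len(adj)
    freq = [0] * n
    for u in range(n):
        for v in adj[u]:
            freq[v] += 1
    # sorted() is stable, so ties keep their original relative order.
    order = sorted(range(n), key=lambda v: -freq[v])
    new_id = [0] * n
    for rank, old in enumerate(order):
        new_id[old] = rank
    new_adj = [[] for _ in range(n)]
    for old in range(n):
        new_adj[new_id[old]] = sorted(new_id[v] for v in adj[old])
    return new_adj, new_id


def bfs_levels(adj, source):
    """Plain queue-based BFS; returns the level of each node (-1 if unreachable)."""
    level = [-1] * len(adj)
    level[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if level[v] == -1:
                level[v] = level[u] + 1
                queue.append(v)
    return level
```

Reordering changes only the labels, so BFS levels computed on the relabeled graph match the original ones once mapped through `new_id`.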
Abstract: Urban flooding resulting from a sudden release of
water due to dam-break or excessive rainfall is a serious
environmental hazard that causes loss of human life and large
economic losses. Anticipating floods before they occur could
minimize human and economic losses through the implementation
of appropriate protection, provision, and rescue plans. This work
reports on the numerical modelling of flash flood propagation
in urban areas after an excessive rainfall event or dam-break.
A two-dimensional (2D) depth-averaged shallow water model is
used with a refined unstructured grid of triangles for representing
the urban area topography. The 2D shallow water equations are
solved using a second-order well-balanced discontinuous Galerkin
scheme. A theoretical test case and three flood events are described
to demonstrate the potential benefits of the scheme: (i) wetting and
drying in a parabolic basin; (ii) flash flood over a physical model of
the urbanized Toce River valley in Italy; (iii) wave propagation on
the Reyran river valley in consequence of the Malpasset dam-break
in 1959 (France); and (iv) dam-break flood in October 1982 at the
town of Sumacarcel (Spain). The capability of the scheme is also
verified against alternative models. Computational results compare
well with recorded data and show that the scheme is at least as
efficient as comparable second-order finite volume schemes, with
a notable speedup due to parallelization.
Abstract: Although past, current and future trends suggest that multicore and cloud computing systems are increasingly ubiquitous, this class of parallel systems is underutilized in general and, in particular, barely used for research on parallel Delaunay triangulation for parallel surface modeling and generation. We evaluated the performance of physical and virtual (cloud) multicore machines at executing various algorithms that implement different parallelization strategies for the incremental insertion technique of the Delaunay triangulation algorithm. T-tests were run on the collected data to determine whether differences in various performance metrics (including execution time, speedup and efficiency) were statistically significant. Results show that the physical machine is approximately twice as fast as the virtual machine at executing the same programs for the various parallelization strategies. The results, which also characterize the scalability of the various parallelization strategies, show that some of the performance differences between these systems across different runs of the algorithms were statistically significant. A few pseudo-superlinear speedup results computed from the raw data are not true superlinear speedup values. These pseudo-superlinear values arise from one way of computing speedups; they disappear and give way to asymmetric speedups, which are the accurate kind of speedup for the experiments performed.
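For reference, the speedup and efficiency metrics named above are conventionally defined as in the following minimal sketch. The function names are hypothetical, and the abstract does not state which baseline time the authors used. The choice of `t_serial` (for example, the best sequential program versus the parallel program run on one worker) changes the resulting value, so an apparent superlinear result should be checked against the baseline.

```python
def speedup(t_serial, t_parallel):
    """Conventional speedup: baseline time divided by parallel time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, workers):
    """Speedup per worker; 1.0 means ideal linear scaling."""
    return speedup(t_serial, t_parallel) / workers


def is_superlinear(t_serial, t_parallel, workers):
    """True when the measured speedup exceeds the number of workers."""
    return speedup(t_serial, t_parallel) > workers
```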
Abstract: Fractal-based digital image compression is a specific
technique in the field of color imaging. The method is best suited to
images with irregular shapes, such as snow bobs, clouds, flames and tree
leaves, because parts of an image often
resemble other parts of the same image. This technique has
drawn much attention in recent years because of the very high
compression ratio that can be achieved. Hybrid schemes incorporating
fractal compression and speedup techniques have achieved high
compression ratios compared to pure fractal compression. Fractal
image compression is a lossy compression method that exploits the
self-similarity of an image. This technique provides a high
compression ratio, less encoding time and a fast decoding process. In
this paper, fractal compression with quadtree and DCT is proposed
to compress color images. The proposed hybrid scheme requires
four phases to compress a color image. First, the image is
segmented and the Discrete Cosine Transform is applied to each block of
the segmented image. Second, the block values are scanned in a
zigzag manner to prevent zero coefficients. Third, the resulting image
is partitioned as fractals by the quadtree approach. Fourth, the image is
compressed using the run-length encoding technique.
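Two of the four phases, the zigzag scan and run-length encoding, can be sketched briefly. This is a minimal sketch, not the authors' implementation: it assumes the JPEG-style zigzag order on a square block and a simple (value, count) run representation, and the names `zigzag` and `run_length_encode` are hypothetical. The DCT and quadtree phases are not shown.

```python
def zigzag(block):
    """Return the entries of a square block in JPEG-style zigzag order.

    Entries are grouped by anti-diagonal (i + j). Odd diagonals are walked
    with the row index increasing, even diagonals with the column index
    increasing.
    """
    n = len(block)
    coords = sorted(
        ((i, j) for i in range(n) for j in range(n)),
        key=lambda ij: (ij[0] + ij[1], ij[0] if (ij[0] + ij[1]) % 2 else ij[1]),
    )
    return [block[i][j] for i, j in coords]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, c) for v, c in runs]
```

The zigzag scan places the low-frequency coefficients first, so the zeros that typically appear among the high-frequency coefficients end up in long runs that run-length encoding stores compactly.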
Abstract: The scheduling and mapping of tasks on a set of
processors is considered a critical problem in parallel and
distributed computing systems. This paper deals with the problem of
dynamic scheduling on a special type of multiprocessor architecture
known as the Linear Crossed Cube (LCQ) network. The proposed
multiprocessor is a hybrid network that combines the features of
both linear architectures and cube-based architectures.
Two standard dynamic scheduling schemes, namely Minimum
Distance Scheduling (MDS) and Two Round Scheduling (TRS),
are implemented on the LCQ network. Parallel tasks are
mapped and the load imbalance is evaluated on different sets of
processors in the LCQ network. The simulation results are evaluated
and thoroughly analyzed to
obtain the best solution for the given network in terms of remaining load
imbalance and execution time. Other performance metrics
such as speedup and efficiency are also evaluated for the given
dynamic algorithms.
Abstract: This paper presents system-level enhancements to CMOS solid-state
nanopore techniques to speed up next-generation
molecular recording and high-throughput channels. The discussion
also considers the optimum number of base-pair (bp) measurements
through a channel, which plays an important role in enhancing potential read
accuracy. Effective power consumption estimation yielded a suitable
range of multi-channel configurations. A statistical nanopore bp extraction
model could contribute higher read accuracy with
longer read lengths (read length > 200). Nanopore ionic current
switching with a Time Multiplexing (TM) based multichannel readout
system contributed hardware savings.
Abstract: Medical images are an integral part of e-health care and e-diagnosis systems. Medical image watermarking is widely used to protect patients’ information from malicious alteration and manipulation. Watermarked medical images are transmitted over the internet among patients, primary physicians and referred physicians. The images are highly prone to corruption in the wireless transmission medium due to noise, deflection and refraction. Distortion in the received images leads to faulty watermark detection and inappropriate disease diagnosis. To address this issue, this paper adds an error correction code (ECC), the (8, 4) Hamming code, to an existing watermarking system. In addition, we implement the highly complex ECC on a graphics processing unit (GPU) to accelerate it and support real-time requirements. Experimental results show that the GPU achieves considerable speedup over the sequential CPU implementation while maintaining 100% ECC efficiency.
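A minimal sequential sketch of an (8, 4) extended Hamming code follows. It assumes the common construction: a Hamming (7, 4) codeword in the bit order p1, p2, d1, p3, d2, d3, d4, followed by one overall parity bit. The abstract does not specify the authors' bit layout or GPU kernel, so the function names and ordering here are illustrative only.

```python
def hamming84_encode(data):
    """Encode 4 data bits into 8 bits: Hamming(7,4) plus an overall parity bit."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    word = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for bit in word:
        p0 ^= bit
    return word + [p0]


def hamming84_decode(code):
    """Return (data_bits, status), where status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of a single error
    overall = 0
    for bit in c:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: assume one error and repair it.
        index = syndrome - 1 if syndrome else 7  # syndrome 0 means the parity bit itself
        c[index] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits were flipped.
        status = "uncorrectable"
    return [c[2], c[4], c[5], c[6]], status
```

Each 8-bit codeword is decoded independently of the others, which is the property that makes this kind of code a natural fit for data-parallel hardware.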
Abstract: Accurate modeling of high speed RLC interconnects
has become a necessity to address signal integrity issues in current
VLSI design. To accurately model a dispersive system of interconnects
at higher frequencies, a full-wave analysis is required.
However, conventional circuit simulation of interconnects with
full-wave models is extremely CPU-intensive. We present an algorithm
for reducing large VLSI circuits to much smaller ones with similar
input-output behavior. A key feature of our method, called Frequency
Shift Technique, is that it is capable of reducing linear time-varying
systems. This enables it to capture frequency-translation and sampling
behavior, important in communication subsystems such as mixers,
RF components and switched-capacitor filters. Reduction is obtained
by projecting the original system described by linear differential
equations into a lower dimension. Experiments have been carried out
using Cadence Design Simulator, which indicate that the proposed
technique achieves a greater percentage reduction with less CPU time than the
other model order reduction techniques in the literature. We
also present applications to RF circuit subsystems, obtaining size
reductions and evaluation speedups of orders of magnitude with
insignificant loss of accuracy.
Abstract: Scale Invariant Feature Transform (SIFT) has been
widely applied, but extracting SIFT features is complicated and
time-consuming. In this paper, to meet the demands of real-time
applications, SIFT is parallelized and optimized on a cluster system;
the resulting implementation is named pSIFT. Redundant storage and
communication are used for boundary data to improve performance, and before
the feature descriptors are represented, data reallocation is adopted to
maintain load balance in pSIFT. Experimental results show that pSIFT
achieves good speedup and scalability.
Abstract: A Simultaneous Multithreading (SMT) processor is
capable of executing instructions from multiple threads in the same
cycle. SMT was introduced as a powerful enhancement to the
superscalar architecture to increase processor throughput.
It permits multiple
instructions from multiple independent applications or threads to
compete for limited resources each cycle. Because the fetch unit has been
identified as one of the major bottlenecks of the SMT architecture, several
fetch schemes have been proposed in prior work to enhance fetching
efficiency and overall performance.
In this paper, we propose a novel fetch policy called queue situation
identifier (QSI), which counts certain kinds of long-latency instructions of
each thread each cycle and then selects which threads to fetch in the
next cycle. Simulation results show that, in the best case, our fetch policy
can achieve a 30% speedup and can also reduce the level-1 data cache
miss rate.
Abstract: Application-Specific Instruction (ASI) set Processors
(ASIPs) have become an important design choice for embedded
systems due to their runtime flexibility, which cannot be provided by
custom ASIC solutions. One major bottleneck in maximizing ASIP
performance is the limited data bandwidth between the
General Purpose Register File (GPRF) and ASIs. This paper presents
Implicit Registers (IRs) to provide the desired data bandwidth.
An ASI input/output model is proposed to formulate the overhead of
the additional data transfers between the GPRF and IRs; based on it,
an IR allocation algorithm is used to achieve better performance
by minimizing the number of extra data transfer instructions. The
experimental results show up to a 3.33x speedup compared to the
results without IRs.
Abstract: Due to new distributed database applications such as
huge deductive database systems, search complexity is constantly
increasing and we need better algorithms to speed up traditional
relational database queries. An optimal dynamic programming
method for such high-dimensional queries has the major disadvantage of
exponential complexity, and thus we are interested in semi-optimal but
faster approaches. In this work we present a multi-agent based
mechanism to meet this demand and compare the results with
some commonly used query optimization algorithms.
Abstract: A highly optimized implementation of binary mixture
diffusion with no initial bulk velocity on graphics processors is
presented. The lattice Boltzmann model is employed for simulating
the binary diffusion of oxygen and nitrogen into each other with
different initial concentration distributions. Simulations have been
performed using the latest proposed lattice Boltzmann model that
satisfies both the indifferentiability principle and the H-theorem for
multi-component gas mixtures. Contemporary numerical
optimization techniques such as memory alignment and increasing
the multiprocessor occupancy are exploited along with some novel
optimization strategies to enhance the computational performance on
graphics processors using the C for CUDA programming language.
Speedup of more than two orders of magnitude over single-core
processors is achieved on a variety of Graphical Processing Unit
(GPU) devices ranging from conventional graphics cards to
advanced, high-end GPUs, while the numerical results are in
excellent agreement with the available analytical and numerical data
in the literature.
Abstract: In a virtual organization, a Knowledge Discovery (KD)
service contains distributed data resources and computing grid nodes.
A computational grid is integrated with a data grid to form a Knowledge
Grid, which implements the Apriori algorithm for mining association
rules on a grid network. This paper describes the development of a parallel
and distributed version of the Apriori algorithm on the Globus Toolkit using
the Message Passing Interface extended with Grid Services (MPICH-G2).
The Knowledge Grid is created on top of the data and
computational grids to support decision making in real-time
applications. In this paper, the case study describes the design and
implementation of local and global mining of frequent itemsets. The
experiments were conducted on different configurations of the grid
network, and computation time was recorded for each operation. We
analyzed our results across the various grid configurations; they show that
the speedup in computation time is almost superlinear.
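The split between local and global mining can be sketched in a few lines. This is a single-process illustration under stated assumptions, not the authors' Globus or MPICH-G2 implementation: each inner list of `partitions` stands in for the data held by one grid node, local counts are summed to obtain global support, and the names `local_counts` and `apriori` are hypothetical.

```python
from itertools import combinations


def local_counts(transactions, candidates):
    """Count, on one node's data, how many transactions contain each candidate."""
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}


def apriori(partitions, min_support):
    """Return {itemset: global support count} for all frequent itemsets.

    `partitions` is a list of transaction lists (one per simulated node);
    every transaction is a set of items.
    """
    items = sorted({item for part in partitions for t in part for item in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Local mining on each partition, then a global reduction of the counts.
        totals = {c: 0 for c in candidates}
        for part in partitions:
            for c, n in local_counts(part, candidates).items():
                totals[c] += n
        level = {c: n for c, n in totals.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Candidate generation: keep k-itemsets whose (k-1)-subsets are all frequent.
        pool = sorted({i for c in level for i in c})
        candidates = [
            frozenset(combo)
            for combo in combinations(pool, k)
            if all(frozenset(sub) in level for sub in combinations(combo, k - 1))
        ]
    return frequent
```

In a distributed setting the inner loop over `partitions` would run on separate nodes and only the count dictionaries would be exchanged, which keeps communication small relative to the scan of the transactions.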
Abstract: The study of proteomics has reached unexpected levels of
interest as a direct consequence of its discovered influence on some
complex biological phenomena, including serious diseases such as
cancer. This paper presents the authors' latest achievements regarding
the analysis of networks of proteins (interactome networks) by
computing the betweenness centrality measure more efficiently. The
paper introduces the concept of betweenness centrality and then
describes how betweenness computation can help interactome network
analysis. Current sequential implementations of the betweenness
computation do not perform satisfactorily in terms of execution
time. The paper's main contribution is
a speedup technique for the betweenness computation, based on
modified shortest path algorithms for sparse graphs. Three optimized
generic algorithms for betweenness computation are described and
implemented, and their performance is tested against real biological
data, which is part of the IntAct dataset.
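As background for the betweenness computation, the following is a minimal sketch of the well-known Brandes algorithm for unweighted, undirected graphs. It is a baseline, not one of the three optimized algorithms the abstract refers to. The graph is assumed to be given as adjacency lists indexed by node ID, and the function name is hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Betweenness centrality of every node in an unweighted, undirected graph."""
    n = len(adj)
    score = [0.0] * n
    for s in range(n):
        sigma = [0] * n  # number of shortest paths from s
        sigma[s] = 1
        dist = [-1] * n
        dist[s] = 0
        preds = [[] for _ in range(n)]
        order = []  # nodes in non-decreasing distance from s
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back toward s.
        delta = [0.0] * n
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return [x / 2.0 for x in score]
```

One BFS per source gives O(VE) total work on unweighted graphs, which is why sparse interaction networks are a favorable case.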
Abstract: In this paper, a pipelined version of genetic algorithm,
called PLGA, and a corresponding hardware platform are described.
The basic operations of conventional GA (CGA) are made pipelined
using an appropriate selection scheme. The selection operator, used
here, is stochastic in nature and is called SA-selection. This helps
maintain the basic generational nature of the proposed pipelined
GA (PLGA). A number of benchmark problems are used to compare
the performance of conventional roulette-wheel selection and
SA-selection. These include unimodal and multimodal functions with
dimensionality varying from very small to very large. The results show that
the SA-selection scheme performs comparably to
the classical roulette-wheel selection scheme in all
instances when solution quality and rate of convergence are considered.
The speedups obtained by PLGA for different benchmarks
are found to be significant. It is shown that a complete hardware
pipeline can be developed using the proposed scheme, if parallel
evaluation of the fitness expression is possible. In this connection
a low-cost but very fast hardware evaluation unit is described.
Results of simulation experiments show that in a pipelined hardware
environment, PLGA will be much faster than CGA. In terms of
efficiency, PLGA is also found to outperform parallel GA (PGA).
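For comparison purposes, classical roulette-wheel selection can be sketched as follows. The sketch assumes non-negative fitness values with a positive sum, and the function name is hypothetical. SA-selection is not shown because the abstract does not define it.

```python
import random


def roulette_wheel_select(fitness, rng):
    """Pick an index with probability proportional to its fitness.

    `rng` is a random.Random instance, passed in so that runs are reproducible.
    """
    total = sum(fitness)
    pick = rng.random() * total
    running = 0.0
    for index, f in enumerate(fitness):
        running += f
        if pick < running:
            return index
    return len(fitness) - 1  # guard against floating-point round-off
```

Roulette-wheel selection needs the total fitness of the whole population before any individual can be drawn, so every fitness value of a generation must be available first. This dependency is one reason a different selection scheme may suit a pipelined design better.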
Abstract: To make the conventional implicit algorithm applicable on large-scale parallel computers, an interface prediction and correction scheme for the discontinuous finite element method is presented to solve time-dependent neutron transport equations in 2-D cylindrical geometry. Domain decomposition is adopted in the computational domain. Numerical experiments show that our parallel algorithm with explicit prediction and implicit correction has good precision, parallelism and simplicity. In particular, it can reach perfect speedup even on hundreds of processors for large-scale problems.
Abstract: A parallel block method based on Backward
Differentiation Formulas (BDF) is developed for the parallel solution
of stiff Ordinary Differential Equations (ODEs). The most common
methods for solving stiff systems of ODEs are based on implicit
formulae and solved using Newton iteration, which requires repeated
solution of systems of linear equations with coefficient matrix I - hβJ,
where J is the Jacobian matrix of the problem. In this paper,
the matrix operations are parallelized in order to reduce the cost of the
iterations. Numerical results are given to compare the speedup and
efficiency of the parallel algorithm with those of the sequential algorithm.
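The origin of the coefficient matrix I - hβJ can be seen from a short derivation. It assumes a generic implicit formula for y' = f(t, y) in which ψ collects the already known past values; the symbols ψ, F and Δ are introduced here for illustration and do not appear in the abstract.

```latex
\begin{align}
  % Implicit step written as a root-finding problem
  F(y_{n+1}) &= y_{n+1} - h\beta\, f(t_{n+1}, y_{n+1}) - \psi = 0, \\
  % Jacobian of F, with J = \partial f / \partial y
  \frac{\partial F}{\partial y} &= I - h\beta J, \\
  % Newton iteration: one linear solve per iterate
  (I - h\beta J)\,\Delta^{(i)} &= -F\bigl(y_{n+1}^{(i)}\bigr), \qquad
  y_{n+1}^{(i+1)} = y_{n+1}^{(i)} + \Delta^{(i)}.
\end{align}
```

Every Newton iterate therefore requires the solution of a linear system with matrix I - hβJ, which is why the matrix operations dominate the cost of each step.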
Abstract: Many studies have shown that parallelization decreases efficiency [1], [2]. There are many reasons for this decrease. This paper investigates those that appear in the context of parallel data integration. Integration processes generally cannot be divided into packages of identical size (i.e., tasks of identical complexity). The reason is unknown, heterogeneous input data, which result in variable task lengths. Process delay is determined by the slowest processing node and has a detrimental effect on the total processing time. With a real-world example, this study shows that while process delay initially increases as more nodes are introduced, it ultimately decreases again after a certain point. The example uses the cloud computing platform Hadoop and runs inside Amazon's EC2 compute cloud. A stochastic model is set up that can explain this effect.
Abstract: This paper presents an improved image segmentation
model with edge preserving regularization based on the
piecewise-smooth Mumford-Shah functional. A level set formulation
is considered for the Mumford-Shah functional minimization in
segmentation, and the corresponding partial differential equations are
solved by backward Euler discretization. To encourage
edge-preserving regularization, a new edge indicator function is
introduced in the level set framework, in which all the grid points used
to locate the level set curve are considered to avoid blurring the edges,
and a nonlinear smoothness constraint function is applied as a regularization
term to smooth the image in the isophote direction instead of the
gradient direction. In the implementation, some strategies, such as a new
scheme for extending the computation of u+ and u- at the grid points and
speeding up convergence, are studied to improve the efficacy of the
algorithm. The resulting algorithm has been implemented and
compared with previous methods, and has been shown to be efficient
in several cases.