Abstract: Advances in processors architecture, such as multicore,
increase the size of complexity of parallel computer systems.
With multi-core architecture there are different parallel languages
that can be used to run parallel programs. One of these languages is
OpenMP which embedded in C/Cµ or FORTRAN. Because of this
new architecture and the complexity, it is very important to evaluate
the performance of OpenMP constructs, kernels, and application
program on multi-core systems. Performance is the activity of
collecting the information about the execution characteristics of a
program. Performance tools consists of at least three interfacing
software layers, including instrumentation, measurement, and
analysis. The instrumentation layer defines the measured
performance events. The measurement layer determines what
performance event is actually captured and how it is measured by the
tool. The analysis layer processes the performance data and
summarizes it into a form that can be displayed in performance tools.
In this paper, a number of OpenMP performance tools are surveyed,
explaining how each is used to collect, analyse, and display data
collection.
Abstract: Although the Vietnamese catfish farming has grown
at very high rates in recent years, the industry has also faced many
problems affecting its sustainability. This paper studies the
perceptions of catfish farmers regarding risk and risk management
strategies in their production activities. Specifically, the study aims
to measure the consequences, likelihoods, and levels of risks as well
as the efficacy of risk management in Vietnamese catfish farming.
Data for the study were collected through a sample of 261 catfish
farmers in the Mekong Delta, Vietnam using a questionnaire survey
in 2008. Results show that, in general, price and production risks
were perceived as the most important risks. Farm management and
technical measures were perceived more effective than other kinds of
risk management strategies in risk reduction. Although price risks
were rated as important risks, price risk management strategies were
not perceived as important measures for risk mitigation. The results
of the study are discussed to provide implications for various
industry stakeholders, including policy makers, processors, advisors,
and developers of new risk management strategies.
Abstract: Many problems in computer vision and image
processing present potential for parallel implementations through one
of the three major paradigms of geometric parallelism, algorithmic
parallelism and processor farming. Static process scheduling
techniques are used successfully to exploit geometric and algorithmic
parallelism, while dynamic process scheduling is better suited to
dealing with the independent processes inherent in the process
farming paradigm. This paper considers the application of parallel or
multi-computers to a class of problems exhibiting spatial data
characteristic of the geometric paradigm. However, by using
processor farming paradigm, a dynamic scheduling technique is
developed to suit the MIMD structure of the multi-computers. A
hybrid scheme of scheduling is also developed and compared with
the other schemes. The specific problem chosen for the investigation
is the Hough transform for line detection.
Abstract: In this work we present a solution for DAGC (Digital
Automatic Gain Control) in WLAN receivers compatible to IEEE 802.11a/g standard. Those standards define communication in 5/2.4
GHz band using Orthogonal Frequency Division Multiplexing OFDM modulation scheme. WLAN Transceiver that we have used
enables gain control over Low Noise Amplifier (LNA) and a
Variable Gain Amplifier (VGA). The control over those signals is
performed in our digital baseband processor using dedicated hardware block DAGC. DAGC in this process is used to automatically control the VGA and LNA in order to achieve better
signal-to-noise ratio, decrease FER (Frame Error Rate) and hold the
average power of the baseband signal close to the desired set point.
DAGC function in baseband processor is done in few steps: measuring power levels of baseband samples of an RF signal,accumulating the differences between the measured power level and
actual gain setting, adjusting a gain factor of the accumulation, and
applying the adjusted gain factor the baseband values. Based on the measurement results of RSSI signal dependence to input power we have concluded that this digital AGC can be implemented applying
the simple linearization of the RSSI. This solution is very simple but also effective and reduces complexity and power consumption of the
DAGC. This DAGC is implemented and tested both in FPGA and in ASIC as a part of our WLAN baseband processor. Finally, we have integrated this circuit in a compact WLAN PCMCIA board based on MAC and baseband ASIC chips designed from us.
Abstract: Load balancing in distributed computer systems is the
process of redistributing the work load among processors in the
system to improve system performance. Most of previous research in
using fuzzy logic for the purpose of load balancing has only
concentrated in utilizing fuzzy logic concepts in describing
processors load and tasks execution length. The responsibility of the
fuzzy-based load balancing process itself, however, has not been
discussed and in most reported work is assumed to be performed in a
distributed fashion by all nodes in the network. This paper proposes a
new fuzzy dynamic load balancing algorithm for homogenous
distributed systems. The proposed algorithm utilizes fuzzy logic in
dealing with inaccurate load information, making load distribution
decisions, and maintaining overall system stability. In terms of
control, we propose a new approach that specifies how, when, and by
which node the load balancing is implemented. Our approach is
called Centralized-But-Distributed (CBD).
Abstract: Design and implementation of a novel B-ACOSD CFAR algorithm is presented in this paper. It is proposed for detecting radar target in log-normal distribution environment. The BACOSD detector is capable to detect automatically the number interference target in the reference cells and detect the real target by an adaptive threshold. The detector is implemented as a System on Chip on FPGA Altera Stratix II using parallelism and pipelining technique. For a reference window of length 16 cells, the experimental results showed that the processor works properly with a processing speed up to 115.13MHz and processing time0.29 ┬Ás, thus meets real-time requirement for a typical radar system.
Abstract: In this paper, an automatic detecting algorithm for
QRS complex detecting was applied for analyzing ECG recordings
and five criteria for dangerous arrhythmia diagnosing are applied for a
protocol type of automatic arrhythmia diagnosing system. The
automatic detecting algorithm applied in this paper detected the
distribution of QRS complexes in ECG recordings and related
information, such as heart rate and RR interval. In this investigation,
twenty sampled ECG recordings of patients with different pathologic
conditions were collected for off-line analysis. A combinative
application of four digital filters for bettering ECG signals and
promoting detecting rate for QRS complex was proposed as
pre-processing. Both of hardware filters and digital filters were
applied to eliminate different types of noises mixed with ECG
recordings. Then, an automatic detecting algorithm of QRS complex
was applied for verifying the distribution of QRS complex. Finally,
the quantitative clinic criteria for diagnosing arrhythmia were
programmed in a practical application for automatic arrhythmia
diagnosing as a post-processor. The results of diagnoses by automatic
dangerous arrhythmia diagnosing were compared with the results of
off-line diagnoses by experienced clinic physicians. The results of
comparison showed the application of automatic dangerous
arrhythmia diagnosis performed a matching rate of 95% compared
with an experienced physician-s diagnoses.
Abstract: Efficient modulo 2n+1 adders are important for
several applications including residue number system, digital signal
processors and cryptography algorithms. In this paper we present a
novel modulo 2n+1 addition algorithm for a recently represented
number system. The proposed approach is introduced for the
reduction of the power dissipated. In a conventional modulo 2n+1
adder, all operands have (n+1)-bit length. To avoid using (n+1)-bit
circuits, the diminished-1 and carry save diminished-1 number
systems can be effectively used in applications. In the paper, we also
derive two new architectures for designing modulo 2n+1 adder, based
on n-bit ripple-carry adder. The first architecture is a faster design
whereas the second one uses less hardware. In the proposed method,
the special treatment required for zero operands in Diminished-1
number system is removed. In the fastest modulo 2n+1 adders in
normal binary system, there are 3-operand adders. This problem is
also resolved in this paper. The proposed architectures are compared
with some efficient adders based on ripple-carry adder and highspeed
adder. It is shown that the hardware overhead and power
consumption will be reduced. As well as power reduction, in some
cases, power-delay product will be also reduced.
Abstract: This paper presents an efficient VLSI architecture
design to achieve real time video processing using Full-Search Block
Matching (FSBM) algorithm. The design employs parallel bank
architecture with minimum latency, maximum throughput, and full
hardware utilization. We use nine parallel processors in our
architecture and each controlled by a state machine. State machine
control implementation makes the design very simple and cost
effective. The design is implemented using VHDL and the
programming techniques we incorporated makes the design
completely programmable in the sense that the search ranges and the
block sizes can be varied to suit any given requirements. The design
can operate at frequencies up to 36 MHz and it can function in QCIF
and CIF video resolution at 1.46 MHz and 5.86 MHz, respectively.
Abstract: A Simultaneous Multithreading (SMT) Processor is
capable of executing instructions from multiple threads in the same
cycle. SMT in fact was introduced as a powerful architecture to
superscalar to increase the throughput of the processor.
Simultaneous Multithreading is a technique that permits multiple
instructions from multiple independent applications or threads to
compete limited resources each cycle. While the fetch unit has been
identified as one of the major bottlenecks of SMT architecture, several
fetch schemes were proposed by prior works to enhance the fetching
efficiency and overall performance.
In this paper, we propose a novel fetch policy called queue situation
identifier (QSI) which counts some kind of long latency instructions of
each thread each cycle then properly selects which threads to fetch
next cycle. Simulation results show that in best case our fetch policy
can achieve 30% on speedup and also can reduce the data cache level 1
miss rate.
Abstract: This paper presents software tools that convert the C/Cµ floating point source code for a DSP algorithm into a fixedpoint simulation model that can be used to evaluate the numericalperformance of the algorithm on several different fixed pointplatforms including microprocessors, DSPs and FPGAs. The tools use a novel system for maintaining binary point informationso that the conversion from floating point to fixed point isautomated and the resulting fixed point algorithm achieves maximum possible precision. A configurable architecture is used during the simulation phase so that the algorithm can produce a bit-exact output for several different target devices.
Abstract: The aim of this paper is to investigate the
performance of the developed two point block method designed for
two processors for solving directly non stiff large systems of higher
order ordinary differential equations (ODEs). The method calculates
the numerical solution at two points simultaneously and produces
two new equally spaced solution values within a block and it is
possible to assign the computational tasks at each time step to a
single processor. The algorithm of the method was developed in C
language and the parallel computation was done on a parallel shared
memory environment. Numerical results are given to compare the
efficiency of the developed method to the sequential timing. For
large problems, the parallel implementation produced 1.95 speed-up
and 98% efficiency for the two processors.
Abstract: SAD (Sum of Absolute Difference) algorithm is
heavily used in motion estimation which is computationally highly
demanding process in motion picture encoding. To enhance the
performance of motion picture encoding on a VLIW processor, an
efficient implementation of SAD algorithm on the VLIW processor is
essential. SAD algorithm is programmed as a nested loop with a
conditional branch. In VLIW processors, loop is usually optimized by
software pipelining, but researches on optimal scheduling of software
pipelining for nested loops, especially nested loops with conditional
branches are rare. In this paper, we propose an optimal scheduling and
implementation of SAD algorithm with conditional branch on a VLIW
DSP processor. The proposed optimal scheduling first transforms the
nested loop with conditional branch into a single loop with conditional
branch with consideration of full utilization of ILP capability of the
VLIW processor and realization of earlier escape from the loop. Next,
the proposed optimal scheduling applies a modulo scheduling
technique developed for single loop. Based on this optimal scheduling
strategy, optimal implementation of SAD algorithm on TMS320C67x,
a VLIW DSP is presented. Through experiments on TMS320C6713
DSK, it is shown that H.263 encoder with the proposed SAD
implementation performs better than other H.263 encoder with other
SAD implementations, and that the code size of the optimal SAD
implementation is small enough to be appropriate for embedded
environments.
Abstract: This paper addresses control of commutation of switched reluctance (SR) motor without the use of a physical position detector. Rotor position detection schemes for SR motor based on magnetisation characteristics of the motor use normal excitation or applied current /voltage pulses. The resulting schemes are referred to as passive or active methods respectively. The research effort is in realizing an economical sensorless SR rotor position detector that is accurate, reliable and robust to suit a particular application. An effective and reliable means of generating commutation signals of an SR motor based on inductance profile of its stator windings determined using active probing technique is presented. The scheme has been validated online using a 4-phase 8/6 SR motor and an 8-bit processor.
Abstract: IPsec protocol[1] is a set of security extensions
developed by the IETF and it provides privacy and authentication
services at the IP layer by using modern cryptography. In this paper,
we describe both of H/W and S/W architectures of our router system,
SRS-10. The system is designed to support high performance routing
and IPsec VPN. Especially, we used Cavium-s CN2560 processor to
implement IPsec processing in inline-mode.
Abstract: We suggest a novel method to incorporate longterm
redundancy (LTR) in signal time domain compression
methods. The proposition is based on block-sorting and curve
simplification. The proposition is illustrated on the ECG
signal as a post-processor for the FAN method. Test
applications on the new so-obtained FAN+ method using the
MIT-BIH database show substantial improvement of the
compression ratio-distortion behavior for a higher quality
reconstructed signal.
Abstract: ''Cocktail party problem'' is well known as one of the human auditory abilities. We can recognize the specific sound that we want to listen by this ability even if a lot of undesirable sounds or noises are mixed. Blind source separation (BSS) based on independent component analysis (ICA) is one of the methods by which we can separate only a special signal from their mixed signals with simple hypothesis. In this paper, we propose an online approach for blind source separation using the sliding DFT and the time domain independent component analysis. The proposed method can reduce calculation complexity in comparison with conventional methods, and can be applied to parallel processing by using digital signal processors (DSPs) and so on. We evaluate this method and show its availability.
Abstract: Performance of a limited Round-Robin (RR) rule is
studied in order to clarify the characteristics of a realistic sharing
model of a processor. Under the limited RR rule, the processor
allocates to each request a fixed amount of time, called a quantum, in a
fixed order. The sum of the requests being allocated these quanta is
kept below a fixed value. Arriving requests that cannot be allocated
quanta because of such a restriction are queued or rejected. Practical
performance measures, such as the relationship between the mean
sojourn time, the mean number of requests, or the loss probability and
the quantum size are evaluated via simulation. In the evaluation, the
requested service time of an arriving request is converted into a
quantum number. One of these quanta is included in an RR cycle,
which means a series of quanta allocated to each request in a fixed
order. The service time of the arriving request can be evaluated using
the number of RR cycles required to complete the service, the number
of requests receiving service, and the quantum size. Then an increase
or decrease in the number of quanta that are necessary before service is
completed is reevaluated at the arrival or departure of other requests.
Tracking these events and calculations enables us to analyze the
performance of our limited RR rule. In particular, we obtain the most
suitable quantum size, which minimizes the mean sojourn time, for the
case in which the switching time for each quantum is considered.
Abstract: Application-Specific Instruction (ASI ) set Processors
(ASIP) have become an important design choice for embedded
systems due to runtime flexibility, which cannot be provided by
custom ASIC solutions. One major bottleneck in maximizing ASIP
performance is the limitation on the data bandwidth between the
General Purpose Register File (GPRF) and ASIs. This paper presents
the Implicit Registers (IRs) to provide the desirable data bandwidth.
An ASI Input/Output model is proposed to formulate the overheads of
the additional data transfer between the GPRF and IRs, therefore,
an IRs allocation algorithm is used to achieve the better performance
by minimizing the number of extra data transfer instructions. The
experiment results show an up to 3.33x speedup compared to the
results without using IRs.
Abstract: With the necessity of increased processing capacity with less energy consumption; power aware multiprocessor system has gained more attention in the recent future. One of the additional challenges that is to be solved in a multi-processor system when compared to uni-processor system is job allocation. This paper presents a novel task dependent job allocation algorithm: Energy centric- Allocation (Ec-A) and Rate Monotonic (RM) scheduling to minimize energy consumption in a multiprocessor system. A simulation analysis is carried out to verify the performance increase with reduction in energy consumption and required number of processors in the system.