Abstract: This paper presents a low-area and fully-reconfigurable Fast Fourier Transform (FFT) hardware design for 3GPP-LTE communication standard. It can fully support 32 different FFT sizes, up to 2048 FFT points. Besides, a special processing element is developed for making reconfigurable computing characteristics possible, while first-in first-out (FIFO) scheduling scheme design technique is proposed for hardware-friendly FIFO resource arranging. In a synthesis chip realization via TSMC 40 nm CMOS technology, the hardware circuit only occupies core area of 0.2325 mm2 and dissipates 233.5 mW at maximal operating frequency of 250 MHz.
Abstract: Motion estimation occupies the heaviest computation in HEVC (high efficiency video coding). Many fast algorithms such as TZS (test zone search) have been proposed to reduce the computation. Still the huge computation of the motion estimation is a critical issue in the implementation of HEVC video codec. In this paper, motion estimator architecture with optimized number of PEs (processing element) is presented by exploiting early termination. It also reduces hardware size by exploiting parallel processing. The presented motion estimator architecture has 8 PEs, and it can efficiently perform TZS with very high utilization of PEs.
Abstract: Wireless sensor networks (WSNs), are constantly in demand to process information more rapidly with less energy and area cost. Presently, processor based solutions have difficult to achieve high processing speed with low-power consumption. This paper presents a simple and accurate data processing scheme for low power wireless sensor node, based on reduced number of processing element (PE). The presented model provides a simple recursive structure (SRS) to process the sampled data in the wireless sensor environment and to reduce the power consumption in wireless sensor node. Based on this model, to process the incoming samples and produce a smaller amount of data sufficient to reconstruct the original signal. The ModelSim simulator used to simulate SRS structure. Functional simulation is carried out for the validation of the presented architecture. Xilinx Power Estimator (XPE) tool is used to measure the power consumption. The experimental results show the average power consumption of 91 mW; this is 42% improvement compared to the folded tree architecture.
Abstract: There are about 1% of the world population suffering
from the hidden disability known as epilepsy and major developing
countries are not fully equipped to counter this problem. In order to
reduce the inconvenience and danger of epilepsy, different methods
have been researched by using a artificial neural network (ANN)
classification to distinguish epileptic waveforms from normal brain
waveforms. This paper outlines the aim of achieving massive
ANN parallelization through a dedicated hardware using bit-serial
processing. The design of this bit-serial Neural Processing Element
(NPE) is presented which implements the functionality of a complete
neuron using variable accuracy. The proposed design has been tested
taking into consideration non-idealities of a hardware ANN. The NPE
consists of a bit-serial multiplier which uses only 16 logic elements
on an Altera Cyclone IV FPGA and a bit-serial ALU as well as a
look-up table. Arrays of NPEs can be driven by a single controller
which executes the neural processing algorithm. In conclusion, the
proposed compact NPE design allows the construction of complex
hardware ANNs that can be implemented in a portable equipment
that suits the needs of a single epileptic patient in his or her daily
activities to predict the occurrences of impending tonic conic seizures.
Abstract: This paper proposes and implements an core transform architecture, which is one of the major processes in HEVC video compression standard. The proposed core transform architecture is implemented with only adders and shifters instead of area-consuming multipliers. Shifters in the proposed core transform architecture are implemented in wires and multiplexers, which significantly reduces chip area. Also, it can process from 4×4 to 16×16 blocks with common hardware by reusing processing elements. Designed core transform architecture in 0.13um technology can process a 16×16 block with 2-D transform in 130 cycles, and its gate count is 101,015 gates.
Abstract: Fixed-point simulation results are used for the performance measure of inverting matrices using a reconfigurable processing element. Matrices are inverted using the Cholesky decomposition algorithm. The reconfigurable processing element is capable of all required mathematical operations. The fixed-point word length analysis is based on simulations of different condition numbers and different matrix sizes.
Abstract: A lot of Scientific and Engineering problems require the solution of large systems of linear equations of the form bAx in an effective manner. LU-Decomposition offers good choices for solving this problem. Our approach is to find the lower bound of processing elements needed for this purpose. Here is used the so called Omega calculus, as a computational method for solving problems via their corresponding Diophantine relation. From the corresponding algorithm is formed a system of linear diophantine equalities using the domain of computation which is given by the set of lattice points inside the polyhedron. Then is run the Mathematica program DiophantineGF.m. This program calculates the generating function from which is possible to find the number of solutions to the system of Diophantine equalities, which in fact gives the lower bound for the number of processors needed for the corresponding algorithm. There is given a mathematical explanation of the problem as well. Keywordsgenerating function, lattice points in polyhedron, lower bound of processor elements, system of Diophantine equationsand : calculus.
Abstract: Fixed-point simulation results are used for the
performance measure of inverting matrices by Cholesky
decomposition. The fixed-point Cholesky decomposition algorithm
is implemented using a fixed-point reconfigurable processing
element. The reconfigurable processing element provides all
mathematical operations required by Cholesky decomposition. The
fixed-point word length analysis is based on simulations using
different condition numbers and different matrix sizes. Simulation
results show that 16 bits word length gives sufficient performance
for small matrices with low condition number. Larger matrices and
higher condition numbers require more dynamic range for a fixedpoint
implementation.
Abstract: Speedups from mapping four real-life DSP
applications on an embedded system-on-chip that couples coarsegrained
reconfigurable logic with an instruction-set processor are
presented. The reconfigurable logic is realized by a 2-Dimensional
Array of Processing Elements. A design flow for improving
application-s performance is proposed. Critical software parts, called
kernels, are accelerated on the Coarse-Grained Reconfigurable
Array. The kernels are detected by profiling the source code. For
mapping the detected kernels on the reconfigurable logic a prioritybased
mapping algorithm has been developed. Two 4x4 array
architectures, which differ in their interconnection structure among
the Processing Elements, are considered. The experiments for eight
different instances of a generic system show that important overall
application speedups have been reported for the four applications.
The performance improvements range from 1.86 to 3.67, with an
average value of 2.53, compared with an all-software execution.
These speedups are quite close to the maximum theoretical speedups
imposed by Amdahl-s law.