A Low-Area Fully-Reconfigurable Hardware Design of Fast Fourier Transform System for 3GPP-LTE Standard

This paper presents a low-area and fully-reconfigurable Fast Fourier Transform (FFT) hardware design for 3GPP-LTE communication standard. It can fully support 32 different FFT sizes, up to 2048 FFT points. Besides, a special processing element is developed for making reconfigurable computing characteristics possible, while first-in first-out (FIFO) scheduling scheme design technique is proposed for hardware-friendly FIFO resource arranging. In a synthesis chip realization via TSMC 40 nm CMOS technology, the hardware circuit only occupies core area of 0.2325 mm2 and dissipates 233.5 mW at maximal operating frequency of 250 MHz.

Motion Estimator Architecture with Optimized Number of Processing Elements for High Efficiency Video Coding

Motion estimation occupies the heaviest computation in HEVC (high efficiency video coding). Many fast algorithms such as TZS (test zone search) have been proposed to reduce the computation. Still the huge computation of the motion estimation is a critical issue in the implementation of HEVC video codec. In this paper, motion estimator architecture with optimized number of PEs (processing element) is presented by exploiting early termination. It also reduces hardware size by exploiting parallel processing. The presented motion estimator architecture has 8 PEs, and it can efficiently perform TZS with very high utilization of PEs.

Enhancing the Performance of Wireless Sensor Networks Using Low Power Design

Wireless sensor networks (WSNs), are constantly in demand to process information more rapidly with less energy and area cost. Presently, processor based solutions have difficult to achieve high processing speed with low-power consumption. This paper presents a simple and accurate data processing scheme for low power wireless sensor node, based on reduced number of processing element (PE). The presented model provides a simple recursive structure (SRS) to process the sampled data in the wireless sensor environment and to reduce the power consumption in wireless sensor node. Based on this model, to process the incoming samples and produce a smaller amount of data sufficient to reconstruct the original signal. The ModelSim simulator used to simulate SRS structure. Functional simulation is carried out for the validation of the presented architecture. Xilinx Power Estimator (XPE) tool is used to measure the power consumption. The experimental results show the average power consumption of 91 mW; this is 42% improvement compared to the folded tree architecture.

Massively-Parallel Bit-Serial Neural Networks for Fast Epilepsy Diagnosis: A Feasibility Study

There are about 1% of the world population suffering from the hidden disability known as epilepsy and major developing countries are not fully equipped to counter this problem. In order to reduce the inconvenience and danger of epilepsy, different methods have been researched by using a artificial neural network (ANN) classification to distinguish epileptic waveforms from normal brain waveforms. This paper outlines the aim of achieving massive ANN parallelization through a dedicated hardware using bit-serial processing. The design of this bit-serial Neural Processing Element (NPE) is presented which implements the functionality of a complete neuron using variable accuracy. The proposed design has been tested taking into consideration non-idealities of a hardware ANN. The NPE consists of a bit-serial multiplier which uses only 16 logic elements on an Altera Cyclone IV FPGA and a bit-serial ALU as well as a look-up table. Arrays of NPEs can be driven by a single controller which executes the neural processing algorithm. In conclusion, the proposed compact NPE design allows the construction of complex hardware ANNs that can be implemented in a portable equipment that suits the needs of a single epileptic patient in his or her daily activities to predict the occurrences of impending tonic conic seizures.

Design of Low-Area HEVC Core Transform Architecture

This paper proposes and implements an core transform architecture, which is one of the major processes in HEVC video compression standard. The proposed core transform architecture is implemented with only adders and shifters instead of area-consuming multipliers. Shifters in the proposed core transform architecture are implemented in wires and multiplexers, which significantly reduces chip area. Also, it can process from 4×4 to 16×16 blocks with common hardware by reusing processing elements. Designed core transform architecture in 0.13um technology can process a 16×16 block with 2-D transform in 130 cycles, and its gate count is 101,015 gates.

A Reconfigurable Processing Element Implementation for Matrix Inversion Using Cholesky Decomposition

Fixed-point simulation results are used for the performance measure of inverting matrices using a reconfigurable processing element. Matrices are inverted using the Cholesky decomposition algorithm. The reconfigurable processing element is capable of all required mathematical operations. The fixed-point word length analysis is based on simulations of different condition numbers and different matrix sizes.

An Implementation of MacMahon's Partition Analysis in Ordering the Lower Bound of Processing Elements for the Algorithm of LU Decomposition

A lot of Scientific and Engineering problems require the solution of large systems of linear equations of the form bAx in an effective manner. LU-Decomposition offers good choices for solving this problem. Our approach is to find the lower bound of processing elements needed for this purpose. Here is used the so called Omega calculus, as a computational method for solving problems via their corresponding Diophantine relation. From the corresponding algorithm is formed a system of linear diophantine equalities using the domain of computation which is given by the set of lattice points inside the polyhedron. Then is run the Mathematica program DiophantineGF.m. This program calculates the generating function from which is possible to find the number of solutions to the system of Diophantine equalities, which in fact gives the lower bound for the number of processors needed for the corresponding algorithm. There is given a mathematical explanation of the problem as well. Keywordsgenerating function, lattice points in polyhedron, lower bound of processor elements, system of Diophantine equationsand : calculus.

A Reconfigurable Processing Element for Cholesky Decomposition and Matrix Inversion

Fixed-point simulation results are used for the performance measure of inverting matrices by Cholesky decomposition. The fixed-point Cholesky decomposition algorithm is implemented using a fixed-point reconfigurable processing element. The reconfigurable processing element provides all mathematical operations required by Cholesky decomposition. The fixed-point word length analysis is based on simulations using different condition numbers and different matrix sizes. Simulation results show that 16 bits word length gives sufficient performance for small matrices with low condition number. Larger matrices and higher condition numbers require more dynamic range for a fixedpoint implementation.

Performance Improvements of DSP Applications on a Generic Reconfigurable Platform

Speedups from mapping four real-life DSP applications on an embedded system-on-chip that couples coarsegrained reconfigurable logic with an instruction-set processor are presented. The reconfigurable logic is realized by a 2-Dimensional Array of Processing Elements. A design flow for improving application-s performance is proposed. Critical software parts, called kernels, are accelerated on the Coarse-Grained Reconfigurable Array. The kernels are detected by profiling the source code. For mapping the detected kernels on the reconfigurable logic a prioritybased mapping algorithm has been developed. Two 4x4 array architectures, which differ in their interconnection structure among the Processing Elements, are considered. The experiments for eight different instances of a generic system show that important overall application speedups have been reported for the four applications. The performance improvements range from 1.86 to 3.67, with an average value of 2.53, compared with an all-software execution. These speedups are quite close to the maximum theoretical speedups imposed by Amdahl-s law.