Modified Scaling-Free CORDIC Based Pipelined Parallel MDC FFT and IFFT Architecture for Radix 2^2 Algorithm

An innovative approach to develop modified scaling free CORDIC based two parallel pipelined Multipath Delay Commutator (MDC) FFT and IFFT architectures for radix 22 FFT algorithm is presented. Multipliers and adders are the most important data paths in FFT and IFFT architectures. Multipliers occupy high area and consume more power. In order to optimize the area and power overhead, modified scaling-free CORDIC based complex multiplier is utilized in the proposed design. In general twiddle factor values are stored in RAM block. In the proposed work, modified scaling-free CORDIC based twiddle factor generator unit is used to generate the twiddle factor and efficient switching units are used. In addition to this, four point FFT operations are performed without complex multiplication which helps to reduce area and power in the last two stages of the pipelined architectures. The design proposed in this paper is based on multipath delay commutator method. The proposed design can be extended to any radix 2n based FFT/IFFT algorithm to improve the throughput. The work is synthesized using Synopsys design Compiler using TSMC 90-nm library. The proposed method proves to be better compared to the reference design in terms of area, throughput and power consumption. The comparative analysis of the proposed design with Xilinx FPGA platform is also discussed in the paper.

Simulation Based VLSI Implementation of Fast Efficient Lossless Image Compression System Using Adjusted Binary Code & Golumb Rice Code

The Simulation based VLSI Implementation of FELICS (Fast Efficient Lossless Image Compression System) Algorithm is proposed to provide the lossless image compression and is implemented in simulation oriented VLSI (Very Large Scale Integrated). To analysis the performance of Lossless image compression and to reduce the image without losing image quality and then implemented in VLSI based FELICS algorithm. In FELICS algorithm, which consists of simplified adjusted binary code for Image compression and these compression image is converted in pixel and then implemented in VLSI domain. This parameter is used to achieve high processing speed and minimize the area and power. The simplified adjusted binary code reduces the number of arithmetic operation and achieved high processing speed. The color difference preprocessing is also proposed to improve coding efficiency with simple arithmetic operation. Although VLSI based FELICS Algorithm provides effective solution for hardware architecture design for regular pipelining data flow parallelism with four stages. With two level parallelisms, consecutive pixels can be classified into even and odd samples and the individual hardware engine is dedicated for each one. This method can be further enhanced by multilevel parallelisms.

Design and Analysis of a Low Power High Speed 1 Bit Full Adder Cell Based On TSPC Logic with Multi-Threshold CMOS

An adder is one of the most integral component of a digital system like a digital signal processor or a microprocessor. Being an extremely computationally intensive part of a system, the optimization for speed and power consumption of the adder is of prime importance. In this paper we have designed a 1 bit full adder cell based on dynamic TSPC logic to achieve high speed operation. A high threshold voltage sleep transistor is used to reduce the static power dissipation in standby mode. The circuit is designed and simulated in TSPICE using TSMC 180nm CMOS process. Average power consumption, delay and power-delay product is measured which showed considerable improvement in performance over the existing full adder designs.

Transceiver for Differential Wave Pipe-Lined Serial Interconnect with Surfing

In the literature, surfing technique has been proposed for single ended wave-pipelined serial interconnects to increase the data transfer rate. In this paper a novel surfing technique is proposed for differential wave-pipelined serial interconnects, which uses a 'Controllable inverter pair' for surfing. To evaluate the efficiency of this technique, a transceiver with transmitter, receiver, delay locked loop (DLL) along with 40mm metal 4 interconnects using the proposed surfing technique is implemented in UMC 180nm technology and their performances are studied through post layout simulations. From the study, it is observed that the proposed scheme permits 1.875 times higher data transmission rate compared to the single ended scheme whose maximum data transfer rate is 1.33 GB/s. The proposed scheme has the ability to receive the correct data even with stuck-at-faults in the complementary line.

Design and Implementation of Real-Time Automatic Censoring System on Chip for Radar Detection

Design and implementation of a novel B-ACOSD CFAR algorithm is presented in this paper. It is proposed for detecting radar target in log-normal distribution environment. The BACOSD detector is capable to detect automatically the number interference target in the reference cells and detect the real target by an adaptive threshold. The detector is implemented as a System on Chip on FPGA Altera Stratix II using parallelism and pipelining technique. For a reference window of length 16 cells, the experimental results showed that the processor works properly with a processing speed up to 115.13MHz and processing time0.29 ┬Ás, thus meets real-time requirement for a typical radar system.

Optimization of SAD Algorithm on VLIW DSP

SAD (Sum of Absolute Difference) algorithm is heavily used in motion estimation which is computationally highly demanding process in motion picture encoding. To enhance the performance of motion picture encoding on a VLIW processor, an efficient implementation of SAD algorithm on the VLIW processor is essential. SAD algorithm is programmed as a nested loop with a conditional branch. In VLIW processors, loop is usually optimized by software pipelining, but researches on optimal scheduling of software pipelining for nested loops, especially nested loops with conditional branches are rare. In this paper, we propose an optimal scheduling and implementation of SAD algorithm with conditional branch on a VLIW DSP processor. The proposed optimal scheduling first transforms the nested loop with conditional branch into a single loop with conditional branch with consideration of full utilization of ILP capability of the VLIW processor and realization of earlier escape from the loop. Next, the proposed optimal scheduling applies a modulo scheduling technique developed for single loop. Based on this optimal scheduling strategy, optimal implementation of SAD algorithm on TMS320C67x, a VLIW DSP is presented. Through experiments on TMS320C6713 DSK, it is shown that H.263 encoder with the proposed SAD implementation performs better than other H.263 encoder with other SAD implementations, and that the code size of the optimal SAD implementation is small enough to be appropriate for embedded environments.

FPGA Based Longitudinal and Lateral Controller Implementation for a Small UAV

This paper presents implementation of attitude controller for a small UAV using field programmable gate array (FPGA). Due to the small size constrain a miniature more compact and computationally extensive; autopilot platform is needed for such systems. More over UAV autopilot has to deal with extremely adverse situations in the shortest possible time, while accomplishing its mission. FPGAs in the recent past have rendered themselves as fast, parallel, real time, processing devices in a compact size. This work utilizes this fact and implements different attitude controllers for a small UAV in FPGA, using its parallel processing capabilities. Attitude controller is designed in MATLAB/Simulink environment. The discrete version of this controller is implemented using pipelining followed by retiming, to reduce the critical path and thereby clock period of the controller datapath. Pipelined, retimed, parallel PID controller implementation is done using rapidprototyping and testing efficient development tool of “system generator", which has been developed by Xilinx for FPGA implementation. The improved timing performance enables the controller to react abruptly to any changes made to the attitudes of UAV.

FPGA Implementation of Generalized Maximal Ratio Combining Receiver Diversity

In this paper, we study FPGA implementation of a novel supra-optimal receiver diversity combining technique, generalized maximal ratio combining (GMRC), for wireless transmission over fading channels in SIMO systems. Prior published results using ML-detected GMRC diversity signal driven by BPSK showed superior bit error rate performance to the widely used MRC combining scheme in an imperfect channel estimation (ICE) environment. Under perfect channel estimation conditions, the performance of GMRC and MRC were identical. The main drawback of the GMRC study was that it was theoretical, thus successful FPGA implementation of it using pipeline techniques is needed as a wireless communication test-bed for practical real-life situations. Simulation results showed that the hardware implementation was efficient both in terms of speed and area. Since diversity combining is especially effective in small femto- and picocells, internet-associated wireless peripheral systems are to benefit most from GMRC. As a result, many spinoff applications can be made to the hardware of IP-based 4th generation networks.

Efficient Large Numbers Karatsuba-Ofman Multiplier Designs for Embedded Systems

Long number multiplications (n ≥ 128-bit) are a primitive in most cryptosystems. They can be performed better by using Karatsuba-Ofman technique. This algorithm is easy to parallelize on workstation network and on distributed memory, and it-s known as the practical method of choice. Multiplying long numbers using Karatsuba-Ofman algorithm is fast but is highly recursive. In this paper, we propose different designs of implementing Karatsuba-Ofman multiplier. A mixture of sequential and combinational system design techniques involving pipelining is applied to our proposed designs. Multiplying large numbers can be adapted flexibly to time, area and power criteria. Computationally and occupation constrained in embedded systems such as: smart cards, mobile phones..., multiplication of finite field elements can be achieved more efficiently. The proposed designs are compared to other existing techniques. Mathematical models (Area (n), Delay (n)) of our proposed designs are also elaborated and evaluated on different FPGAs devices.

High Performance VLSI Architecture of 2D Discrete Wavelet Transform with Scalable Lattice Structure

In this paper, we propose a fully-utilized, block-based 2D DWT (discrete wavelet transform) architecture, which consists of four 1D DWT filters with two-channel QMF lattice structure. The proposed architecture requires about 2MN-3N registers to save the intermediate results for higher level decomposition, where M and N stand for the filter length and the row width of the image respectively. Furthermore, the proposed 2D DWT processes in horizontal and vertical directions simultaneously without an idle period, so that it computes the DWT for an N×N image in a period of N2(1-2-2J)/3. Compared to the existing approaches, the proposed architecture shows 100% of hardware utilization and high throughput rates. To mitigate the long critical path delay due to the cascaded lattices, we can apply the pipeline technique with four stages, while retaining 100% of hardware utilization. The proposed architecture can be applied in real-time video signal processing.

A Pipelined FSBM Hardware Architecture for HTDV-H.26x

In MPEG and H.26x standards, to eliminate the temporal redundancy we use motion estimation. Given that the motion estimation stage is very complex in terms of computational effort, a hardware implementation on a re-configurable circuit is crucial for the requirements of different real time multimedia applications. In this paper, we present hardware architecture for motion estimation based on "Full Search Block Matching" (FSBM) algorithm. This architecture presents minimum latency, maximum throughput, full utilization of hardware resources such as embedded memory blocks, and combining both pipelining and parallel processing techniques. Our design is described in VHDL language, verified by simulation and implemented in a Stratix II EP2S130F1020C4 FPGA circuit. The experiment result show that the optimum operating clock frequency of the proposed design is 89MHz which achieves 160M pixels/sec.

Coding based Synchronization Algorithm for Secondary Synchronization Channel in WCDMA

A new code synchronization algorithm is proposed in this paper for the secondary cell-search stage in wideband CDMA systems. Rather than using the Cyclically Permutable (CP) code in the Secondary Synchronization Channel (S-SCH) to simultaneously determine the frame boundary and scrambling code group, the new synchronization algorithm implements the same function with less system complexity and less Mean Acquisition Time (MAT). The Secondary Synchronization Code (SSC) is redesigned by splitting into two sub-sequences. We treat the information of scrambling code group as data bits and use simple time diversity BCH coding for further reliability. It avoids involved and time-costly Reed-Solomon (RS) code computations and comparisons. Analysis and simulation results show that the Synchronization Error Rate (SER) yielded by the new algorithm in Rayleigh fading channels is close to that of the conventional algorithm in the standard. This new synchronization algorithm reduces system complexities, shortens the average cell-search time and can be implemented in the slot-based cell-search pipeline. By taking antenna diversity and pipelining correlation processes, the new algorithm also shows its flexible application in multiple antenna systems.