FPGA Implementation of RSA Cryptosystem

In this paper, the hardware implementation of the RSA public-key cryptographic algorithm is presented. The RSA cryptographic algorithm is depends on the computation of repeated modular exponentials. The Montgomery algorithm is used and modified to reduce hardware resources and to achieve reasonable operating speed for FPGA. An efficient architecture for modular multiplications based on the array multiplier is proposed. We have implemented a RSA cryptosystem based on Montgomery algorithm. As a result, it is shown that proposed architecture contributes to small area and reasonable speed.

2-D Realization of WiMAX Channel Interleaver for Efficient Hardware Implementation

The direct implementation of interleaver functions in WiMAX is not hardware efficient due to presence of complex functions. Also the conventional method i.e. using memories for storing the permutation tables is silicon consuming. This work presents a 2-D transformation for WiMAX channel interleaver functions which reduces the overall hardware complexity to compute the interleaver addresses on the fly. A fully reconfigurable architecture for address generation in WiMAX channel interleaver is presented, which consume 1.1 k-gates in total. It can be configured for any block size and any modulation scheme in WiMAX. The presented architecture can run at a frequency of 200 MHz, thus fully supporting high bandwidth requirements for WiMAX.

Hardware Implementation of Stack-Based Replacement Algorithms

Block replacement algorithms to increase hit ratio have been extensively used in cache memory management. Among basic replacement schemes, LRU and FIFO have been shown to be effective replacement algorithms in terms of hit rates. In this paper, we introduce a flexible stack-based circuit which can be employed in hardware implementation of both LRU and FIFO policies. We propose a simple and efficient architecture such that stack-based replacement algorithms can be implemented without the drawbacks of the traditional architectures. The stack is modular and hence, a set of stack rows can be cascaded depending on the number of blocks in each cache set. Our circuit can be implemented in conjunction with the cache controller and static/dynamic memories to form a cache system. Experimental results exhibit that our proposed circuit provides an average value of 26% improvement in storage bits and its maximum operating frequency is increased by a factor of two

A Pipelined FSBM Hardware Architecture for HTDV-H.26x

In MPEG and H.26x standards, to eliminate the temporal redundancy we use motion estimation. Given that the motion estimation stage is very complex in terms of computational effort, a hardware implementation on a re-configurable circuit is crucial for the requirements of different real time multimedia applications. In this paper, we present hardware architecture for motion estimation based on "Full Search Block Matching" (FSBM) algorithm. This architecture presents minimum latency, maximum throughput, full utilization of hardware resources such as embedded memory blocks, and combining both pipelining and parallel processing techniques. Our design is described in VHDL language, verified by simulation and implemented in a Stratix II EP2S130F1020C4 FPGA circuit. The experiment result show that the optimum operating clock frequency of the proposed design is 89MHz which achieves 160M pixels/sec.

Design of a Neural Networks Classifier for Face Detection

Face detection and recognition has many applications in a variety of fields such as security system, videoconferencing and identification. Face classification is currently implemented in software. A hardware implementation allows real-time processing, but has higher cost and time to-market. The objective of this work is to implement a classifier based on neural networks MLP (Multi-layer Perceptron) for face detection. The MLP is used to classify face and non-face patterns. The systm is described using C language on a P4 (2.4 Ghz) to extract weight values. Then a Hardware implementation is achieved using VHDL based Methodology. We target Xilinx FPGA as the implementation support.

Parallel-computing Approach for FFT Implementation on Digital Signal Processor (DSP)

An efficient parallel form in digital signal processor can improve the algorithm performance. The butterfly structure is an important role in fast Fourier transform (FFT), because its symmetry form is suitable for hardware implementation. Although it can perform a symmetric structure, the performance will be reduced under the data-dependent flow characteristic. Even though recent research which call as novel memory reference reduction methods (NMRRM) for FFT focus on reduce memory reference in twiddle factor, the data-dependent property still exists. In this paper, we propose a parallel-computing approach for FFT implementation on digital signal processor (DSP) which is based on data-independent property and still hold the property of low-memory reference. The proposed method combines final two steps in NMRRM FFT to perform a novel data-independent structure, besides it is very suitable for multi-operation-unit digital signal processor and dual-core system. We have applied the proposed method of radix-2 FFT algorithm in low memory reference on TI TMSC320C64x DSP. Experimental results show the method can reduce 33.8% clock cycles comparing with the NMRRM FFT implementation and keep the low-memory reference property.

An FPGA Implementation of Intelligent Visual Based Fall Detection

Falling has been one of the major concerns and threats to the independence of the elderly in their daily lives. With the worldwide significant growth of the aging population, it is essential to have a promising solution of fall detection which is able to operate at high accuracy in real-time and supports large scale implementation using multiple cameras. Field Programmable Gate Array (FPGA) is a highly promising tool to be used as a hardware accelerator in many emerging embedded vision based system. Thus, it is the main objective of this paper to present an FPGA-based solution of visual based fall detection to meet stringent real-time requirements with high accuracy. The hardware architecture of visual based fall detection which utilizes the pixel locality to reduce memory accesses is proposed. By exploiting the parallel and pipeline architecture of FPGA, our hardware implementation of visual based fall detection using FGPA is able to achieve a performance of 60fps for a series of video analytical functions at VGA resolutions (640x480). The results of this work show that FPGA has great potentials and impacts in enabling large scale vision system in the future healthcare industry due to its flexibility and scalability.

An Efficient Hardware Implementation of Extended and Fast Physical Addressing in Microprocessor-Based Systems Using Programmable Logic

This paper describes an efficient hardware implementation of a new technique for interfacing the data exchange between the microprocessor-based systems and the external devices. This technique, based on the use of software/hardware system and a reduced physical address, enlarges the interfacing capacity of the microprocessor-based systems, uses the Direct Memory Access (DMA) to increases the frequency of the new bus, and improves the speed of data exchange. While using this architecture in microprocessor-based system or in computer, the input of the hardware part of our system will be connected to the bus system, and the output, which is a new bus, will be connected to an external device. The new bus is composed of a data bus, a control bus and an address bus. A Xilinx Integrated Software Environment (ISE) 7.1i has been used for the programmable logic implementation.

A New Hardware Implementation of Manchester Line Decoder

In this paper, we present a simple circuit for Manchester decoding and without using any complicated or programmable devices. This circuit can decode 90kbps of transmitted encoded data; however, greater than this transmission rate can be decoded if high speed devices were used. We also present a new method for extracting the embedded clock from Manchester data in order to use it for serial-to-parallel conversion. All of our experimental measurements have been done using simulation.

Low Complexity Regular LDPC codes for Magnetic Storage Devices

LDPC codes could be used in magnetic storage devices because of their better decoding performance compared to other error correction codes. However, their hardware implementation results in large and complex decoders. This one of the main obstacles the decoders to be incorporated in magnetic storage devices. We construct small high girth and rate 2 columnweight codes from cage graphs. Though these codes have low performance compared to higher column weight codes, they are easier to implement. The ease of implementation makes them more suitable for applications such as magnetic recording. Cages are the smallest known regular distance graphs, which give us the smallest known column-weight 2 codes given the size, girth and rate of the code.

A Novel Recursive Multiplierless Algorithm for 2-D DCT

In this paper, a recursive algorithm for the computation of 2-D DCT using Ramanujan Numbers is proposed. With this algorithm, the floating-point multiplication is completely eliminated and hence the multiplierless algorithm can be implemented using shifts and additions only. The orthogonality of the recursive kernel is well maintained through matrix factorization to reduce the computational complexity. The inherent parallel structure yields simpler programming and hardware implementation and provides log 1 2 3 2 N N-N+ additions and N N 2 log 2 shifts which is very much less complex when compared to other recent multiplierless algorithms.

Accelerating Integer Neural Networks On Low Cost DSPs

In this paper, low end Digital Signal Processors (DSPs) are applied to accelerate integer neural networks. The use of DSPs to accelerate neural networks has been a topic of study for some time, and has demonstrated significant performance improvements. Recently, work has been done on integer only neural networks, which greatly reduces hardware requirements, and thus allows for cheaper hardware implementation. DSPs with Arithmetic Logic Units (ALUs) that support floating or fixed point arithmetic are generally more expensive than their integer only counterparts due to increased circuit complexity. However if the need for floating or fixed point math operation can be removed, then simpler, lower cost DSPs can be used. To achieve this, an integer only neural network is created in this paper, which is then accelerated by using DSP instructions to improve performance.

Efficient Hardware Implementation of an Elliptic Curve Cryptographic Processor Over GF (2 163)

A new and highly efficient architecture for elliptic curve scalar point multiplication which is optimized for a binary field recommended by NIST and is well-suited for elliptic curve cryptographic (ECC) applications is presented. To achieve the maximum architectural and timing improvements we have reorganized and reordered the critical path of the Lopez-Dahab scalar point multiplication architecture such that logic structures are implemented in parallel and operations in the critical path are diverted to noncritical paths. With G=41, the proposed design is capable of performing a field multiplication over the extension field with degree 163 in 11.92 s with the maximum achievable frequency of 251 MHz on Xilinx Virtex-4 (XC4VLX200) while 22% of the chip area is occupied, where G is the digit size of the underlying digit-serial finite field multiplier.

A Novel Genetic Algorithm Designed for Hardware Implementation

A new genetic algorithm, termed the 'optimum individual monogenetic genetic algorithm' (OIMGA), is presented whose properties have been deliberately designed to be well suited to hardware implementation. Specific design criteria were to ensure fast access to the individuals in the population, to keep the required silicon area for hardware implementation to a minimum and to incorporate flexibility in the structure for the targeting of a range of applications. The first two criteria are met by retaining only the current optimum individual, thereby guaranteeing a small memory requirement that can easily be stored in fast on-chip memory. Also, OIMGA can be easily reconfigured to allow the investigation of problems that normally warrant either large GA populations or individuals many genes in length. Local convergence is achieved in OIMGA by retaining elite individuals, while population diversity is ensured by continually searching for the best individuals in fresh regions of the search space. The results given in this paper demonstrate that both the performance of OIMGA and its convergence time are superior to those of a range of existing hardware GA implementations.

A New Digital Transceiver Circuit for Asynchronous Communication

A new digital transceiver circuit for asynchronous frame detection is proposed where both the transmitter and receiver contain all digital components, thereby avoiding possible use of conventional devices like monostable multivibrators with unstable external components such as resistances and capacitances. The proposed receiver circuit, in particular, uses a combinational logic block yielding an output which changes its state as soon as the start bit of a new frame is detected. This, in turn, helps in generating an efficient receiver sampling clock. A data latching circuit is also used in the receiver to latch the recovered data bits in any new frame. The proposed receiver structure is also extended from 4- bit information to any general n data bits within a frame with a common expression for the output of the combinational logic block. Performance of the proposed hardware design is evaluated in terms of time delay, reliability and robustness in comparison with the standard schemes using monostable multivibrators. It is observed from hardware implementation that the proposed circuit achieves almost 33 percent speed up over any conventional circuit.