Abstract: In this paper, the hardware implementation of the
RSA public-key cryptographic algorithm is presented. The RSA
cryptographic algorithm is depends on the computation of repeated
modular exponentials.
The Montgomery algorithm is used and modified to reduce
hardware resources and to achieve reasonable operating speed for
FPGA. An efficient architecture for modular multiplications based on
the array multiplier is proposed. We have implemented a RSA
cryptosystem based on Montgomery algorithm. As a result, it is
shown that proposed architecture contributes to small area and
reasonable speed.
Abstract: The direct implementation of interleaver functions
in WiMAX is not hardware efficient due to presence of complex
functions. Also the conventional method i.e. using memories for
storing the permutation tables is silicon consuming. This work
presents a 2-D transformation for WiMAX channel interleaver
functions which reduces the overall hardware complexity to
compute the interleaver addresses on the fly. A fully reconfigurable
architecture for address generation in WiMAX
channel interleaver is presented, which consume 1.1 k-gates in
total. It can be configured for any block size and any modulation
scheme in WiMAX. The presented architecture can run at a
frequency of 200 MHz, thus fully supporting high bandwidth
requirements for WiMAX.
Abstract: Block replacement algorithms to increase hit ratio
have been extensively used in cache memory management. Among
basic replacement schemes, LRU and FIFO have been shown to be
effective replacement algorithms in terms of hit rates. In this paper,
we introduce a flexible stack-based circuit which can be employed in
hardware implementation of both LRU and FIFO policies. We
propose a simple and efficient architecture such that stack-based
replacement algorithms can be implemented without the drawbacks
of the traditional architectures. The stack is modular and hence, a set
of stack rows can be cascaded depending on the number of blocks in
each cache set. Our circuit can be implemented in conjunction with
the cache controller and static/dynamic memories to form a cache
system. Experimental results exhibit that our proposed circuit
provides an average value of 26% improvement in storage bits and its
maximum operating frequency is increased by a factor of two
Abstract: In MPEG and H.26x standards, to eliminate the
temporal redundancy we use motion estimation. Given that the
motion estimation stage is very complex in terms of computational
effort, a hardware implementation on a re-configurable circuit is
crucial for the requirements of different real time multimedia
applications. In this paper, we present hardware architecture for
motion estimation based on "Full Search Block Matching" (FSBM)
algorithm. This architecture presents minimum latency, maximum
throughput, full utilization of hardware resources such as embedded
memory blocks, and combining both pipelining and parallel
processing techniques. Our design is described in VHDL language,
verified by simulation and implemented in a Stratix II
EP2S130F1020C4 FPGA circuit. The experiment result show that the
optimum operating clock frequency of the proposed design is 89MHz
which achieves 160M pixels/sec.
Abstract: Face detection and recognition has many applications
in a variety of fields such as security system, videoconferencing and
identification. Face classification is currently implemented in
software. A hardware implementation allows real-time processing,
but has higher cost and time to-market.
The objective of this work is to implement a classifier based on
neural networks MLP (Multi-layer Perceptron) for face detection.
The MLP is used to classify face and non-face patterns. The systm is
described using C language on a P4 (2.4 Ghz) to extract weight
values. Then a Hardware implementation is achieved using VHDL
based Methodology. We target Xilinx FPGA as the implementation
support.
Abstract: An efficient parallel form in digital signal processor can improve the algorithm performance. The butterfly structure is an important role in fast Fourier transform (FFT), because its symmetry form is suitable for hardware implementation. Although it can perform a symmetric structure, the performance will be reduced under the data-dependent flow characteristic. Even though recent research which call as novel memory reference reduction methods (NMRRM) for FFT focus on reduce memory reference in twiddle factor, the data-dependent property still exists. In this paper, we propose a parallel-computing approach for FFT implementation on digital signal processor (DSP) which is based on data-independent property and still hold the property of low-memory reference. The proposed method combines final two steps in NMRRM FFT to perform a novel data-independent structure, besides it is very suitable for multi-operation-unit digital signal processor and dual-core system. We have applied the proposed method of radix-2 FFT algorithm in low memory reference on TI TMSC320C64x DSP. Experimental results show the method can reduce 33.8% clock cycles comparing with the NMRRM FFT implementation and keep the low-memory reference property.
Abstract: Falling has been one of the major concerns and threats
to the independence of the elderly in their daily lives. With the
worldwide significant growth of the aging population, it is essential
to have a promising solution of fall detection which is able to operate
at high accuracy in real-time and supports large scale implementation
using multiple cameras. Field Programmable Gate Array (FPGA) is a
highly promising tool to be used as a hardware accelerator in many
emerging embedded vision based system. Thus, it is the main
objective of this paper to present an FPGA-based solution of visual
based fall detection to meet stringent real-time requirements with
high accuracy. The hardware architecture of visual based fall
detection which utilizes the pixel locality to reduce memory accesses
is proposed. By exploiting the parallel and pipeline architecture of
FPGA, our hardware implementation of visual based fall detection
using FGPA is able to achieve a performance of 60fps for a series of
video analytical functions at VGA resolutions (640x480). The results
of this work show that FPGA has great potentials and impacts in
enabling large scale vision system in the future healthcare industry
due to its flexibility and scalability.
Abstract: This paper describes an efficient hardware implementation of a new technique for interfacing the data exchange between the microprocessor-based systems and the external devices. This technique, based on the use of software/hardware system and a reduced physical address, enlarges the interfacing capacity of the microprocessor-based systems, uses the Direct Memory Access (DMA) to increases the frequency of the new bus, and improves the speed of data exchange. While using this architecture in microprocessor-based system or in computer, the input of the hardware part of our system will be connected to the bus system, and the output, which is a new bus, will be connected to an external device. The new bus is composed of a data bus, a control bus and an address bus. A Xilinx Integrated Software Environment (ISE) 7.1i has been used for the programmable logic implementation.
Abstract: In this paper, we present a simple circuit for
Manchester decoding and without using any complicated or
programmable devices. This circuit can decode 90kbps of transmitted
encoded data; however, greater than this transmission rate can be
decoded if high speed devices were used. We also present a new
method for extracting the embedded clock from Manchester data in
order to use it for serial-to-parallel conversion. All of our
experimental measurements have been done using simulation.
Abstract: LDPC codes could be used in magnetic storage devices because of their better decoding performance compared to other error correction codes. However, their hardware implementation results in large and complex decoders. This one of the main obstacles the decoders to be incorporated in magnetic storage devices. We construct small high girth and rate 2 columnweight codes from cage graphs. Though these codes have low performance compared to higher column weight codes, they are easier to implement. The ease of implementation makes them more suitable for applications such as magnetic recording. Cages are the smallest known regular distance graphs, which give us the smallest known column-weight 2 codes given the size, girth and rate of the code.
Abstract: In this paper, a recursive algorithm for the
computation of 2-D DCT using Ramanujan Numbers is proposed.
With this algorithm, the floating-point multiplication is completely
eliminated and hence the multiplierless algorithm can be
implemented using shifts and additions only. The orthogonality of
the recursive kernel is well maintained through matrix factorization
to reduce the computational complexity. The inherent parallel
structure yields simpler programming and hardware implementation
and provides
log 1
2
3
2 N N-N+
additions and
N N
2 log
2 shifts which is
very much less complex when compared to other recent multiplierless
algorithms.
Abstract: In this paper, low end Digital Signal Processors (DSPs)
are applied to accelerate integer neural networks. The use of DSPs
to accelerate neural networks has been a topic of study for some
time, and has demonstrated significant performance improvements.
Recently, work has been done on integer only neural networks, which
greatly reduces hardware requirements, and thus allows for cheaper
hardware implementation. DSPs with Arithmetic Logic Units (ALUs)
that support floating or fixed point arithmetic are generally more
expensive than their integer only counterparts due to increased circuit
complexity. However if the need for floating or fixed point math
operation can be removed, then simpler, lower cost DSPs can be
used. To achieve this, an integer only neural network is created in
this paper, which is then accelerated by using DSP instructions to
improve performance.
Abstract: A new and highly efficient architecture for elliptic curve scalar point multiplication which is optimized for a binary field recommended by NIST and is well-suited for elliptic curve cryptographic (ECC) applications is presented. To achieve the maximum architectural and timing improvements we have reorganized and reordered the critical path of the Lopez-Dahab scalar point multiplication architecture such that logic structures are implemented in parallel and operations in the critical path are diverted to noncritical paths. With G=41, the proposed design is capable of performing a field multiplication over the extension field with degree 163 in 11.92 s with the maximum achievable frequency of 251 MHz on Xilinx Virtex-4 (XC4VLX200) while 22% of the chip area is occupied, where G is the digit size of the underlying digit-serial finite field multiplier.
Abstract: A new genetic algorithm, termed the 'optimum individual monogenetic genetic algorithm' (OIMGA), is presented whose properties have been deliberately designed to be well suited to hardware implementation. Specific design criteria were to ensure fast access to the individuals in the population, to keep the required silicon area for hardware implementation to a minimum and to incorporate flexibility in the structure for the targeting of a range of applications. The first two criteria are met by retaining only the current optimum individual, thereby guaranteeing a small memory requirement that can easily be stored in fast on-chip memory. Also, OIMGA can be easily reconfigured to allow the investigation of problems that normally warrant either large GA populations or individuals many genes in length. Local convergence is achieved in OIMGA by retaining elite individuals, while population diversity is ensured by continually searching for the best individuals in fresh regions of the search space. The results given in this paper demonstrate that both the performance of OIMGA and its convergence time are superior to those of a range of existing hardware GA implementations.
Abstract: A new digital transceiver circuit for asynchronous frame detection is proposed where both the transmitter and receiver contain all digital components, thereby avoiding possible use of conventional devices like monostable multivibrators with unstable external components such as resistances and capacitances. The proposed receiver circuit, in particular, uses a combinational logic block yielding an output which changes its state as soon as the start bit of a new frame is detected. This, in turn, helps in generating an efficient receiver sampling clock. A data latching circuit is also used in the receiver to latch the recovered data bits in any new frame. The proposed receiver structure is also extended from 4- bit information to any general n data bits within a frame with a common expression for the output of the combinational logic block. Performance of the proposed hardware design is evaluated in terms of time delay, reliability and robustness in comparison with the standard schemes using monostable multivibrators. It is observed from hardware implementation that the proposed circuit achieves almost 33 percent speed up over any conventional circuit.