Abstract: With the rapid development of deep learning, neural network and deep learning algorithms play a significant role in various practical applications. Due to the high accuracy and good performance, Convolutional Neural Networks (CNNs) especially have become a research hot spot in the past few years. However, the size of the networks becomes increasingly large scale due to the demands of the practical applications, which poses a significant challenge to construct a high-performance implementation of deep learning neural networks. Meanwhile, many of these application scenarios also have strict requirements on the performance and low-power consumption of hardware devices. Therefore, it is particularly critical to choose a moderate computing platform for hardware acceleration of CNNs. This article aimed to survey the recent advance in Field Programmable Gate Array (FPGA)-based acceleration of CNNs. Various designs and implementations of the accelerator based on FPGA under different devices and network models are overviewed, and the versions of Graphic Processing Units (GPUs), Application Specific Integrated Circuits (ASICs) and Digital Signal Processors (DSPs) are compared to present our own critical analysis and comments. Finally, we give a discussion on different perspectives of these acceleration and optimization methods on FPGA platforms to further explore the opportunities and challenges for future research. More helpfully, we give a prospect for future development of the FPGA-based accelerator.
Abstract: Autonomous driving systems require high reliability to provide people with a safe and comfortable driving experience. However, despite the development of a number of vehicle sensors, it is difficult to always provide high perceived performance in driving environments that vary from time to season. The image segmentation method using deep learning, which has recently evolved rapidly, provides high recognition performance in various road environments stably. However, since the system controls a vehicle in real time, a highly complex deep learning network cannot be used due to time and memory constraints. Moreover, efficient networks are optimized for GPU environments, which degrade performance in embedded processor environments equipped simple hardware accelerators. In this paper, a semantic segmentation network, matrix multiplication accelerator network (MMANet), optimized for matrix multiplication accelerator (MMA) on Texas instrument digital signal processors (TI DSP) is proposed to improve the recognition performance of autonomous driving system. The proposed method is designed to maximize the number of layers that can be performed in a limited time to provide reliable driving environment information in real time. First, the number of channels in the activation map is fixed to fit the structure of MMA. By increasing the number of parallel branches, the lack of information caused by fixing the number of channels is resolved. Second, an efficient convolution is selected depending on the size of the activation. Since MMA is a fixed, it may be more efficient for normal convolution than depthwise separable convolution depending on memory access overhead. Thus, a convolution type is decided according to output stride to increase network depth. In addition, memory access time is minimized by processing operations only in L3 cache. Lastly, reliable contexts are extracted using the extended atrous spatial pyramid pooling (ASPP). The suggested method gets stable features from an extended path by increasing the kernel size and accessing consecutive data. In addition, it consists of two ASPPs to obtain high quality contexts using the restored shape without global average pooling paths since the layer uses MMA as a simple adder. To verify the proposed method, an experiment is conducted using perfsim, a timing simulator, and the Cityscapes validation sets. The proposed network can process an image with 640 x 480 resolution for 6.67 ms, so six cameras can be used to identify the surroundings of the vehicle as 20 frame per second (FPS). In addition, it achieves 73.1% mean intersection over union (mIoU) which is the highest recognition rate among embedded networks on the Cityscapes validation set.
Abstract: Digital signal processor, image signal processor and FIR filters have multipliers as an important part of their design. On the basis of Vedic mathematics, Vedic multipliers have come out to be very fast multipliers. One of the image processing applications is edge detection. This research presents a small area and high speed 8 bit Vedic multiplier system comprising of compressor based adders. This results in faster edge detection. This architecture is tested on Xilinx vertex 4 FPGA board and simulations were carried out using the Xilinx synthesis tool. Comparisons are made and this system is found to be smaller in area with high speed (the lesser propagation delay). This compressor based Vedic multiplier is 1.1 times speedier than a typical Vedic multiplier. Also, this Vedic Multiplier is 2 times speedier than a ‘simple’ multiplier.
Abstract: In this paper, an Infinite Impulse Response (IIR) filter
has been designed and simulated on an Field Programmable Gate
Arrays (FPGA). The implementation is based on Multiply Add and
Accumulate (MAC) algorithm which uses multiply operations for
design implementation. Parallel Pipelined structure is used to
implement the proposed IIR Filter taking optimal advantage of the
look up table of target device. The designed filter has been
synthesized on Digital Signal Processor (DSP) slice based FPGA to
perform multiplier function of MAC unit. The DSP slices are useful
to enhance the speed performance. The proposed design is simulated
with Matlab, synthesized with Xilinx Synthesis Tool, and
implemented on FPGA devices. The Virtex 5 FPGA based design can
operate at an estimated frequency of 81.5 MHz as compared to 40.5
MHz in case of Spartan 3 ADSP based design. The Virtex 5 based
implementation also consumes less slices and slice flip flops of target
FPGA in comparison to Spartan 3 ADSP based implementation to
provide cost effective solution for signal processing applications.
Abstract: An adder is one of the most integral component of a digital system like a digital signal processor or a microprocessor. Being an extremely computationally intensive part of a system, the optimization for speed and power consumption of the adder is of prime importance. In this paper we have designed a 1 bit full adder cell based on dynamic TSPC logic to achieve high speed operation. A high threshold voltage sleep transistor is used to reduce the static power dissipation in standby mode. The circuit is designed and simulated in TSPICE using TSMC 180nm CMOS process. Average power consumption, delay and power-delay product is measured which showed considerable improvement in performance over the existing full adder designs.
Abstract: ''Cocktail party problem'' is well known as one of the human auditory abilities. We can recognize the specific sound that we want to listen by this ability even if a lot of undesirable sounds or noises are mixed. Blind source separation (BSS) based on independent component analysis (ICA) is one of the methods by which we can separate only a special signal from their mixed signals with simple hypothesis. In this paper, we propose an online approach for blind source separation using the sliding DFT and the time domain independent component analysis. The proposed method can reduce calculation complexity in comparison with conventional methods, and can be applied to parallel processing by using digital signal processors (DSPs) and so on. We evaluate this method and show its availability.
Abstract: Full adders are important components in applications
such as digital signal processors (DSP) architectures and
microprocessors. In addition to its main task, which is adding two
numbers, it participates in many other useful operations such as
subtraction, multiplication, division,, address calculation,..etc. In
most of these systems the adder lies in the critical path that
determines the overall speed of the system. So enhancing the
performance of the 1-bit full adder cell (the building block of the
adder) is a significant goal.Demands for the low power VLSI have
been pushing the development of aggressive design methodologies to
reduce the power consumption drastically. To meet the growing
demand, we propose a new low power adder cell by sacrificing the
MOS Transistor count that reduces the serious threshold loss
problem, considerably increases the speed and decreases the power
when compared to the static energy recovery full (SERF) adder. So a
new improved 14T CMOS l-bit full adder cell is presented in this
paper. Results show 50% improvement in threshold loss problem,
45% improvement in speed and considerable power consumption
over the SERF adder and other different types of adders with
comparable performance.
Abstract: This paper describes the design of a voltage based maximum power point tracker (MPPT) for photovoltaic (PV) applications. Of the various MPPT methods, the voltage based method is considered to be the simplest and cost effective. The major disadvantage of this method is that the PV array is disconnected from the load for the sampling of its open circuit voltage, which inevitably results in power loss. Another disadvantage, in case of rapid irradiance variation, is that if the duration between two successive samplings, called the sampling period, is too long there is a considerable loss. This is because the output voltage of the PV array follows the unchanged reference during one sampling period. Once a maximum power point (MPP) is tracked and a change in irradiation occurs between two successive samplings, then the new MPP is not tracked until the next sampling of the PV array voltage. This paper proposes an MPPT circuit in which the sampling interval of the PV array voltage, and the sampling period have been shortened. The sample and hold circuit has also been simplified. The proposed circuit does not utilize a microcontroller or a digital signal processor and is thus suitable for low cost and low power applications.
Abstract: Intelligent traffic surveillance technology is an issue in
the field of traffic data analysis. Therefore, we need the technology to
detect moving objects in real-time while there are variations in background and natural light. In this paper, we proposed a Weighted-Center Surround Difference
method for object detection in outdoor environments. The proposed system detects objects using the saliency map that is obtained by
analyzing the weight of each layers of Gaussian pyramid. In order to validate the effectiveness of our system, we implemented the proposed
method using a digital signal processor, TMS320DM6437.
Experimental results show that blurred noisy around objects was effectively eliminated and the object detection accuracy is improved.
Abstract: This paper presents a new method for the
implementation of a direct rotor flux control (DRFOC) of induction
motor (IM) drives. It is based on the rotor flux components
regulation. The d and q axis rotor flux components feed proportional
integral (PI) controllers. The outputs of which are the target stator
voltages (vdsref and vqsref). While, the synchronous speed is depicted at
the output of rotor speed controller. In order to accomplish variable
speed operation, conventional PI like controller is commonly used.
These controllers provide limited good performances over a wide
range of operations even under ideal field oriented conditions. An
alternate approach is to use the so called fuzzy logic controller. The
overall investigated system is implemented using dSpace system
based on digital signal processor (DSP). Simulation and experimental
results have been presented for a one kw IM drives to confirm the
validity of the proposed algorithms.
Abstract: With the advent of inexpensive 32 bit floating point digital signal processor-s availability in market, many computationally intensive algorithms such as Kalman filter becomes feasible to implement in real time. Dynamic simulation of a self excited DC motor using second order state variable model and implementation of Kalman Filter in a floating point DSP TMS320C6713 is presented in this paper with an objective to introduce and implement such an algorithm, for beginners. A fractional hp DC motor is simulated in both Matlab® and DSP and the results are included. A step by step approach for simulation of DC motor in Matlab® and “C" routines in CC Studio® is also given. CC studio® project file details and environmental setting requirements are addressed. This tutorial can be used with 6713 DSK, which is based on floating point DSP and CC Studio either in hardware mode or in simulation mode.
Abstract: An efficient parallel form in digital signal processor can improve the algorithm performance. The butterfly structure is an important role in fast Fourier transform (FFT), because its symmetry form is suitable for hardware implementation. Although it can perform a symmetric structure, the performance will be reduced under the data-dependent flow characteristic. Even though recent research which call as novel memory reference reduction methods (NMRRM) for FFT focus on reduce memory reference in twiddle factor, the data-dependent property still exists. In this paper, we propose a parallel-computing approach for FFT implementation on digital signal processor (DSP) which is based on data-independent property and still hold the property of low-memory reference. The proposed method combines final two steps in NMRRM FFT to perform a novel data-independent structure, besides it is very suitable for multi-operation-unit digital signal processor and dual-core system. We have applied the proposed method of radix-2 FFT algorithm in low memory reference on TI TMSC320C64x DSP. Experimental results show the method can reduce 33.8% clock cycles comparing with the NMRRM FFT implementation and keep the low-memory reference property.
Abstract: In order to protect original data, watermarking is first consideration direction for digital information copyright. In addition, to achieve high quality image, the algorithm maybe can not run on embedded system because the computation is very complexity. However, almost nowadays algorithms need to build on consumer production because integrator circuit has a huge progress and cheap price. In this paper, we propose a novel algorithm which efficient inserts watermarking on digital image and very easy to implement on digital signal processor. In further, we select a general and cheap digital signal processor which is made by analog device company to fit consumer application. The experimental results show that the image quality by watermarking insertion can achieve 46 dB can be accepted in human vision and can real-time execute on digital signal processor.
Abstract: Unlike general-purpose processors, digital signal
processors (DSP processors) are strongly application-dependent. To
meet the needs for diverse applications, a wide variety of DSP
processors based on different architectures ranging from the
traditional to VLIW have been introduced to the market over the
years. The functionality, performance, and cost of these processors
vary over a wide range. In order to select a processor that meets the
design criteria for an application, processor performance is usually
the major concern for digital signal processing (DSP) application
developers. Performance data are also essential for the designers of
DSP processors to improve their design. Consequently, several DSP
performance benchmarks have been proposed over the past decade or
so. However, none of these benchmarks seem to have included recent
new DSP applications.
In this paper, we use a new benchmark that we recently developed
to compare the performance of popular DSP processors from Texas
Instruments and StarCore. The new benchmark is based on the
Selectable Mode Vocoder (SMV), a speech-coding program from the
recent third generation (3G) wireless voice applications. All
benchmark kernels are compiled by the compilers of the respective
DSP processors and run on their simulators. Weighted arithmetic
mean of clock cycles and arithmetic mean of code size are used to
compare the performance of five DSP processors.
In addition, we studied how the performance of a processor is
affected by code structure, features of processor architecture and
optimization of compiler. The extensive experimental data gathered,
analyzed, and presented in this paper should be helpful for DSP
processor and compiler designers to meet their specific design goals.
Abstract: In this paper, low end Digital Signal Processors (DSPs)
are applied to accelerate integer neural networks. The use of DSPs
to accelerate neural networks has been a topic of study for some
time, and has demonstrated significant performance improvements.
Recently, work has been done on integer only neural networks, which
greatly reduces hardware requirements, and thus allows for cheaper
hardware implementation. DSPs with Arithmetic Logic Units (ALUs)
that support floating or fixed point arithmetic are generally more
expensive than their integer only counterparts due to increased circuit
complexity. However if the need for floating or fixed point math
operation can be removed, then simpler, lower cost DSPs can be
used. To achieve this, an integer only neural network is created in
this paper, which is then accelerated by using DSP instructions to
improve performance.
Abstract: We present a simplified equalization technique for a
π/4 differential quadrature phase shift keying ( π/4 -DQPSK) modulated
signal in a multipath fading environment. The proposed equalizer is
realized as a fractionally spaced adaptive decision feedback equalizer
(FS-ADFE), employing exponential step-size least mean square
(LMS) algorithm as the adaptation technique. The main advantage of
the scheme stems from the usage of exponential step-size LMS algorithm
in the equalizer, which achieves similar convergence behavior
as that of a recursive least squares (RLS) algorithm with significantly
reduced computational complexity. To investigate the finite-precision
performance of the proposed equalizer along with the π/4 -DQPSK
modem, the entire system is evaluated on a 16-bit fixed point digital
signal processor (DSP) environment. The proposed scheme is found
to be attractive even for those cases where equalization is to be
performed within a restricted number of training samples.