Abstract: With the rapid development of deep learning, neural network and deep learning algorithms play a significant role in various practical applications. Due to the high accuracy and good performance, Convolutional Neural Networks (CNNs) especially have become a research hot spot in the past few years. However, the size of the networks becomes increasingly large scale due to the demands of the practical applications, which poses a significant challenge to construct a high-performance implementation of deep learning neural networks. Meanwhile, many of these application scenarios also have strict requirements on the performance and low-power consumption of hardware devices. Therefore, it is particularly critical to choose a moderate computing platform for hardware acceleration of CNNs. This article aimed to survey the recent advance in Field Programmable Gate Array (FPGA)-based acceleration of CNNs. Various designs and implementations of the accelerator based on FPGA under different devices and network models are overviewed, and the versions of Graphic Processing Units (GPUs), Application Specific Integrated Circuits (ASICs) and Digital Signal Processors (DSPs) are compared to present our own critical analysis and comments. Finally, we give a discussion on different perspectives of these acceleration and optimization methods on FPGA platforms to further explore the opportunities and challenges for future research. More helpfully, we give a prospect for future development of the FPGA-based accelerator.
Abstract: Autonomous driving systems require high reliability to provide people with a safe and comfortable driving experience. However, despite the development of a number of vehicle sensors, it is difficult to always provide high perceived performance in driving environments that vary from time to season. The image segmentation method using deep learning, which has recently evolved rapidly, provides high recognition performance in various road environments stably. However, since the system controls a vehicle in real time, a highly complex deep learning network cannot be used due to time and memory constraints. Moreover, efficient networks are optimized for GPU environments, which degrade performance in embedded processor environments equipped simple hardware accelerators. In this paper, a semantic segmentation network, matrix multiplication accelerator network (MMANet), optimized for matrix multiplication accelerator (MMA) on Texas instrument digital signal processors (TI DSP) is proposed to improve the recognition performance of autonomous driving system. The proposed method is designed to maximize the number of layers that can be performed in a limited time to provide reliable driving environment information in real time. First, the number of channels in the activation map is fixed to fit the structure of MMA. By increasing the number of parallel branches, the lack of information caused by fixing the number of channels is resolved. Second, an efficient convolution is selected depending on the size of the activation. Since MMA is a fixed, it may be more efficient for normal convolution than depthwise separable convolution depending on memory access overhead. Thus, a convolution type is decided according to output stride to increase network depth. In addition, memory access time is minimized by processing operations only in L3 cache. Lastly, reliable contexts are extracted using the extended atrous spatial pyramid pooling (ASPP). The suggested method gets stable features from an extended path by increasing the kernel size and accessing consecutive data. In addition, it consists of two ASPPs to obtain high quality contexts using the restored shape without global average pooling paths since the layer uses MMA as a simple adder. To verify the proposed method, an experiment is conducted using perfsim, a timing simulator, and the Cityscapes validation sets. The proposed network can process an image with 640 x 480 resolution for 6.67 ms, so six cameras can be used to identify the surroundings of the vehicle as 20 frame per second (FPS). In addition, it achieves 73.1% mean intersection over union (mIoU) which is the highest recognition rate among embedded networks on the Cityscapes validation set.
Abstract: In today’s scenario, the complexity of digital signal processing (DSP) applications and various microcontroller architectures have been increasing to such an extent that the traditional approaches to multiplier design in most processors are becoming outdated for being comparatively slow. Modern processing applications require suitable pipelined approaches, and therefore, algorithms that are friendlier with pipelined architectures. Traditional algorithms like Wallace Tree, Radix-4 Booth, Radix-8 Booth, Dadda architectures have been proven to be comparatively slow for pipelined architectures. These architectures, therefore, need to be optimized or combined with other architectures amongst them to enhance its performances and to be made suitable for pipelined hardware/architectures. Recently, Vedic algorithm mathematically has proven to be efficient by appearing to be less complex and with fewer steps for its output establishment and have assumed renewed importance. This paper describes and shows how the Vedic algorithm can be better suited for pipelined architectures and also can be combined with traditional architectures and algorithms for enhancing its ability even further. In this paper, we also established that for complex applications on DSP and other microcontroller architectures, using Vedic approach for multiplication proves to be the best available and efficient option.
Abstract: Digital signal processor, image signal processor and FIR filters have multipliers as an important part of their design. On the basis of Vedic mathematics, Vedic multipliers have come out to be very fast multipliers. One of the image processing applications is edge detection. This research presents a small area and high speed 8 bit Vedic multiplier system comprising of compressor based adders. This results in faster edge detection. This architecture is tested on Xilinx vertex 4 FPGA board and simulations were carried out using the Xilinx synthesis tool. Comparisons are made and this system is found to be smaller in area with high speed (the lesser propagation delay). This compressor based Vedic multiplier is 1.1 times speedier than a typical Vedic multiplier. Also, this Vedic Multiplier is 2 times speedier than a ‘simple’ multiplier.
Abstract: In this paper, a design of H.263 based wireless video
transceiver is presented for wireless camera system. It uses standard
WIFI transceiver and the covering area is up to 100m. Furthermore the
standard H.263 video encoding technique is used for video
compression since wireless video transmitter is unable to transmit high
capacity raw data in real time and the implemented system is capable
of streaming at speed of less than 1Mbps using NTSC 720x480 video.
Abstract: In this paper, an Infinite Impulse Response (IIR) filter
has been designed and simulated on an Field Programmable Gate
Arrays (FPGA). The implementation is based on Multiply Add and
Accumulate (MAC) algorithm which uses multiply operations for
design implementation. Parallel Pipelined structure is used to
implement the proposed IIR Filter taking optimal advantage of the
look up table of target device. The designed filter has been
synthesized on Digital Signal Processor (DSP) slice based FPGA to
perform multiplier function of MAC unit. The DSP slices are useful
to enhance the speed performance. The proposed design is simulated
with Matlab, synthesized with Xilinx Synthesis Tool, and
implemented on FPGA devices. The Virtex 5 FPGA based design can
operate at an estimated frequency of 81.5 MHz as compared to 40.5
MHz in case of Spartan 3 ADSP based design. The Virtex 5 based
implementation also consumes less slices and slice flip flops of target
FPGA in comparison to Spartan 3 ADSP based implementation to
provide cost effective solution for signal processing applications.
Abstract: This article is to review and understand the new
generation of students to understand their expectations and attitudes.
There are a group of students on school projects, creative work,
educational software and digital signal source, the use of social
networking tools to communicate with friends and a part in the
competition. Today's students have been described as the new
millennium students. They use information and communication
technology in a more creative and innovative at home than at school,
because the information and communication technologies for
different purposes, in the home, usually occur in school. They
collaborate and communicate more effectively when they are at
home. Most children enter school, they will bring about how to use
information and communication technologies, some basic skills and
some tips on how to use information and communication technology
will provide a more advanced than most of the school's expectations.
Many teachers can help students, however, still a lot of work,
"tradition", without a computer, and did not see the "new social
computing networks describe young people to learn and new ways of
working life in the future", in the education system of the benefits of
using a computer.
Abstract: An adder is one of the most integral component of a digital system like a digital signal processor or a microprocessor. Being an extremely computationally intensive part of a system, the optimization for speed and power consumption of the adder is of prime importance. In this paper we have designed a 1 bit full adder cell based on dynamic TSPC logic to achieve high speed operation. A high threshold voltage sleep transistor is used to reduce the static power dissipation in standby mode. The circuit is designed and simulated in TSPICE using TSMC 180nm CMOS process. Average power consumption, delay and power-delay product is measured which showed considerable improvement in performance over the existing full adder designs.
Abstract: The Gram-Schmidt Process (GSP) is used to convert a non-orthogonal basis (a set of linearly independent vectors) into an orthonormal basis (a set of orthogonal, unit-length vectors). The process consists of taking each vector and then subtracting the
elements in common with the previous vectors. This paper introduces an Enhanced version of the Gram-Schmidt Process (EGSP) with inverse, which is useful for signal and image processing applications.
Abstract: This paper describes an automated implementable
system for impulsive signals detection and recognition. The system
uses a Digital Signal Processing device for the detection and
identification process. Here the system analyses the signals in real
time in order to produce a particular response if needed. The system
analyses the signals in real time in order to produce a specific output
if needed. Detection is achieved through normalizing the inputs and
comparing the read signals to a dynamic threshold and thus avoiding
detections linked to loud or fluctuating environing noise.
Identification is done through neuronal network algorithms. As a
setup our system can receive signals to “learn” certain patterns.
Through “learning” the system can recognize signals faster, inducing
flexibility to new patterns similar to those known. Sound is captured
through a simple jack input, and could be changed for an enhanced
recording surface such as a wide-area recorder. Furthermore a
communication module can be added to the apparatus to send alerts
to another interface if needed.
Abstract: This study is concerned with pH solution detection
using 2 × 4 flexible sensor array based on a plastic polyethylene
terephthalate (PET) substrate that is coated a conductive layer and a
ruthenium dioxide (RuO2) sensitive membrane with the technologies
of screen-printing and RF sputtering. For data analysis, we also
prepared a dynamic measurement system for acquiring the response
voltage and analyzing the characteristics of the working electrodes
(WEs), such as sensitivity and linearity. In this condition, an array
measurement system was designed to acquire the original signal from
sensor array, and it is based on the method of digital signal processing
(DSP). The DSP modifies the unstable acquisition data to a direct
current (DC) output using the technique of digital filter. Hence, this
sensor array can obtain a satisfactory yield, 62.5%, through the design
measurement and analysis system in our laboratory.
Abstract: Efficient modulo 2n+1 adders are important for
several applications including residue number system, digital signal
processors and cryptography algorithms. In this paper we present a
novel modulo 2n+1 addition algorithm for a recently represented
number system. The proposed approach is introduced for the
reduction of the power dissipated. In a conventional modulo 2n+1
adder, all operands have (n+1)-bit length. To avoid using (n+1)-bit
circuits, the diminished-1 and carry save diminished-1 number
systems can be effectively used in applications. In the paper, we also
derive two new architectures for designing modulo 2n+1 adder, based
on n-bit ripple-carry adder. The first architecture is a faster design
whereas the second one uses less hardware. In the proposed method,
the special treatment required for zero operands in Diminished-1
number system is removed. In the fastest modulo 2n+1 adders in
normal binary system, there are 3-operand adders. This problem is
also resolved in this paper. The proposed architectures are compared
with some efficient adders based on ripple-carry adder and highspeed
adder. It is shown that the hardware overhead and power
consumption will be reduced. As well as power reduction, in some
cases, power-delay product will be also reduced.
Abstract: In this paper, a design methodology to implement low-power and high-speed 2nd order recursive digital Infinite Impulse Response (IIR) filter has been proposed. Since IIR filters suffer from a large number of constant multiplications, the proposed method replaces the constant multiplications by using addition/subtraction and shift operations. The proposed new 6T adder cell is used as the Carry-Save Adder (CSA) to implement addition/subtraction operations in the design of recursive section IIR filter to reduce the propagation delay. Furthermore, high-level algorithms designed for the optimization of the number of CSA blocks are used to reduce the complexity of the IIR filter. The DSCH3 tool is used to generate the schematic of the proposed 6T CSA based shift-adds architecture design and it is analyzed by using Microwind CAD tool to synthesize low-complexity and high-speed IIR filters. The proposed design outperforms in terms of power, propagation delay, area and throughput when compared with MUX-12T, MCIT-7T based CSA adder filter design. It is observed from the experimental results that the proposed 6T based design method can find better IIR filter designs in terms of power and delay than those obtained by using efficient general multipliers.
Abstract: ''Cocktail party problem'' is well known as one of the human auditory abilities. We can recognize the specific sound that we want to listen by this ability even if a lot of undesirable sounds or noises are mixed. Blind source separation (BSS) based on independent component analysis (ICA) is one of the methods by which we can separate only a special signal from their mixed signals with simple hypothesis. In this paper, we propose an online approach for blind source separation using the sliding DFT and the time domain independent component analysis. The proposed method can reduce calculation complexity in comparison with conventional methods, and can be applied to parallel processing by using digital signal processors (DSPs) and so on. We evaluate this method and show its availability.
Abstract: The Deoxyribonucleic Acid (DNA) which is a doublestranded helix of nucleotides consists of: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). In this work, we convert this genetic code into an equivalent digital signal representation. Applying a wavelet transform, such as Haar wavelet, we will be able to extract details that are not so clear in the original genetic code. We compare between different organisms using the results of the Haar wavelet Transform. This is achieved by using the trend part of the signal since the trend part bears the most energy of the digital signal representation. Consequently, we will be able to quantitatively reconstruct different biological families.
Abstract: This paper study the high-level modelling and design
of delta-sigma (ΔΣ) noise shapers for audio Digital-to-Analog
Converter (DAC) so as to eliminate the in-band Signal-to-Noise-
Ratio (SNR) degradation that accompany one channel mismatch in
audio signal. The converter combines a cascaded digital signal
interpolation, a noise-shaping single loop delta-sigma modulator with
a 5-bit quantizer resolution in the final stage. To reduce sensitivity of
Digital-to-Analog Converter (DAC) nonlinearities of the last stage, a
high pass second order Data Weighted Averaging (R2DWA) is
introduced. This paper presents a MATLAB description modelling
approach of the proposed DAC architecture with low distortion and
swing suppression integrator designs. The ΔΣ Modulator design can
be configured as a 3rd-order and allows 24-bit PCM at sampling rate
of 64 kHz for Digital Video Disc (DVD) audio application. The
modeling approach provides 139.38 dB of dynamic range for a 32
kHz signal band at -1.6 dBFS input signal level.
Abstract: Identifying protein coding regions in DNA sequences is a basic step in the location of genes. Several approaches based on signal processing tools have been applied to solve this problem, trying to achieve more accurate predictions. This paper presents a new predictor that improves the efficacy of three techniques that use the Fourier Transform to predict coding regions, and that could be computed using an algorithm that reduces the computation load. Some ideas about the combination of the predictor with other methods are discussed. ROC curves are used to demonstrate the efficacy of the proposed predictor, based on the computation of 25 DNA sequences from three different organisms.
Abstract: Time interleaved sigma-delta (TIΣΔ) architecture is a
potential candidate for high bandwidth analog to digital converters
(ADC) which remains a bottleneck for software and cognitive radio
receivers. However, the performance of the TIΣΔ architecture is
limited by the unavoidable gain and offset mismatches resulting
from the manufacturing process. This paper presents a novel digital
calibration method to compensate the gain and offset mismatch
effect. The proposed method takes advantage of the reconstruction
digital signal processing on each channel and requires only few logic
components for implementation. The run time calibration is estimated
to 10 and 15 clock cycles for offset cancellation and gain mismatch
calibration respectively.
Abstract: In this paper, we propose a new architecture for the implementation of the N-point Fast Fourier Transform (FFT), based on the Radix-2 Decimation in Frequency algorithm. This architecture is based on a pipeline circuit that can process a stream of samples and produce two FFT transform samples every clock cycle. Compared to existing implementations the architecture proposed achieves double processing speed using the same circuit complexity.
Abstract: Full adders are important components in applications
such as digital signal processors (DSP) architectures and
microprocessors. In addition to its main task, which is adding two
numbers, it participates in many other useful operations such as
subtraction, multiplication, division,, address calculation,..etc. In
most of these systems the adder lies in the critical path that
determines the overall speed of the system. So enhancing the
performance of the 1-bit full adder cell (the building block of the
adder) is a significant goal.Demands for the low power VLSI have
been pushing the development of aggressive design methodologies to
reduce the power consumption drastically. To meet the growing
demand, we propose a new low power adder cell by sacrificing the
MOS Transistor count that reduces the serious threshold loss
problem, considerably increases the speed and decreases the power
when compared to the static energy recovery full (SERF) adder. So a
new improved 14T CMOS l-bit full adder cell is presented in this
paper. Results show 50% improvement in threshold loss problem,
45% improvement in speed and considerable power consumption
over the SERF adder and other different types of adders with
comparable performance.