A Survey of Field Programmable Gate Array-Based Convolutional Neural Network Accelerators

With the rapid development of deep learning, neural networks and deep learning algorithms play a significant role in various practical applications. Owing to their high accuracy and good performance, Convolutional Neural Networks (CNNs) in particular have become a research hot spot in the past few years. However, network sizes keep growing to meet the demands of practical applications, which makes it a significant challenge to construct high-performance implementations of deep neural networks. Meanwhile, many of these application scenarios also impose strict requirements on the performance and power consumption of hardware devices. Therefore, it is particularly critical to choose a suitable computing platform for hardware acceleration of CNNs. This article surveys recent advances in Field Programmable Gate Array (FPGA)-based acceleration of CNNs. Various designs and implementations of FPGA-based accelerators across different devices and network models are reviewed and compared with Graphics Processing Unit (GPU), Application Specific Integrated Circuit (ASIC), and Digital Signal Processor (DSP) implementations, accompanied by our own critical analysis and comments. Finally, we discuss these acceleration and optimization methods on FPGA platforms from different perspectives to further explore the opportunities and challenges for future research, and we give an outlook on the future development of FPGA-based accelerators.

Embedded Semantic Segmentation Network Optimized for Matrix Multiplication Accelerator

Autonomous driving systems require high reliability to provide people with a safe and comfortable driving experience. However, despite the development of a number of vehicle sensors, it is difficult to always provide high perception performance in driving environments that vary with time and season. Image segmentation using deep learning, which has evolved rapidly in recent years, provides stable, high recognition performance in various road environments. However, since the system controls a vehicle in real time, a highly complex deep learning network cannot be used due to time and memory constraints. Moreover, efficient networks are optimized for GPU environments, so their performance degrades in embedded processor environments equipped with simple hardware accelerators. In this paper, a semantic segmentation network, the matrix multiplication accelerator network (MMANet), optimized for the matrix multiplication accelerator (MMA) on Texas Instruments digital signal processors (TI DSPs), is proposed to improve the recognition performance of autonomous driving systems. The proposed method is designed to maximize the number of layers that can be executed in a limited time, so as to provide reliable driving environment information in real time. First, the number of channels in the activation map is fixed to fit the structure of the MMA. The loss of information caused by fixing the number of channels is compensated by increasing the number of parallel branches. Second, an efficient convolution is selected depending on the size of the activation. Since the MMA has a fixed structure, normal convolution may be more efficient than depthwise separable convolution, depending on memory access overhead. Thus, the convolution type is decided according to the output stride in order to increase network depth. In addition, memory access time is minimized by processing operations only in the L3 cache. Lastly, reliable contexts are extracted using an extended atrous spatial pyramid pooling (ASPP). The suggested method obtains stable features from an extended path by increasing the kernel size and accessing consecutive data. In addition, it consists of two ASPPs to obtain high-quality contexts using the restored shape, without global average pooling paths, since the layer uses the MMA as a simple adder. To verify the proposed method, an experiment is conducted using perfsim, a timing simulator, and the Cityscapes validation set. The proposed network can process an image with 640 x 480 resolution in 6.67 ms, so six cameras can be used to identify the surroundings of the vehicle at 20 frames per second (FPS). In addition, it achieves 73.1% mean intersection over union (mIoU), which is the highest recognition rate among embedded networks on the Cityscapes validation set.

Analytical Comparison of Conventional Algorithms with Vedic Algorithm for Digital Multiplier

Digital signal processing (DSP) applications and microcontroller architectures have grown so complex that the traditional approaches to multiplier design in most processors are becoming outdated because they are comparatively slow. Modern processing applications require suitable pipelined approaches and, therefore, algorithms that are friendlier to pipelined architectures. Traditional architectures such as the Wallace tree, Radix-4 Booth, Radix-8 Booth, and Dadda multipliers have proven to be comparatively slow in pipelined designs. These architectures therefore need to be optimized or combined with one another to enhance their performance and make them suitable for pipelined hardware. Recently, the Vedic algorithm has been shown mathematically to be efficient, being less complex and requiring fewer steps to produce its output, and it has assumed renewed importance. This paper describes and shows how the Vedic algorithm is better suited to pipelined architectures and how it can be combined with traditional architectures and algorithms to enhance its performance even further. We also establish that, for complex applications on DSP and other microcontroller architectures, the Vedic approach to multiplication proves to be the most efficient available option.

An Efficient Implementation of High Speed Vedic Multiplier Using Compressors for Image Processing Applications

Digital signal processors, image signal processors, and FIR filters have multipliers as an important part of their design. Vedic multipliers, based on Vedic mathematics, have turned out to be very fast multipliers. One image processing application is edge detection. This research presents a small-area, high-speed 8-bit Vedic multiplier system comprising compressor-based adders, which results in faster edge detection. The architecture is tested on a Xilinx Virtex-4 FPGA board, and simulations were carried out using the Xilinx synthesis tool. Comparisons show that this system is smaller in area and faster (lower propagation delay). The compressor-based Vedic multiplier is 1.1 times faster than a typical Vedic multiplier and 2 times faster than a ‘simple’ multiplier.
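
The abstract does not spell out the multiplication scheme. As a minimal sketch, assuming the commonly used "vertically and crosswise" (Urdhva Tiryagbhyam) decomposition, an 8-bit product can be built from four half-width products in the same way hardware Vedic multipliers are usually composed from smaller blocks. The function name `vedic_multiply` and the recursive software structure are illustrative assumptions, not the paper's compressor-based design.

```python
def vedic_multiply(a: int, b: int, bits: int = 8) -> int:
    """Multiply two unsigned `bits`-wide integers by splitting each operand
    into halves and combining vertical and crosswise partial products.

    Assumption: `bits` is a power of two and both operands fit in `bits` bits.
    """
    if bits < 1 or bits & (bits - 1):
        raise ValueError("bits must be a power of two")
    if not (0 <= a < (1 << bits) and 0 <= b < (1 << bits)):
        raise ValueError("operands must fit in the given width")
    if bits == 1:
        return a & b  # a 1x1 multiplier is a single AND gate

    half = bits // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half

    # Vertical products (same-position halves).
    low = vedic_multiply(a_lo, b_lo, half)
    high = vedic_multiply(a_hi, b_hi, half)
    # Crosswise products (opposite halves), summed before alignment.
    cross = vedic_multiply(a_hi, b_lo, half) + vedic_multiply(a_lo, b_hi, half)

    # In hardware, these additions are where adders (or compressors) sit.
    return low + (cross << half) + (high << bits)
```

For example, `vedic_multiply(13, 11, bits=4)` returns 143. The three additions on the last line are the part that a compressor-based adder tree would accelerate in a hardware implementation.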

H.263 Based Video Transceiver for Wireless Camera System

In this paper, the design of an H.263-based wireless video transceiver for a wireless camera system is presented. It uses a standard Wi-Fi transceiver, and the coverage range is up to 100 m. The standard H.263 video encoding technique is used for video compression, since the wireless video transmitter is unable to transmit high-volume raw data in real time. The implemented system is capable of streaming NTSC 720x480 video at less than 1 Mbps.

Field Programmable Gate Array Based Infinite Impulse Response Filter Using Multipliers

In this paper, an Infinite Impulse Response (IIR) filter is designed and simulated on a Field Programmable Gate Array (FPGA). The implementation is based on the Multiply, Add and Accumulate (MAC) algorithm, which uses multiply operations for the design implementation. A parallel pipelined structure is used to implement the proposed IIR filter, taking optimal advantage of the look-up tables of the target device. The designed filter has been synthesized on a Digital Signal Processor (DSP) slice based FPGA, with the DSP slices performing the multiplier function of the MAC unit and enhancing speed. The proposed design is simulated with Matlab, synthesized with the Xilinx Synthesis Tool, and implemented on FPGA devices. The Virtex-5 FPGA based design can operate at an estimated frequency of 81.5 MHz, compared with 40.5 MHz for the Spartan-3A DSP based design. The Virtex-5 based implementation also consumes fewer slices and slice flip-flops of the target FPGA than the Spartan-3A DSP based implementation, providing a cost-effective solution for signal processing applications.
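
The multiply-accumulate formulation can be illustrated in software with the standard IIR difference equation, where every output sample is one chain of MAC operations. This is a behavioural sketch only: the function name `iir_mac` and the coefficient lists `b` (feedforward) and `a` (feedback) are assumptions for illustration, and the abstract does not give the filter order or coefficients.

```python
def iir_mac(b, a, x):
    """Apply an IIR filter to the sequence x using explicit MAC steps.

    b: feedforward coefficients, a: feedback coefficients with a[0] != 0.
    Computes a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k].
    """
    if not a or a[0] == 0:
        raise ValueError("a[0] must be non-zero")
    y = []
    for n in range(len(x)):
        acc = 0.0  # plays the role of the MAC accumulator
        for k in range(len(b)):
            if n - k >= 0:
                acc += b[k] * x[n - k]   # feedforward multiply-accumulate
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]   # feedback multiply-accumulate
        y.append(acc / a[0])
    return y
```

In an FPGA, the inner loops would be unrolled into parallel multipliers (for instance DSP slices) with pipeline registers between stages, rather than executed sequentially as here.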

ICT Education: Digital History Learners

This article reviews the new generation of students in order to understand their expectations and attitudes. These students draw on digital resources for school projects, creative work, and educational software, and they use social networking tools to communicate with friends and to take part in competitions. Today's students have been described as new millennium learners. They use information and communication technology (ICT) more creatively and innovatively at home than at school, because ICT serves different purposes at home than those that usually occur in school. They also collaborate and communicate more effectively when they are at home. Most children enter school already bringing some basic ICT skills, and some bring skills more advanced than most schools expect. Many teachers can help students; however, many still work in "traditional" ways, without a computer, and do not see the benefits of using computers in the education system or how new social computing networks shape the ways young people learn and will work in the future.

Design and Analysis of a Low Power High Speed 1 Bit Full Adder Cell Based On TSPC Logic with Multi-Threshold CMOS

An adder is one of the most integral components of a digital system such as a digital signal processor or a microprocessor. Because it is an extremely computationally intensive part of a system, optimizing the adder for speed and power consumption is of prime importance. In this paper, we design a 1-bit full adder cell based on dynamic TSPC logic to achieve high-speed operation. A high-threshold-voltage sleep transistor is used to reduce static power dissipation in standby mode. The circuit is designed and simulated in TSPICE using a TSMC 180 nm CMOS process. Average power consumption, delay, and power-delay product are measured and show considerable improvement in performance over existing full adder designs.

Enhanced Gram-Schmidt Process for Improving the Stability in Signal and Image Processing

The Gram-Schmidt Process (GSP) is used to convert a non-orthogonal basis (a set of linearly independent vectors) into an orthonormal basis (a set of mutually orthogonal, unit-length vectors). The process consists of taking each vector in turn and subtracting its components along the previously processed vectors. This paper introduces an Enhanced version of the Gram-Schmidt Process (EGSP) with inverse, which is useful for signal and image processing applications.
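
The baseline process described above can be sketched as follows. Note that this is the textbook procedure in its "modified" ordering (each projection is taken against the running residual, which is the usual way to improve numerical stability), not the paper's EGSP, whose details are not given in the abstract. The function name `gram_schmidt` and the tolerance `tol` are illustrative assumptions.

```python
import math


def gram_schmidt(vectors, tol=1e-12):
    """Return an orthonormal basis spanning the same space as `vectors`.

    Uses the modified Gram-Schmidt ordering: each projection is computed
    from the partially reduced vector rather than from the original one.
    Raises ValueError if the inputs are (numerically) linearly dependent.
    """
    basis = []
    for v in vectors:
        w = [float(c) for c in v]
        for q in basis:
            # Component of the current residual along the basis vector q.
            coeff = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm < tol:
            raise ValueError("vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis
```

Calling `gram_schmidt([[1, 1, 0], [1, 0, 1], [0, 1, 1]])` yields three vectors whose pairwise dot products are 0 and whose norms are 1, up to floating-point rounding.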

Low Cost Real Time Robust Identification of Impulsive Signals

This paper describes an automated, implementable system for the detection and recognition of impulsive signals. The system uses a digital signal processing device for the detection and identification process, and it analyses the signals in real time in order to produce a specific response if needed. Detection is achieved by normalizing the inputs and comparing the acquired signals to a dynamic threshold, thus avoiding false detections caused by loud or fluctuating ambient noise. Identification is done through neural network algorithms. During setup, our system can receive signals to “learn” certain patterns. Through “learning”, the system can recognize signals faster and gains flexibility toward new patterns similar to those already known. Sound is captured through a simple jack input, which could be replaced by an enhanced recording surface such as a wide-area recorder. Furthermore, a communication module can be added to the apparatus to send alerts to another interface if needed.

Flexible Sensor Array with Programmable Measurement System

This study is concerned with pH detection in solution using a 2 × 4 flexible sensor array based on a plastic polyethylene terephthalate (PET) substrate coated with a conductive layer and a ruthenium dioxide (RuO2) sensing membrane, fabricated by screen printing and RF sputtering. For data analysis, we also prepared a dynamic measurement system for acquiring the response voltage and analyzing characteristics of the working electrodes (WEs) such as sensitivity and linearity. An array measurement system based on digital signal processing (DSP) was designed to acquire the raw signal from the sensor array. The DSP converts the unstable acquired data to a direct current (DC) output using a digital filter. With the measurement and analysis system designed in our laboratory, the sensor array achieves a satisfactory yield of 62.5%.

Improved Modulo 2^n + 1 Adder Design

Efficient modulo 2^n + 1 adders are important for several applications, including residue number systems, digital signal processors, and cryptographic algorithms. In this paper, we present a novel modulo 2^n + 1 addition algorithm for a recently introduced number representation. The proposed approach aims to reduce the power dissipated. In a conventional modulo 2^n + 1 adder, all operands are (n+1) bits long. To avoid using (n+1)-bit circuits, the diminished-1 and carry-save diminished-1 number systems can be used effectively in applications. We also derive two new architectures for designing modulo 2^n + 1 adders, based on an n-bit ripple-carry adder. The first architecture is the faster design, whereas the second uses less hardware. In the proposed method, the special treatment required for zero operands in the diminished-1 number system is removed. The fastest modulo 2^n + 1 adders in the normal binary system require 3-operand adders; this problem is also resolved in this paper. The proposed architectures are compared with some efficient adders based on ripple-carry and high-speed adders. It is shown that the hardware overhead and power consumption are reduced. In addition to the power reduction, in some cases the power-delay product is also reduced.
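
For context, the conventional diminished-1 addition that the abstract refers to can be sketched as below. A non-zero value A in 1..2^n is stored as A - 1 in n bits, and the sum is obtained from an n-bit addition with an inverted end-around carry. This is a behavioural model of the conventional scheme under those assumptions, not the paper's new architectures; the function name `diminished1_add` and the explicit `is_zero` flag are illustrative.

```python
def diminished1_add(a_d1: int, b_d1: int, n: int):
    """Add two non-zero operands modulo 2**n + 1 in diminished-1 form.

    a_d1 and b_d1 are the n-bit diminished-1 codes (value - 1) of operands
    in the range 1..2**n. Returns (sum_d1, is_zero): sum_d1 is the
    diminished-1 code of the result, and is_zero signals that the true
    sum is congruent to 0, which has no n-bit diminished-1 code.
    """
    mask = (1 << n) - 1
    if not (0 <= a_d1 <= mask and 0 <= b_d1 <= mask):
        raise ValueError("operands must be n-bit diminished-1 codes")
    total = a_d1 + b_d1          # plain n-bit addition, carry kept separately
    carry_out = total >> n
    is_zero = total == mask      # all ones: the true sum equals 2**n + 1
    # Inverted end-around carry: add 1 exactly when no carry was produced.
    sum_d1 = (total + (1 - carry_out)) & mask
    return sum_d1, is_zero
```

The `is_zero` flag illustrates the special handling of zero that conventional diminished-1 designs need and that the abstract says the proposed method removes.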

Design of Low Power and High Speed Digital IIR Filter in 45nm with Optimized CSA for Digital Signal Processing Applications

In this paper, a design methodology for implementing a low-power, high-speed 2nd-order recursive digital Infinite Impulse Response (IIR) filter is proposed. Since IIR filters suffer from a large number of constant multiplications, the proposed method replaces the constant multiplications with addition/subtraction and shift operations. The proposed new 6T adder cell is used as the Carry-Save Adder (CSA) to implement the addition/subtraction operations in the recursive section of the IIR filter and so reduce the propagation delay. Furthermore, high-level algorithms designed to optimize the number of CSA blocks are used to reduce the complexity of the IIR filter. The DSCH3 tool is used to generate the schematic of the proposed 6T CSA based shift-add architecture, which is analyzed using the Microwind CAD tool to synthesize low-complexity, high-speed IIR filters. The proposed design outperforms MUX-12T and MCIT-7T based CSA filter designs in terms of power, propagation delay, area, and throughput. The experimental results show that the proposed 6T based design method can find better IIR filter designs in terms of power and delay than those obtained by using efficient general multipliers.
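
The idea of replacing a constant multiplication with shifts and additions/subtractions can be sketched as follows. The sketch assumes integer (fixed-point) coefficients and uses a canonical signed-digit (CSD) recoding, which is one common way to minimise the number of add/subtract terms; the abstract does not state which recoding the authors use, and the names `csd_digits` and `shift_add_multiply` are illustrative.

```python
def csd_digits(c: int):
    """Canonical signed-digit recoding of a positive integer.

    Returns digits in {-1, 0, 1}, least significant first, such that
    c == sum(d << i for i, d in enumerate(digits)) and no two adjacent
    digits are both non-zero.
    """
    if c <= 0:
        raise ValueError("constant must be positive")
    digits = []
    while c:
        if c & 1:
            d = 2 - (c & 3)  # +1 when c % 4 == 1, -1 when c % 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def shift_add_multiply(x: int, c: int) -> int:
    """Compute x * c using only shifts, additions, and subtractions."""
    acc = 0
    for shift, d in enumerate(csd_digits(c)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc
```

For instance, the constant 7 recodes to 8 - 1, so `shift_add_multiply(x, 7)` evaluates `(x << 3) - x` with a single subtraction instead of the two additions needed by the plain binary form 4 + 2 + 1.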

An Approach for Blind Source Separation using the Sliding DFT and Time Domain Independent Component Analysis

The "cocktail party problem" is well known in connection with a human auditory ability: we can pick out the specific sound we want to listen to even when many undesirable sounds or noises are mixed in. Blind source separation (BSS) based on independent component analysis (ICA) is one method for separating a particular signal from mixed signals under simple assumptions. In this paper, we propose an online approach to blind source separation using the sliding DFT and time-domain independent component analysis. The proposed method reduces computational complexity in comparison with conventional methods and lends itself to parallel processing on digital signal processors (DSPs) and similar devices. We evaluate this method and show its effectiveness.
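
The sliding DFT mentioned above updates every frequency bin from its previous value when one new sample arrives, instead of recomputing a full transform. A minimal sketch of that recurrence is given below; the class name `SlidingDFT` and its interface are assumptions for illustration, and the ICA stage of the proposed method is not shown.

```python
import cmath
from collections import deque


class SlidingDFT:
    """Track the N-point DFT of the most recent N samples.

    Uses the recurrence X_k(n) = (X_k(n-1) - x(n-N) + x(n)) * exp(2j*pi*k/N),
    so each new sample costs O(N) work for all bins.
    """

    def __init__(self, size: int):
        if size < 1:
            raise ValueError("size must be positive")
        self.size = size
        self.window = deque([0.0] * size, maxlen=size)  # starts zero-filled
        self.bins = [0j] * size
        self.twiddles = [cmath.exp(2j * cmath.pi * k / size) for k in range(size)]

    def update(self, sample):
        """Push one sample and return the updated list of DFT bins."""
        oldest = self.window[0]
        self.window.append(sample)  # maxlen drops the oldest sample
        delta = sample - oldest
        self.bins = [(value + delta) * w for value, w in zip(self.bins, self.twiddles)]
        return self.bins
```

Because each bin is updated independently of the others, the per-bin updates can be distributed across parallel processing units. In long-running floating-point use the recurrence accumulates rounding error, so practical implementations often add a damping factor or periodic resynchronisation, which this sketch omits.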

The Haar Wavelet Transform of the DNA Signal Representation

Deoxyribonucleic acid (DNA) is a double-stranded helix of nucleotides consisting of Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). In this work, we convert this genetic code into an equivalent digital signal representation. By applying a wavelet transform, such as the Haar wavelet, we are able to extract details that are not clearly visible in the original genetic code. We compare different organisms using the results of the Haar wavelet transform. This is achieved by using the trend part of the signal, since the trend part carries most of the energy of the digital signal representation. Consequently, we are able to quantitatively reconstruct different biological families.
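
One level of the Haar transform on a nucleotide sequence can be sketched as follows. The numeric mapping in `NUCLEOTIDE_VALUE` is purely hypothetical, since the abstract does not say how the bases are converted to numbers, and the function names are illustrative.

```python
import math

# Hypothetical mapping from bases to numbers (an assumption for this sketch).
NUCLEOTIDE_VALUE = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}


def dna_to_signal(sequence: str):
    """Convert a DNA string into a list of numeric samples."""
    return [NUCLEOTIDE_VALUE[base] for base in sequence.upper()]


def haar_step(signal):
    """One level of the orthonormal Haar transform.

    Returns (trend, detail): scaled pairwise sums and differences.
    The input length must be even.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    trend = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return trend, detail
```

With this mapping, the sequence "ACGT" becomes the signal [0, 1, 2, 3] with energy 14, of which the trend part holds 13 and the detail part holds 1, which illustrates why the trend part is the natural basis for comparison. Applying `haar_step` again to the trend gives the next coarser level.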

A 24-Bit, 8.1-MS/s D/A Converter for Audio Baseband Channel Applications

This paper studies the high-level modelling and design of delta-sigma (ΔΣ) noise shapers for an audio Digital-to-Analog Converter (DAC), so as to eliminate the in-band Signal-to-Noise Ratio (SNR) degradation that accompanies one-channel mismatch in the audio signal. The converter combines cascaded digital signal interpolation with a noise-shaping single-loop delta-sigma modulator that has a 5-bit quantizer in the final stage. To reduce sensitivity to DAC nonlinearities in the last stage, a high-pass second-order Data Weighted Averaging (R2DWA) scheme is introduced. This paper presents a MATLAB modelling approach for the proposed DAC architecture with low-distortion and swing-suppression integrator designs. The ΔΣ modulator can be configured as a 3rd-order modulator and supports 24-bit PCM at a sampling rate of 64 kHz for Digital Video Disc (DVD) audio applications. The modelling approach provides 139.38 dB of dynamic range for a 32 kHz signal band at a -1.6 dBFS input signal level.

A New Predictor of Coding Regions in Genomic Sequences using a Combination of Different Approaches

Identifying protein coding regions in DNA sequences is a basic step in the location of genes. Several approaches based on signal processing tools have been applied to solve this problem, trying to achieve more accurate predictions. This paper presents a new predictor that improves the efficacy of three techniques that use the Fourier Transform to predict coding regions, and that could be computed using an algorithm that reduces the computation load. Some ideas about the combination of the predictor with other methods are discussed. ROC curves are used to demonstrate the efficacy of the proposed predictor, based on the computation of 25 DNA sequences from three different organisms.

A Novel Digital Calibration Technique for Gain and Offset Mismatch in TIΣΔ ADCs

The time-interleaved sigma-delta (TIΣΔ) architecture is a potential candidate for high-bandwidth analog-to-digital converters (ADCs), which remain a bottleneck for software and cognitive radio receivers. However, the performance of the TIΣΔ architecture is limited by the unavoidable gain and offset mismatches resulting from the manufacturing process. This paper presents a novel digital calibration method to compensate for the gain and offset mismatch effects. The proposed method takes advantage of the reconstruction digital signal processing on each channel and requires only a few logic components for implementation. The run-time calibration is estimated at 10 and 15 clock cycles for offset cancellation and gain mismatch calibration, respectively.

High-Speed Pipeline Implementation of Radix-2 DIF Algorithm

In this paper, we propose a new architecture for implementing the N-point Fast Fourier Transform (FFT), based on the Radix-2 Decimation in Frequency (DIF) algorithm. The architecture is based on a pipeline circuit that can process a stream of samples and produce two FFT output samples every clock cycle. Compared with existing implementations, the proposed architecture achieves double the processing speed with the same circuit complexity.
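
The Radix-2 DIF algorithm that the pipeline implements can be sketched recursively in software. Each stage applies butterflies to the two halves of the input and then transforms each half separately, so the even-indexed and odd-indexed outputs come from independent sub-transforms. The function name `fft_dif` is illustrative, and the sketch models the algorithm only, not the proposed pipeline circuit.

```python
import cmath


def fft_dif(x):
    """Radix-2 decimation-in-frequency FFT of a power-of-two length sequence.

    Returns the DFT in natural order as a list of complex numbers.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]

    half = n // 2
    # Butterfly stage: sums feed the even outputs, twiddled differences the odd.
    sums = [x[i] + x[i + half] for i in range(half)]
    diffs = [
        (x[i] - x[i + half]) * cmath.exp(-2j * cmath.pi * i / n)
        for i in range(half)
    ]

    out = [0j] * n
    out[0::2] = fft_dif(sums)    # X[2k]
    out[1::2] = fft_dif(diffs)   # X[2k + 1]
    return out
```

Because the two sub-transforms are independent, a hardware pipeline can evaluate them side by side, which is consistent with producing more than one output sample per clock cycle.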

A Novel Low Power, High Speed 14 Transistor CMOS Full Adder Cell with 50% Improvement in Threshold Loss Problem

Full adders are important components in applications such as digital signal processor (DSP) architectures and microprocessors. In addition to its main task, which is adding two numbers, the adder participates in many other useful operations such as subtraction, multiplication, division, and address calculation. In most of these systems the adder lies in the critical path that determines the overall speed of the system, so enhancing the performance of the 1-bit full adder cell (the building block of the adder) is a significant goal. Demands for low-power VLSI have been pushing the development of aggressive design methodologies to reduce power consumption drastically. To meet this growing demand, we propose a new low-power adder cell that, at the cost of a higher MOS transistor count, reduces the serious threshold loss problem, considerably increases speed, and decreases power compared with the static energy recovery full (SERF) adder. A new, improved 14T CMOS 1-bit full adder cell is therefore presented in this paper. Results show a 50% improvement in the threshold loss problem, a 45% improvement in speed, and a considerable reduction in power consumption compared with the SERF adder and other types of adders with comparable performance.
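
The logical function of the 1-bit cell, and the way it chains into a multi-bit adder whose carry path sets the critical delay, can be sketched at the behavioural level. The function names are illustrative, and the sketch says nothing about the transistor-level 14T design.

```python
def full_adder(a: int, b: int, carry_in: int):
    """1-bit full adder: returns (sum_bit, carry_out) for bits a, b, carry_in."""
    partial = a ^ b
    sum_bit = partial ^ carry_in
    carry_out = (a & b) | (partial & carry_in)
    return sum_bit, carry_out


def ripple_carry_add(x: int, y: int, width: int):
    """Add two unsigned `width`-bit integers with chained full adders.

    Returns (result, carry_out), where result holds the low `width` bits.
    """
    result = 0
    carry = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry
```

Each iteration depends on the carry from the previous one, which is why the delay of a single cell multiplies along the chain and why cell-level speed improvements matter for the whole system.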