Abstract: With the rapid development of deep learning, neural networks and deep learning algorithms play a significant role in various practical applications. Owing to their high accuracy and good performance, Convolutional Neural Networks (CNNs) in particular have become a research hot spot in the past few years. However, network sizes keep growing to meet the demands of practical applications, which makes it a significant challenge to construct high-performance implementations of deep neural networks. Meanwhile, many of these application scenarios also place strict requirements on the performance and power consumption of hardware devices. Therefore, it is particularly critical to choose a suitable computing platform for hardware acceleration of CNNs. This article surveys recent advances in Field Programmable Gate Array (FPGA)-based acceleration of CNNs. Various designs and implementations of FPGA-based accelerators for different devices and network models are reviewed and compared with their counterparts on Graphics Processing Units (GPUs), Application Specific Integrated Circuits (ASICs) and Digital Signal Processors (DSPs), and we present our own critical analysis and comments. Finally, we discuss these acceleration and optimization methods on FPGA platforms from different perspectives to further explore the opportunities and challenges for future research, and we give an outlook on the future development of FPGA-based accelerators.
Abstract: Autonomous driving systems require high reliability to provide people with a safe and comfortable driving experience. However, despite the development of a number of vehicle sensors, it is difficult to always provide high perception performance in driving environments that vary with time and season. Image segmentation using deep learning, which has recently evolved rapidly, provides stable, high recognition performance in various road environments. However, since the system controls a vehicle in real time, a highly complex deep learning network cannot be used due to time and memory constraints. Moreover, efficient networks are optimized for GPU environments, so their performance degrades in embedded processor environments equipped with simple hardware accelerators. In this paper, a semantic segmentation network, the matrix multiplication accelerator network (MMANet), optimized for the matrix multiplication accelerator (MMA) on Texas Instruments digital signal processors (TI DSP), is proposed to improve the recognition performance of autonomous driving systems. The proposed method is designed to maximize the number of layers that can be executed in a limited time to provide reliable driving environment information in real time. First, the number of channels in the activation map is fixed to fit the structure of the MMA. The loss of information caused by fixing the number of channels is compensated by increasing the number of parallel branches. Second, an efficient convolution is selected depending on the size of the activation. Since the MMA has a fixed structure, normal convolution may be more efficient than depthwise separable convolution, depending on memory access overhead. Thus, the convolution type is decided according to the output stride to increase network depth. In addition, memory access time is minimized by processing operations only in L3 cache. Lastly, reliable contexts are extracted using the extended atrous spatial pyramid pooling (ASPP). 
The suggested method obtains stable features from an extended path by increasing the kernel size and accessing consecutive data. In addition, it consists of two ASPPs to obtain high quality contexts using the restored shape, without global average pooling paths, since that layer uses the MMA as a simple adder. To verify the proposed method, an experiment is conducted using perfsim, a timing simulator, and the Cityscapes validation set. The proposed network can process an image with 640 x 480 resolution in 6.67 ms, so six cameras can be used to identify the surroundings of the vehicle at 20 frames per second (FPS). In addition, it achieves 73.1% mean intersection over union (mIoU), which is the highest recognition rate among embedded networks on the Cityscapes validation set.
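The trade-off between normal and depthwise separable convolution mentioned above can be illustrated with a rough cost count. The sketch below is only an illustration under stated assumptions: stride 1, "same" padding, no bias, and a deliberately crude traffic model that counts activation elements read and written per layer. The function names and the traffic model are hypothetical and do not describe the actual memory behaviour of the TI MMA.

```python
def normal_conv_cost(h, w, c_in, c_out, k):
    """Return (MACs, activation elements moved) for one k x k convolution."""
    macs = h * w * c_in * c_out * k * k
    # Assumed traffic model: read the input map once, write the output map once.
    activations = h * w * (c_in + c_out)
    return macs, activations


def depthwise_separable_cost(h, w, c_in, c_out, k):
    """Return (MACs, activation elements moved) for depthwise + 1x1 pointwise."""
    depthwise_macs = h * w * c_in * k * k
    pointwise_macs = h * w * c_in * c_out
    # Assumed traffic model: the intermediate map (c_in channels) is written by
    # the depthwise stage and read back by the pointwise stage.
    activations = h * w * (c_in + 2 * c_in + c_out)
    return depthwise_macs + pointwise_macs, activations


if __name__ == "__main__":
    for name, fn in (("normal", normal_conv_cost),
                     ("depthwise separable", depthwise_separable_cost)):
        macs, acts = fn(60, 80, 64, 64, 3)
        print(f"{name}: {macs} MACs, {acts} activation elements moved")
```

Under this simplified model the depthwise separable variant needs far fewer multiply-accumulates but moves more activation data, which is one way memory access overhead could outweigh the arithmetic savings on an accelerator with a fixed structure.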
Abstract: Seismic waves generated deep in the earth during an earthquake exert forces on a structure and cause the building to vibrate. Large seismic forces cause extensive damage to the low-strength sections of the structure. The use of new steel shear walls in steel structures increases the strength of the building and its main members (columns) by reducing and dissipating seismic forces during earthquakes. In the present study, a type of steel shear wall with regular holes in the inner plate was evaluated by finite element modeling with Abaqus software. The steel plate shear wall, measuring 6000 × 3000 mm (one story) with a thickness of 3 mm, was modeled with four different holes of a given cross-sectional area. The shear wall was subjected to 5-second dynamic time-history analyses using three accelerograms: El Centro, Imperial Valley and Kobe. The results showed that increasing the distance between the geometric center of the hole and the geometric center of the inner plate in the steel shear wall (increasing the RCS index) caused the total maximum acceleration to be transferred from the perimeter of the hole to the horizontal and vertical beams. The results also showed that there is no direct relationship between the RCS index and the total acceleration in the steel shear wall, and that the RCS index is independent of the peak ground acceleration of the earthquake.
Abstract: Real-time image and video processing is in demand in
many computer vision applications, e.g. video surveillance, traffic
management and medical imaging. Processing such video
applications requires high computational power. Thus, the optimal
solution is collaboration between the CPU and hardware accelerators. In
this paper, a Canny edge detection hardware accelerator is proposed.
Edge detection is one of the basic building blocks of video and image
processing applications. It is a common block in the pre-processing
phase of the image and video processing pipeline. The presented
approach offloads the Canny edge detection algorithm from the
processing system (PS) to the programmable logic (PL), taking
advantage of the High Level Synthesis (HLS) tool flow to accelerate the
implementation on the Zynq platform. The resulting implementation
achieves up to a 100x performance improvement through hardware
acceleration. CPU utilization drops and the frame rate
rises to 60 fps for a 1080p full HD input video stream.
Abstract: The main objective of the study is to
produce slag-based geopolymer concrete by the addition
of an alkali activator. Test results indicated that the reaction of silicates
in slag is based on the reaction potential of sodium hydroxide and the
formation of alumino-silicates. The study also evaluates
the efficiency of the polymer reaction in terms of the
strength gain properties of different geopolymer mixtures.
Geopolymer mixture proportions were designed for different binder
to total aggregate ratios (0.3 & 0.45) and fine to coarse aggregate ratios
(0.4 & 0.8). Geopolymer concrete specimens cast under normal
curing conditions reached a maximum 28-day compressive strength
of 54.75 MPa. The addition of glued steel fibres at 1.0% Vf to
geopolymer concrete showed reasonable improvements in the
compressive strength, split tensile strength and flexural properties of
different geopolymer mixtures. Further, a comparative assessment was
made of the different geopolymer mixtures, and the reinforcing effects of
steel fibres were investigated in different concrete matrices.
Abstract: Falls are the primary cause of accidents in people over
the age of 65, and frequently lead to serious injuries. Since the early
detection of falls is an important step to alert and protect the aging
population, a variety of research on detecting falls has been carried out,
including the use of accelerometers, gyroscopes and tilt sensors. In
existing studies, falls were detected using one accelerometer, but with
errors. In this study, the proposed method uses
two accelerometers to reject false fall detections. As falls are
accompanied by the acceleration of gravity and rotational motion, the
falls in this study were detected by using the z-axial acceleration
differences between two sites. The falls were detected by calculating
the difference between the readings of accelerometers placed at two
different positions on the chest of the subject. The parameters of the
maximum difference of accelerations (diff_Z) and the integration of
accelerations in a defined region (Sum_diff_Z) were used to form the
fall detection algorithm. The falls and the activities of daily living
(ADL) could be distinguished by using the proposed parameters
without errors in spite of the impact and the change in the positions
of the accelerometers. By comparing each of the axial accelerations,
the directions of falls and the condition of the subject afterwards
could be determined. In this study, falls were detected without errors
by using two accelerometers attached to two sites, which confirmed the
usefulness of the proposed fall detection algorithm parameters, diff_Z and
Sum_diff_Z.
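The two parameters described above can be sketched roughly as follows. This is a minimal illustration, not the study's algorithm: it assumes diff_Z is the largest absolute difference between the z-axis samples of the two accelerometers, and that Sum_diff_Z is the sum of those absolute differences over a small window centred on that peak. The window size, the thresholds and the function names are hypothetical.

```python
def fall_parameters(z_a, z_b, half_window=2):
    """Return (diff_z, sum_diff_z) for two equally long z-axis traces.

    diff_z     : maximum absolute sample-wise difference (assumed definition).
    sum_diff_z : sum of absolute differences in a window around the peak
                 (assumed stand-in for "integration in a defined region").
    """
    if len(z_a) != len(z_b) or not z_a:
        raise ValueError("traces must be non-empty and equally long")
    diffs = [abs(a - b) for a, b in zip(z_a, z_b)]
    peak = max(range(len(diffs)), key=diffs.__getitem__)
    start = max(0, peak - half_window)
    stop = min(len(diffs), peak + half_window + 1)
    return diffs[peak], sum(diffs[start:stop])


def is_fall(z_a, z_b, diff_threshold=1.0, sum_threshold=2.0, half_window=2):
    """Flag a fall only when both parameters exceed their (hypothetical) thresholds."""
    diff_z, sum_diff_z = fall_parameters(z_a, z_b, half_window)
    return diff_z >= diff_threshold and sum_diff_z >= sum_threshold


if __name__ == "__main__":
    upper = [1.0, 1.0, 3.0, 1.0, 1.0]   # made-up samples in g
    lower = [1.0, 1.0, 1.0, 1.0, 1.0]
    print(fall_parameters(upper, lower), is_fall(upper, lower))
```

Requiring both parameters to exceed a threshold is one simple way to combine them; the abstract does not state how the study combines diff_Z and Sum_diff_Z.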
Abstract: Modular multiplication is the basic operation
in most public key cryptosystems, such as RSA, DSA, ECC,
and DH key exchange. Unfortunately, very large operands
(on the order of 1024 or 2048 bits) must be used to provide
sufficient security strength. The use of such big numbers
dramatically slows down the whole cipher system, especially
when running on embedded processors.
So far, customized hardware accelerators - developed on
FPGAs or ASICs - have been the best choice for accelerating
modular multiplication in embedded environments. On the
other hand, many algorithms have been developed to speed
up such operations. Examples are the Montgomery modular
multiplication and the interleaved modular multiplication
algorithms. Combining customized hardware with
an efficient algorithm is expected to provide a much faster
cipher system.
This paper introduces an enhanced architecture for computing
the modular multiplication of two large numbers X
and Y modulo a given modulus M. The proposed design is
compared with three previous architectures based on
carry save adders and look up tables. Look up tables should
be loaded with a set of pre-computed values. Our proposed
architecture uses the same carry save addition, but replaces
both look up tables and pre-computations with an enhanced
version of sign detection techniques. The proposed architecture
supports higher frequencies than other architectures.
It also achieves a shorter overall absolute time for a single operation.
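For reference, the interleaved modular multiplication named above can be sketched in software as follows. This is the textbook bit-serial form, shown only to illustrate the arithmetic: it is not the proposed architecture, which the abstract describes as using carry save addition and sign detection. The function name is hypothetical.

```python
def interleaved_modmul(x, y, m):
    """Compute (x * y) mod m by scanning the bits of x from MSB to LSB.

    Each step doubles the partial result, conditionally adds y, and then
    reduces with at most two subtractions of m.
    """
    if m <= 0:
        raise ValueError("modulus must be positive")
    if not (0 <= x < m and 0 <= y < m):
        raise ValueError("operands must satisfy 0 <= x, y < m")
    p = 0
    for i in reversed(range(x.bit_length())):
        p <<= 1                 # P = 2 * P
        if (x >> i) & 1:
            p += y              # P = P + x_i * Y
        # Here p < 3 * m, so two conditional subtractions restore p < m.
        if p >= m:
            p -= m
        if p >= m:
            p -= m
    return p


if __name__ == "__main__":
    print(interleaved_modmul(7, 9, 13))
```

The two comparisons against m in every iteration are the costly part in hardware, which is why architectures of this kind replace them with look up tables or sign detection.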