Abstract: Motion estimation occupies the heaviest computation in HEVC (high efficiency video coding). Many fast algorithms such as TZS (test zone search) have been proposed to reduce the computation. Still the huge computation of the motion estimation is a critical issue in the implementation of HEVC video codec. In this paper, motion estimator architecture with optimized number of PEs (processing element) is presented by exploiting early termination. It also reduces hardware size by exploiting parallel processing. The presented motion estimator architecture has 8 PEs, and it can efficiently perform TZS with very high utilization of PEs.
Abstract: An innovative approach to develop modified scaling free CORDIC based two parallel pipelined Multipath Delay Commutator (MDC) FFT and IFFT architectures for radix 22 FFT algorithm is presented. Multipliers and adders are the most important data paths in FFT and IFFT architectures. Multipliers occupy high area and consume more power. In order to optimize the area and power overhead, modified scaling-free CORDIC based complex multiplier is utilized in the proposed design. In general twiddle factor values are stored in RAM block. In the proposed work, modified scaling-free CORDIC based twiddle factor generator unit is used to generate the twiddle factor and efficient switching units are used. In addition to this, four point FFT operations are performed without complex multiplication which helps to reduce area and power in the last two stages of the pipelined architectures. The design proposed in this paper is based on multipath delay commutator method. The proposed design can be extended to any radix 2n based FFT/IFFT algorithm to improve the throughput. The work is synthesized using Synopsys design Compiler using TSMC 90-nm library. The proposed method proves to be better compared to the reference design in terms of area, throughput and power consumption. The comparative analysis of the proposed design with Xilinx FPGA platform is also discussed in the paper.
Abstract: Background detection is essential in video analyses; optimization is often needed in order to achieve real time calculation. Information gathered by dual cameras placed in the front and rear part of an Autonomous Vehicle (AV) is integrated for background detection. In this paper, real time calculation is achieved on the proposed technique by using Priority Regions (PR) and Parallel Processing together where each frame is divided into regions then and each region process is processed in parallel. PR division depends upon driver view limitations. A background detection system is built on the Temporal Difference (TD) and Gaussian Filtering (GF). Temporal Difference and Gaussian Filtering with multi threshold and sigma (weight) value are be based on PR characteristics. The experiment result is prepared on real scene. Comparison of the speed and accuracy with traditional background detection techniques, the effectiveness of PR and parallel processing are also discussed in this paper.
Abstract: High level synthesis (HLS) is a process which
generates register-transfer level design for digital systems from
behavioral description. There are many HLS algorithms and
commercial tools. However, most of these algorithms consider a
behavioral description for the system when a single token is
presented to the system. This approach does not exploit extra
hardware efficiently, especially in the design of digital filters where
common operations may exist between successive tokens. In this
paper, we modify the behavioral description to process multiple
tokens in parallel. However, this approach is unlike the full
processing that requires full hardware replication. It exploits the
presence of common operations between successive tokens. The
performance of the proposed approach is better than sequential
processing and approaches that of full parallel processing as the
hardware resources are increased.
Abstract: ''Cocktail party problem'' is well known as one of the human auditory abilities. We can recognize the specific sound that we want to listen by this ability even if a lot of undesirable sounds or noises are mixed. Blind source separation (BSS) based on independent component analysis (ICA) is one of the methods by which we can separate only a special signal from their mixed signals with simple hypothesis. In this paper, we propose an online approach for blind source separation using the sliding DFT and the time domain independent component analysis. The proposed method can reduce calculation complexity in comparison with conventional methods, and can be applied to parallel processing by using digital signal processors (DSPs) and so on. We evaluate this method and show its availability.
Abstract: This paper presents implementation of attitude controller for a small UAV using field programmable gate array (FPGA). Due to the small size constrain a miniature more compact and computationally extensive; autopilot platform is needed for such systems. More over UAV autopilot has to deal with extremely adverse situations in the shortest possible time, while accomplishing its mission. FPGAs in the recent past have rendered themselves as fast, parallel, real time, processing devices in a compact size. This work utilizes this fact and implements different attitude controllers for a small UAV in FPGA, using its parallel processing capabilities. Attitude controller is designed in MATLAB/Simulink environment. The discrete version of this controller is implemented using pipelining followed by retiming, to reduce the critical path and thereby clock period of the controller datapath. Pipelined, retimed, parallel PID controller implementation is done using rapidprototyping and testing efficient development tool of “system generator", which has been developed by Xilinx for FPGA implementation. The improved timing performance enables the controller to react abruptly to any changes made to the attitudes of UAV.
Abstract: Full search block matching algorithm is widely used for hardware implementation of motion estimators in video compression algorithms. In this paper we are proposing a new architecture, which consists of a 2D parallel processing unit and a 1D unit both working in parallel. The proposed architecture reduces both data access power and computational power which are the main causes of power consumption in integer motion estimation. It also completes the operations with nearly the same number of clock cycles as compared to a 2D systolic array architecture. In this work sum of absolute difference (SAD)-the most repeated operation in block matching, is calculated in two steps. The first step is to calculate the SAD for alternate rows by a 2D parallel unit. If the SAD calculated by the parallel unit is less than the stored minimum SAD, the SAD of the remaining rows is calculated by the 1D unit. Early termination, which stops avoidable computations has been achieved with the help of alternate rows method proposed in this paper and by finding a low initial SAD value based on motion vector prediction. Data reuse has been applied to the reference blocks in the same search area which significantly reduced the memory access.
Abstract: In this paper, an approach to reduce the computation steps required by fast neural networksfor the searching process is presented. The principle ofdivide and conquer strategy is applied through imagedecomposition. Each image is divided into small in sizesub-images and then each one is tested separately usinga fast neural network. The operation of fast neuralnetworks based on applying cross correlation in thefrequency domain between the input image and theweights of the hidden neurons. Compared toconventional and fast neural networks, experimentalresults show that a speed up ratio is achieved whenapplying this technique to locate human facesautomatically in cluttered scenes. Furthermore, fasterface detection is obtained by using parallel processingtechniques to test the resulting sub-images at the sametime using the same number of fast neural networks. Incontrast to using only fast neural networks, the speed upratio is increased with the size of the input image whenusing fast neural networks and image decomposition.
Abstract: Neural processors have shown good results for
detecting a certain character in a given input matrix. In this paper, a
new idead to speed up the operation of neural processors for character
detection is presented. Such processors are designed based on cross
correlation in the frequency domain between the input matrix and the
weights of neural networks. This approach is developed to reduce the
computation steps required by these faster neural networks for the
searching process. The principle of divide and conquer strategy is
applied through image decomposition. Each image is divided into
small in size sub-images and then each one is tested separately by
using a single faster neural processor. Furthermore, faster character
detection is obtained by using parallel processing techniques to test the
resulting sub-images at the same time using the same number of faster
neural networks. In contrast to using only faster neural processors, the
speed up ratio is increased with the size of the input image when using
faster neural processors and image decomposition. Moreover, the
problem of local subimage normalization in the frequency domain is
solved. The effect of image normalization on the speed up ratio of
character detection is discussed. Simulation results show that local
subimage normalization through weight normalization is faster than
subimage normalization in the spatial domain. The overall speed up
ratio of the detection process is increased as the normalization of
weights is done off line.
Abstract: Over the past few years, a number of efforts have
been exerted to build parallel processing systems that utilize the idle
power of LAN-s and PC-s available in many homes and corporations.
The main advantage of these approaches is that they provide cheap
parallel processing environments for those who cannot afford the
expenses of supercomputers and parallel processing hardware.
However, most of the solutions provided are not very flexible in the
use of available resources and very difficult to install and setup.
In this paper, a multi-level web-based parallel processing system
(MWPS) is designed (appendix). MWPS is based on the idea of
volunteer computing, very flexible, easy to setup and easy to use.
MWPS allows three types of subscribers: simple volunteers (single
computers), super volunteers (full networks) and end users. All of
these entities are coordinated transparently through a secure web site.
Volunteer nodes provide the required processing power needed by
the system end users. There is no limit on the number of volunteer
nodes, and accordingly the system can grow indefinitely. Both
volunteer and system users must register and subscribe. Once, they
subscribe, each entity is provided with the appropriate MWPS
components. These components are very easy to install.
Super volunteer nodes are provided with special components that
make it possible to delegate some of the load to their inner nodes.
These inner nodes may also delegate some of the load to some other
lower level inner nodes .... and so on. It is the responsibility of the
parent super nodes to coordinate the delegation process and deliver
the results back to the user.
MWPS uses a simple behavior-based scheduler that takes into
consideration the current load and previous behavior of processing
nodes. Nodes that fulfill their contracts within the expected time get a
high degree of trust. Nodes that fail to satisfy their contract get a
lower degree of trust.
MWPS is based on the .NET framework and provides the minimal
level of security expected in distributed processing environments.
Users and processing nodes are fully authenticated. Communications
and messages between nodes are very secure. The system has been
implemented using C#.
MWPS may be used by any group of people or companies to
establish a parallel processing or grid environment.
Abstract: The purpose of this paper is to propose a framework for constructing correct parallel processing programs based on Equivalent Transformation Framework (ETF). ETF regards computation as In the framework, a problem-s domain knowledge and a query are described in definite clauses, and computation is regarded as transformation of the definite clauses. Its meaning is defined by a model of the set of definite clauses, and the transformation rules generated must preserve meaning. We have proposed a parallel processing method based on “specialization", a part of operation in the transformations, which resembles substitution in logic programming. The method requires “Memo-tree", a history of specialization to maintain correctness. In this paper we proposes the new method for the specialization-base parallel processing without Memo-tree.
Abstract: Every day human life experiences new equipments
more automatic and with more abilities. So the need for faster
processors doesn-t seem to finish. Despite new architectures and
higher frequencies, a single processor is not adequate for many
applications. Parallel processing and networks are previous solutions
for this problem. The new solution to put a network of resources on a
chip is called NOC (network on a chip). The more usual topology for
NOC is mesh topology. There are several routing algorithms suitable
for this topology such as XY, fully adaptive, etc. In this paper we
have suggested a new algorithm named Intermittent X, Y (IX/Y). We
have developed the new algorithm in simulation environment to
compare delay and power consumption with elders' algorithms.
Abstract: Large-scale systems such as Grids offer
infrastructures for both data distribution and parallel processing. The
use of Grid infrastructures is a more recent issue that is already
impacting the Distributed Database Management System industry. In
DBMS, distributed query processing has emerged as a fundamental
technique for ensuring high performance in distributed databases.
Database placement is particularly important in large-scale systems
because it reduces communication costs and improves resource
usage. In this paper, we propose a dynamic database placement
policy that depends on query patterns and Grid sites capabilities. We
evaluate the performance of the proposed database placement policy
using simulations. The obtained results show that dynamic database
placement can significantly improve the performance of distributed
query processing.
Abstract: This work aims to test the application of computational fluid dynamics (CFD) modeling to fixed bed catalytic cracking reactors. Studies of CFD with a fixed bed design commonly use a regular packing with N=2 to define bed geometry. CFD allows us to obtain a more accurate view of the fluid flow and heat transfer mechanisms present in fixed bed equipment. Naphtha was used as feedstock and the reactor length was 80cm. It is divided in three sections that catalyst bed packed in the middle section of the reactor. The reaction scheme was involved one primary reaction and 24 secondary reactions. Because of high CPU times in these simulations, parallel processing have been used. In this study the coke formation process in fixed bed and empty tube reactor was simulated and coke in these reactors are compared. In addition, the effect of steam ratio and feed flow rate on coke formation was investigated.
Abstract: Mobile agent has motivated the creation of a new
methodology for parallel computing. We introduce a methodology
for the creation of parallel applications on the network. The proposed
Mobile-Agent parallel processing framework uses multiple Javamobile
Agents. Each mobile agent can travel to the specified
machine in the network to perform its tasks. We also introduce the
concept of master agent, which is Java object capable of
implementing a particular task of the target application. Master agent
is dynamically assigns the task to mobile agents. We have developed
and tested a prototype application: Mobile Agent Based Parallel
Computing. Boosted by the inherited benefits of using Java and
Mobile Agents, our proposed methodology breaks the barriers
between the environments, and could potentially exploit in a parallel
manner all the available computational resources on the network.
This paper elaborates performance issues of a mobile agent for
parallel computing.
Abstract: this paper gives a novel approach towards real-time speed estimation of multiple traffic vehicles using fuzzy logic and image processing techniques with proper arrangement of camera parameters. The described algorithm consists of several important steps. First, the background is estimated by computing median over time window of specific frames. Second, the foreground is extracted using fuzzy similarity approach (FSA) between estimated background pixels and the current frame pixels containing foreground and background. Third, the traffic lanes are divided into two parts for both direction vehicles for parallel processing. Finally, the speeds of vehicles are estimated by Maximum a Posterior Probability (MAP) estimator. True ground speed is determined by utilizing infrared sensors for three different vehicles and the results are compared to the proposed algorithm with an accuracy of ± 0.74 kmph.
Abstract: In this paper is investigated a possible
optimization of some linear algebra problems which can be
solved by parallel processing using the special arrays called
systolic arrays. In this paper are used some special types of
transformations for the designing of these arrays. We show
the characteristics of these arrays. The main focus is on
discussing the advantages of these arrays in parallel
computation of matrix product, with special approach to the
designing of systolic array for matrix multiplication.
Multiplication of large matrices requires a lot of
computational time and its complexity is O(n3 ). There are
developed many algorithms (both sequential and parallel) with
the purpose of minimizing the time of calculations. Systolic
arrays are good suited for this purpose. In this paper we show
that using an appropriate transformation implicates in finding
more optimal arrays for doing the calculations of this type.
Abstract: Local Linear Neuro-Fuzzy Models (LLNFM) like other neuro- fuzzy systems are adaptive networks and provide robust learning capabilities and are widely utilized in various applications such as pattern recognition, system identification, image processing and prediction. Local linear model tree (LOLIMOT) is a type of Takagi-Sugeno-Kang neuro fuzzy algorithm which has proven its efficiency compared with other neuro fuzzy networks in learning the nonlinear systems and pattern recognition. In this paper, a dedicated reconfigurable and parallel processing hardware for LOLIMOT algorithm and its applications are presented. This hardware realizes on-chip learning which gives it the capability to work as a standalone device in a system. The synthesis results on FPGA platforms show its potential to improve the speed at least 250 of times faster than software implemented algorithms.
Abstract: Various mechanisms providing mutual exclusion and
thread synchronization can be used to support parallel processing
within a single computer. Instead of using locks, semaphores, barriers
or other traditional approaches in this paper we focus on alternative
ways for making better use of modern multithreaded architectures
and preparing hash tables for concurrent accesses. Hash structures
will be used to demonstrate and compare two entirely different
approaches (rule based cooperation and hardware synchronization
support) to an efficient parallel implementation using traditional
locks. Comparison includes implementation details, performance
ranking and scalability issues. We aim at understanding the effects
the parallelization schemes have on the execution environment with
special focus on the memory system and memory access
characteristics.
Abstract: In this paper, 3X3 routing nodes are proposed to
provide speedup and parallel processing capability in Data Vortex
network architectures. The new design not only significantly
improves network throughput and latency, but also eliminates the
need for distributive traffic control mechanism originally embedded
among nodes and the need for nodal buffering. The cost effectiveness
is studied by a comparison study with the previously proposed 2-
input buffered networks, and considerable performance enhancement
can be achieved with similar or lower cost of hardware. Unlike
previous implementation, the network leaves small probability of
contention, therefore, the packet drop rate must be kept low for such
implementation to be feasible and attractive, and it can be achieved
with proper choice of operation conditions.