Abstract: Scheduling algorithms are used in operating systems
to optimize the usage of processors. One of the most efficient
algorithms for scheduling is Multi-Layer Feedback Queue (MLFQ)
algorithm which uses several queues with different quanta. The most
important weakness of this method is the inability to define the
optimized the number of the queues and quantum of each queue. This
weakness has been improved in IMLFQ scheduling algorithm.
Number of the queues and quantum of each queue affect the response
time directly. In this paper, we review the IMLFQ algorithm for
solving these problems and minimizing the response time. In this
algorithm Recurrent Neural Network has been utilized to find both
the number of queues and the optimized quantum of each queue.
Also in order to prevent any probable faults in processes' response
time computation, a new fault tolerant approach has been presented.
In this approach we use combinational software redundancy to
prevent the any probable faults. The experimental results show that
using the IMLFQ algorithm results in better response time in
comparison with other scheduling algorithms also by using fault
tolerant mechanism we improve IMLFQ performance.
Abstract: A lot of Scientific and Engineering problems require the solution of large systems of linear equations of the form bAx in an effective manner. LU-Decomposition offers good choices for solving this problem. Our approach is to find the lower bound of processing elements needed for this purpose. Here is used the so called Omega calculus, as a computational method for solving problems via their corresponding Diophantine relation. From the corresponding algorithm is formed a system of linear diophantine equalities using the domain of computation which is given by the set of lattice points inside the polyhedron. Then is run the Mathematica program DiophantineGF.m. This program calculates the generating function from which is possible to find the number of solutions to the system of Diophantine equalities, which in fact gives the lower bound for the number of processors needed for the corresponding algorithm. There is given a mathematical explanation of the problem as well. Keywordsgenerating function, lattice points in polyhedron, lower bound of processor elements, system of Diophantine equationsand : calculus.
Abstract: Fully customized hardware based technology provides high performance and low power consumption by specializing the tasks in hardware but lacks design flexibility since any kind of changes require re-design and re-fabrication. Software based solutions operate with software instructions due to which a great flexibility is achieved from the easy development and maintenance of the software code. But this execution of instructions introduces a high overhead in performance and area consumption. In past few decades the reconfigurable computing domain has been introduced which overcomes the traditional trades-off between flexibility and performance and is able to achieve high performance while maintaining a good flexibility. The dramatic gains in terms of chip performance and design flexibility achieved through the reconfigurable computing systems are greatly dependent on the design of their computational units being integrated with reconfigurable logic resources. The computational unit of any reconfigurable system plays vital role in defining its strength. In this research paper an RFU based computational unit design has been presented using the tightly coupled, multi-threaded reconfigurable cores. The proposed design has been simulated for VLIW based architectures and a high gain in performance has been observed as compared to the conventional computing systems.
Abstract: Decrease in hardware costs and advances in computer
networking technologies have led to increased interest in the use of
large-scale parallel and distributed computing systems. One of the
biggest issues in such systems is the development of effective
techniques/algorithms for the distribution of the processes/load of a
parallel program on multiple hosts to achieve goal(s) such as
minimizing execution time, minimizing communication delays,
maximizing resource utilization and maximizing throughput.
Substantive research using queuing analysis and assuming job
arrivals following a Poisson pattern, have shown that in a multi-host
system the probability of one of the hosts being idle while other host
has multiple jobs queued up can be very high. Such imbalances in
system load suggest that performance can be improved by either
transferring jobs from the currently heavily loaded hosts to the lightly
loaded ones or distributing load evenly/fairly among the hosts .The
algorithms known as load balancing algorithms, helps to achieve the
above said goal(s). These algorithms come into two basic categories -
static and dynamic. Whereas static load balancing algorithms (SLB)
take decisions regarding assignment of tasks to processors based on
the average estimated values of process execution times and
communication delays at compile time, Dynamic load balancing
algorithms (DLB) are adaptive to changing situations and take
decisions at run time.
The objective of this paper work is to identify qualitative
parameters for the comparison of above said algorithms. In future this
work can be extended to develop an experimental environment to
study these Load balancing algorithms based on comparative
parameters quantitatively.
Abstract: Sorting appears the most attention among all computational tasks over the past years because sorted data is at the heart of many computations. Sorting is of additional importance to parallel computing because of its close relation to the task of routing data among processes, which is an essential part of many parallel algorithms. Many parallel sorting algorithms have been investigated for a variety of parallel computer architectures. In this paper, three parallel sorting algorithms have been implemented and compared in terms of their overall execution time. The algorithms implemented are the odd-even transposition sort, parallel merge sort and parallel rank sort. Cluster of Workstations or Windows Compute Cluster has been used to compare the algorithms implemented. The C# programming language is used to develop the sorting algorithms. The MPI (Message Passing Interface) library has been selected to establish the communication and synchronization between processors. The time complexity for each parallel sorting algorithm will also be mentioned and analyzed.
Abstract: The control of commutation of switched reluctance
(SR) motor has nominally depended on a physical position detector.
The physical rotor position sensor limits robustness and increases
size and inertia of the SR drive system. The paper describes a method
to overcome these limitations by using magnetization characteristics
of the motor to indicate rotor and stator teeth overlap status. The
method is using active current probing pulses of same magnitude that
is used to simulate flux linkage in the winding being probed. A
microprocessor is used for processing magnetization data to deduce
rotor-stator teeth overlap status and hence rotor position. However,
the back-of-core saturation and mutual coupling introduces overlap
detection errors, hence that of commutation control. This paper
presents the concept of the detection scheme and the effects of backof
core saturation.
Abstract: An efficient parallel form in digital signal processor can improve the algorithm performance. The butterfly structure is an important role in fast Fourier transform (FFT), because its symmetry form is suitable for hardware implementation. Although it can perform a symmetric structure, the performance will be reduced under the data-dependent flow characteristic. Even though recent research which call as novel memory reference reduction methods (NMRRM) for FFT focus on reduce memory reference in twiddle factor, the data-dependent property still exists. In this paper, we propose a parallel-computing approach for FFT implementation on digital signal processor (DSP) which is based on data-independent property and still hold the property of low-memory reference. The proposed method combines final two steps in NMRRM FFT to perform a novel data-independent structure, besides it is very suitable for multi-operation-unit digital signal processor and dual-core system. We have applied the proposed method of radix-2 FFT algorithm in low memory reference on TI TMSC320C64x DSP. Experimental results show the method can reduce 33.8% clock cycles comparing with the NMRRM FFT implementation and keep the low-memory reference property.
Abstract: Heavy rainfall greatly affects the aerodynamic performance of the aircraft. There are many accidents of aircraft caused by aerodynamic efficiency degradation by heavy rain.
In this Paper we have studied the heavy rain effects on the aerodynamic efficiency of cambered NACA 64-210 and symmetric
NACA 0012 airfoils. Our results show significant increase in drag and decrease in lift. We used preprocessing software gridgen for creation of geometry and mesh, used fluent as solver and techplot as postprocessor. Discrete phase modeling called DPM is used to model the rain particles using two phase flow approach. The rain particles are assumed to be inert.
Both airfoils showed significant decrease in lift and increase in drag in simulated rain environment. The most significant difference between these two airfoils was the NACA 64-210 more sensitivity than NACA 0012 to liquid water content (LWC). We believe that the results showed in this paper will be useful for the designer of the commercial aircrafts and UAVs, and will be helpful for training of the pilots to control the airplanes in heavy rain.
Abstract: With the increasing number of on-chip components and the critical requirement for processing power, Chip Multiprocessor (CMP) has gained wide acceptance in both academia and industry during the last decade. However, the conventional bus-based onchip communication schemes suffer from very high communication delay and low scalability in large scale systems. Network-on-Chip (NoC) has been proposed to solve the bottleneck of parallel onchip communications by applying different network topologies which separate the communication phase from the computation phase. Observing that the memory bandwidth of the communication between on-chip components and off-chip memory has become a critical problem even in NoC based systems, in this paper, we propose a novel 3D NoC with on-chip Dynamic Random Access Memory (DRAM) in which different layers are dedicated to different functionalities such as processors, cache or memory. Results show that, by using our proposed architecture, average link utilization has reduced by 10.25% for SPLASH-2 workloads. Our proposed design costs 1.12% less execution cycles than the traditional design on average.
Abstract: Embedded systems need to respect stringent real
time constraints. Various hardware components included in such
systems such as cache memories exhibit variability and therefore
affect execution time. Indeed, a cache memory access from an
embedded microprocessor might result in a cache hit where the
data is available or a cache miss and the data need to be fetched
with an additional delay from an external memory. It is therefore
highly desirable to predict future memory accesses during
execution in order to appropriately prefetch data without incurring
delays. In this paper, we evaluate the potential of several artificial
neural networks for the prediction of instruction memory
addresses. Neural network have the potential to tackle the nonlinear
behavior observed in memory accesses during program
execution and their demonstrated numerous hardware
implementation emphasize this choice over traditional forecasting
techniques for their inclusion in embedded systems. However,
embedded applications execute millions of instructions and
therefore millions of addresses to be predicted. This very
challenging problem of neural network based prediction of large
time series is approached in this paper by evaluating various neural
network architectures based on the recurrent neural network
paradigm with pre-processing based on the Self Organizing Map
(SOM) classification technique.
Abstract: Many accidents were happened because of fast driving, habitual working overtime or tired spirit. This paper presents a solution of remote warning for vehicles collision avoidance using vehicular communication. The development system integrates dedicated short range communication (DSRC) and global position system (GPS) with embedded system into a powerful remote warning system. To transmit the vehicular information and broadcast vehicle position; DSRC communication technology is adopt as the bridge. The proposed system is divided into two parts of the positioning andvehicular units in a vehicle. The positioning unit is used to provide the position and heading information from GPS module, and furthermore the vehicular unit is used to receive the break, throttle, and othersignals via controller area network (CAN) interface connected to each mechanism. The mobile hardware are built with an embedded system using X86 processor in Linux system. A vehicle is communicated with other vehicles via DSRC in non-addressed protocol with wireless access in vehicular environments (WAVE) short message protocol. From the position data and vehicular information, this paper provided a conflict detection algorithm to do time separation and remote warning with error bubble consideration. And the warning information is on-line displayed in the screen. This system is able to enhance driver assistance service and realize critical safety by using vehicular information from the neighbor vehicles.KeywordsDedicated short range communication, GPS, Control area network, Collision avoidance warning system.
Abstract: This paper describes an efficient hardware implementation of a new technique for interfacing the data exchange between the microprocessor-based systems and the external devices. This technique, based on the use of software/hardware system and a reduced physical address, enlarges the interfacing capacity of the microprocessor-based systems, uses the Direct Memory Access (DMA) to increases the frequency of the new bus, and improves the speed of data exchange. While using this architecture in microprocessor-based system or in computer, the input of the hardware part of our system will be connected to the bus system, and the output, which is a new bus, will be connected to an external device. The new bus is composed of a data bus, a control bus and an address bus. A Xilinx Integrated Software Environment (ISE) 7.1i has been used for the programmable logic implementation.
Abstract: Emergence of smartphones brings to live the concept
of converged devices with the availability of web amenities. Such
trend also challenges the mobile devices manufactures and service
providers in many aspects, such as security on mobile phones,
complex and long time design flow, as well as higher development
cost. Among these aspects, security on mobile phones is getting more
and more attention. Microkernel based virtualization technology will
play a critical role in addressing these challenges and meeting mobile
market needs and preferences, since virtualization provides essential
isolation for security reasons and it allows multiple operating systems
to run on one processor accelerating development and cutting development
cost. However, virtualization benefits do not come for free.
As an additional software layer, it adds some inevitable virtualization
overhead to the system, which may decrease the system performance.
In this paper we evaluate and analyze the virtualization performance
cost of L4 microkernel based virtualization on a competitive mobile
phone by comparing the L4Linux, a para-virtualized Linux on top of
L4 microkernel, with the native Linux performance using lmbench
and a set of typical mobile phone applications.
Abstract: Speedups from mapping four real-life DSP
applications on an embedded system-on-chip that couples coarsegrained
reconfigurable logic with an instruction-set processor are
presented. The reconfigurable logic is realized by a 2-Dimensional
Array of Processing Elements. A design flow for improving
application-s performance is proposed. Critical software parts, called
kernels, are accelerated on the Coarse-Grained Reconfigurable
Array. The kernels are detected by profiling the source code. For
mapping the detected kernels on the reconfigurable logic a prioritybased
mapping algorithm has been developed. Two 4x4 array
architectures, which differ in their interconnection structure among
the Processing Elements, are considered. The experiments for eight
different instances of a generic system show that important overall
application speedups have been reported for the four applications.
The performance improvements range from 1.86 to 3.67, with an
average value of 2.53, compared with an all-software execution.
These speedups are quite close to the maximum theoretical speedups
imposed by Amdahl-s law.
Abstract: Grid computing is growing rapidly in the distributed
heterogeneous systems for utilizing and sharing large-scale resources
to solve complex scientific problems. Scheduling is the most recent
topic used to achieve high performance in grid environments. It aims
to find a suitable allocation of resources for each job. A typical
problem which arises during this task is the decision of scheduling. It
is about an effective utilization of processor to minimize tardiness
time of a job, when it is being scheduled. This paper, therefore,
addresses the problem by developing a general framework of grid
scheduling using dynamic information and an ant colony
optimization algorithm to improve the decision of scheduling. The
performance of various dispatching rules such as First Come First
Served (FCFS), Earliest Due Date (EDD), Earliest Release Date
(ERD), and an Ant Colony Optimization (ACO) are compared.
Moreover, the benefit of using an Ant Colony Optimization for
performance improvement of the grid Scheduling is also discussed. It
is found that the scheduling system using an Ant Colony
Optimization algorithm can efficiently and effectively allocate jobs
to proper resources.
Abstract: In order to protect original data, watermarking is first consideration direction for digital information copyright. In addition, to achieve high quality image, the algorithm maybe can not run on embedded system because the computation is very complexity. However, almost nowadays algorithms need to build on consumer production because integrator circuit has a huge progress and cheap price. In this paper, we propose a novel algorithm which efficient inserts watermarking on digital image and very easy to implement on digital signal processor. In further, we select a general and cheap digital signal processor which is made by analog device company to fit consumer application. The experimental results show that the image quality by watermarking insertion can achieve 46 dB can be accepted in human vision and can real-time execute on digital signal processor.
Abstract: Unlike general-purpose processors, digital signal
processors (DSP processors) are strongly application-dependent. To
meet the needs for diverse applications, a wide variety of DSP
processors based on different architectures ranging from the
traditional to VLIW have been introduced to the market over the
years. The functionality, performance, and cost of these processors
vary over a wide range. In order to select a processor that meets the
design criteria for an application, processor performance is usually
the major concern for digital signal processing (DSP) application
developers. Performance data are also essential for the designers of
DSP processors to improve their design. Consequently, several DSP
performance benchmarks have been proposed over the past decade or
so. However, none of these benchmarks seem to have included recent
new DSP applications.
In this paper, we use a new benchmark that we recently developed
to compare the performance of popular DSP processors from Texas
Instruments and StarCore. The new benchmark is based on the
Selectable Mode Vocoder (SMV), a speech-coding program from the
recent third generation (3G) wireless voice applications. All
benchmark kernels are compiled by the compilers of the respective
DSP processors and run on their simulators. Weighted arithmetic
mean of clock cycles and arithmetic mean of code size are used to
compare the performance of five DSP processors.
In addition, we studied how the performance of a processor is
affected by code structure, features of processor architecture and
optimization of compiler. The extensive experimental data gathered,
analyzed, and presented in this paper should be helpful for DSP
processor and compiler designers to meet their specific design goals.
Abstract: This paper presents a tested research concept that
implements a complex evolutionary algorithm, genetic algorithm
(GA), in a multi-microcontroller environment. Parallel Distributed
Genetic Algorithm (PDGA) is employed in adaptive beam forming
technique to reduce power usage of adaptive antenna at WCDMA
base station. Adaptive antenna has dynamic beam that requires more
advanced beam forming algorithm such as genetic algorithm which
requires heavy computation and memory space. Microcontrollers are
low resource platforms that are normally not associated with GAs,
which are typically resource intensive. The aim of this project was to
design a cooperative multiprocessor system by expanding the role of
small scale PIC microcontrollers to optimize WCDMA base station
transmitter power. Implementation results have shown that PDGA
multi-microcontroller system returned optimal transmitted power
compared to conventional GA.
Abstract: Multiprocessor task scheduling is a NP-hard problem and Genetic Algorithm (GA) has been revealed as an excellent technique for finding an optimal solution. In the past, several methods have been considered for the solution of this problem based on GAs. But, all these methods consider single criteria and in the present work, minimization of the bi-criteria multiprocessor task scheduling problem has been considered which includes weighted sum of makespan & total completion time. Efficiency and effectiveness of genetic algorithm can be achieved by optimization of its different parameters such as crossover, mutation, crossover probability, selection function etc. The effects of GA parameters on minimization of bi-criteria fitness function and subsequent setting of parameters have been accomplished by central composite design (CCD) approach of response surface methodology (RSM) of Design of Experiments. The experiments have been performed with different levels of GA parameters and analysis of variance has been performed for significant parameters for minimisation of makespan and total completion time simultaneously.
Abstract: Modular multiplication is the basic operation
in most public key cryptosystems, such as RSA, DSA, ECC,
and DH key exchange. Unfortunately, very large operands
(in order of 1024 or 2048 bits) must be used to provide
sufficient security strength. The use of such big numbers
dramatically slows down the whole cipher system, especially
when running on embedded processors.
So far, customized hardware accelerators - developed on
FPGAs or ASICs - were the best choice for accelerating
modular multiplication in embedded environments. On the
other hand, many algorithms have been developed to speed
up such operations. Examples are the Montgomery modular
multiplication and the interleaved modular multiplication
algorithms. Combining both customized hardware with
an efficient algorithm is expected to provide a much faster
cipher system.
This paper introduces an enhanced architecture for computing
the modular multiplication of two large numbers X
and Y modulo a given modulus M. The proposed design is
compared with three previous architectures depending on
carry save adders and look up tables. Look up tables should
be loaded with a set of pre-computed values. Our proposed
architecture uses the same carry save addition, but replaces
both look up tables and pre-computations with an enhanced
version of sign detection techniques. The proposed architecture
supports higher frequencies than other architectures.
It also has a better overall absolute time for a single operation.