Optimization of SAD Algorithm on VLIW DSP

SAD (Sum of Absolute Difference) algorithm is heavily used in motion estimation which is computationally highly demanding process in motion picture encoding. To enhance the performance of motion picture encoding on a VLIW processor, an efficient implementation of SAD algorithm on the VLIW processor is essential. SAD algorithm is programmed as a nested loop with a conditional branch. In VLIW processors, loop is usually optimized by software pipelining, but researches on optimal scheduling of software pipelining for nested loops, especially nested loops with conditional branches are rare. In this paper, we propose an optimal scheduling and implementation of SAD algorithm with conditional branch on a VLIW DSP processor. The proposed optimal scheduling first transforms the nested loop with conditional branch into a single loop with conditional branch with consideration of full utilization of ILP capability of the VLIW processor and realization of earlier escape from the loop. Next, the proposed optimal scheduling applies a modulo scheduling technique developed for single loop. Based on this optimal scheduling strategy, optimal implementation of SAD algorithm on TMS320C67x, a VLIW DSP is presented. Through experiments on TMS320C6713 DSK, it is shown that H.263 encoder with the proposed SAD implementation performs better than other H.263 encoder with other SAD implementations, and that the code size of the optimal SAD implementation is small enough to be appropriate for embedded environments.




References:
[1] P. G. Paulin, et al., "Embedded Software in Real-Time Signal Processing
Systems: Application and Architecture Trends," Proc. of IEEE, Vol. 85,
No.3, pp. 419-435, Mar. 1997.
[2] M. Budagavi et. al., "Wireless MPEG-4 Video Communication on DSP
chips," IEEE Signal Processing Magazine, pp. 36-53, Jan. 2000.
[3] J. Hennessy and D. Patterson, Computer Architecture A Quantitative
Approach, 3rd ed., MK Pub. 2003.
[4] J. A. Fisher, P. Farabosch, and C. Young, Embedded Computing A VLIW
Approach to Architecture, Compilers, and Tools, Morgan Kaufmann,
2004.
[5] K. Muthukumar and G. Doshi, "Software pipelining of nested loops," In
R. Wilhelm, editor, CC 2001, LNCS 2027, pages 165-181.
Springer-Verlag, Berlin Heidelberg, 2001.
[6] R. Scales, Nested loop optimization on the TMS320C6x. Application
Report SPRA519, Texas Instruments, Feb. 1999.
[7] Q. Zhuge, Z. Shao, and E. H.-M. Sha, "Optimization of Nest-Loop
Software Pipelining," Submitted paper, http://www.utdallas.edu/~edsha/
[8] G. Gupta, and C. Chakrabarti, "Architectures for hierarchical and other
block matching algorithms," Circuits and Systems for Video Technology,
IEEE Transactions on Volume 5, Issue 6, pp. 477-489, Dec. 1995.
[9] W. Hwang, Y. Lu, Y. Zeng, "Fast block-matching algorithm for video
coding," Electronics Letters Volume 33, Issue 10, pp. 833 - 835, May
1997
[10] D. Talla, L.K. John, V. Lapinskii, and B.L. Evans, "Evaluating signal
processing and multimedia applications on SIMD, VLIW and Superscalar
architectures," Proceedings of 2000 International Conference on
Computer Design, pp. 163-172, Sept. 2000.
[11] S. M. Akramullah, I. Ahmad, I. and M.L. Liou, "Optimization of H.263
video encoding using a single processor computer: performance tradeoffs
and benchmarking," Circuits and Systems for Video Technology, IEEE
Transactions on Volume 11, Issue 8, pp. 901-915, Aug. 2001.
[12] H. Miyazawa, H.263 Encoder: TMS320C6000 Implementation,
Application Report SPRA721, Texas Instruments, December, 2000
[13] O. Lehtoranta, T. Hamalainen, J. Saarinen, "Real-time H.263 encoding of
QCIF-images on TMS320C6201 fixed point DSP," Circuits and Systems,
Proceedings of IEEE International Symposium on Circuits and Systems,
Volume 1, 28-31 pp. 583 - 586, 2000.
[14] S. Bangerjee, et.al., "VLIW DSP vs. Superscalar Implementation of a
Baseline H.263 Video Encoder," 34th IEEE Alsilomar Conf. on Signals,
Systems and Computers, vol. 2, pp. 1665-1669, 2000.
[15] TMS320C62x Image Library, http://focus.ti.com/docs/toolsw/ folders/
print/sprc093.html
[16] B. Erol, F. Kossentini, and H. Alnuweiri, "Implementation of a fast
H.263+ encoder/decoder," Proc. IEEE Asilomar Conf. Alsilomar Conf. on
Signals, Systems and Computers, vol.1, pp.462-466, Nov. 1998.
[17] Iain E G Richardson, Video CODEC Design Developing image and video
compression systems, John Wiley & Sons, 2002.
[18] M. S. Lam, "Software pipelining: An effective scheduling technique for
VLIW machines," in Proc. ACM SIGPLAN 1988 Conference on
Programming Language Design and Implementation, pp. 318-328, Jun.
1988.
[19] B. R. Rau, "Iterative Modulo Scheduling," International Journal of
Parallel Processing, Vol. 24, No. 1, Feb. 1996.
[20] TMS320C6713 Datasheet, No. SPRU186D, Texas Instruments, May,
2003.
[21] TMS320C6000 CPU and Instruction Set Reference Guide, No.
SPRU189F, Texas Instruments, Oct., 2000.