Accelerating Sparse Matrix Vector Multiplication on Many-Core GPUs

Many-core GPUs provide high compute throughput and substantial memory bandwidth; however, optimizing irregular applications such as SpMV on GPUs remains a difficult but worthwhile task. In this paper, we propose a novel method to improve the performance of SpMV on GPUs. We introduce a new storage format, HYB-R, which exploits the GPU architecture more efficiently. While building the HYB-R format, the COO portion of the matrix is partitioned recursively into an ELL portion and a COO portion, so that as many non-zeros as possible are stored in ELL format. How to partition the matrix is an important problem for the HYB-R kernel, so we also tune the partitioning parameters for higher performance. Experimental results show that our method outperforms the fastest kernel (HYB) in NVIDIA's SpMV library, with speedups of up to 17%.
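To make the recursive partitioning concrete, the following is a minimal sketch of the HYB-R idea described above. It is not the paper's implementation: the fixed per-level ELL width `k` and the `min_nnz` stopping threshold are hypothetical parameters introduced here for illustration (the paper tunes the partitioning parameters), and the kernel is written as plain Python rather than CUDA.

```python
def build_hyb_r(rows, k, min_nnz=8):
    """Recursively partition a sparse matrix into ELL slabs plus a final COO part.

    rows    : list of rows, each a list of (col, val) pairs.
    k       : ELL width per level (hypothetical fixed width; tuned in practice).
    min_nnz : stop recursing when the overflow has fewer non-zeros than this.
    Returns (slabs, coo): each slab is a k-wide, zero-padded ELL table
    (column -1 marks padding); coo is a flat list of (row, col, val) triples.
    """
    slabs = []
    while True:
        overflow = [r[k:] for r in rows]          # entries that do not fit in ELL
        ell = [r[:k] + [(-1, 0.0)] * (k - len(r[:k])) for r in rows]
        slabs.append(ell)
        if sum(len(r) for r in overflow) < min_nnz:
            coo = [(i, c, v) for i, r in enumerate(overflow) for (c, v) in r]
            return slabs, coo
        rows = overflow                           # recurse on the COO portion


def spmv_hyb_r(slabs, coo, x, n_rows):
    """Compute y = A*x from the HYB-R layout produced above."""
    y = [0.0] * n_rows
    for ell in slabs:                             # regular, ELL-style traversal
        for i, row in enumerate(ell):
            for c, v in row:
                if c >= 0:                        # skip padding entries
                    y[i] += v * x[c]
    for i, c, v in coo:                           # irregular COO remainder
        y[i] += v * x[c]
    return y
```

For example, a 3-row matrix whose longest row has 4 non-zeros, split with `k=2`, yields two ELL slabs and an empty COO remainder, so the entire product is computed by the regular ELL-style loops.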



