A Parallel Approach for 3D-Variational Data Assimilation on GPUs in Ocean Circulation Models

This work is the first step in a broad research activity, carried out in collaboration with the Euro-Mediterranean Center on Climate Change, aimed at introducing scalable approaches in Ocean Circulation Models. We discuss the design and implementation of a parallel algorithm for solving the Variational Data Assimilation (DA) problem on Graphics Processing Units (GPUs). The algorithm is based on the fully scalable 3DVar DA model previously proposed by the authors, which uses a Domain Decomposition approach (we refer to this model as the DD-DA model). We proceed with an incremental porting process consisting of three distinct stages: requirements and source code analysis, incremental development of CUDA kernels, and testing and optimization. Experiments confirm the theoretical performance analysis based on the so-called scale-up factor, demonstrating that the DD-DA model can be suitably mapped onto GPU architectures.
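
For orientation, the 3DVar problem mentioned above is usually stated as the minimization of a cost functional. The form below is the standard textbook formulation, not one given in this abstract, and the symbols x, x_b, y, H, B and R are assumptions introduced here for illustration only.

```latex
\[
J(x) = \frac{1}{2}\,(x - x_b)^{T} B^{-1} (x - x_b)
     + \frac{1}{2}\,\bigl(y - H(x)\bigr)^{T} R^{-1} \bigl(y - H(x)\bigr),
\qquad
x^{a} = \arg\min_{x} J(x)
\]
```

Here x_b denotes the background state, y the observation vector, H the observation operator, and B and R the background and observation error covariance matrices. The local functionals actually minimized on each subdomain by the DD-DA model are not stated in this abstract, so they are not reproduced here.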



