Scalable communication for high-order stencil computations using CUDA-aware MPI

Johannes Pekkilä, Miikka Väisälä, Maarit Käpylä, Matthias Rheinhardt, Oskar Lappi

Research output: Contribution to journal › Article › Scientific › peer-review


Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, because arithmetic performance has been observed to increase faster than memory and network bandwidths, optimizing data movement is critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated by the introduction of graphics processing units, which can provide several times higher throughput in data-parallel tasks than central processing units. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate high-order accurate magnetohydrodynamics simulations. We put particular focus on improving intra-node locality of workloads. Based on a theoretical performance model, our implementation scales from $1$ to $64$ devices at $87\%$ efficiency in sixth-order stencil computations where the problem domain consists of $1024^3$ cells.
Original language: English
Publication status: Submitted - 2021
MoE publication type: A1 Journal article-refereed


  • High-performance computing
  • Graphics processing units
  • Stencil computations
  • Computational physics
  • Magnetohydrodynamics

