Paper Title
Hierarchical Jacobi Iteration for Structured Matrices on GPUs using Shared Memory
Paper Authors
Paper Abstract
High-fidelity scientific simulations modeling physical phenomena typically require solving the large linear systems of equations that result from discretization of a partial differential equation (PDE) by some numerical method. This step often takes a vast amount of computational time to complete and therefore presents a bottleneck in simulation work. Solving these linear systems efficiently requires the use of massively parallel hardware with high computational throughput, as well as the development of algorithms that respect the memory hierarchy of these hardware architectures to achieve high memory bandwidth. In this paper, we present an algorithm to accelerate Jacobi iteration for solving structured problems on graphics processing units (GPUs) using a hierarchical approach in which multiple iterations are performed within on-chip shared memory every cycle. A domain decomposition style procedure is adopted in which the problem domain is partitioned into subdomains whose data are copied to the shared memory of each GPU block. Jacobi iterations are performed internally within each block's shared memory, avoiding the need to perform expensive global memory accesses at every step. We test our algorithm on the linear systems arising from discretization of Poisson's equation in 1D and 2D, and observe a speedup in convergence using our shared memory approach compared to a traditional Jacobi implementation that uses only global memory on the GPU. We observe an 8x speedup in convergence for the 1D problem and a nearly 6x speedup in the 2D case from the use of shared memory compared to a conventional GPU approach.
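The hierarchical scheme described in the abstract can be sketched roughly as follows for the 1D Poisson case: each thread block copies its subdomain plus a one-point halo into shared memory, runs several Jacobi sweeps with the halo values held fixed, and writes the result back to global memory once per launch. This is a minimal CUDA sketch under simplifying assumptions (homogeneous Dirichlet boundaries, n a multiple of the block size); the names hierarchical_jacobi_1d, TILE, and INNER_SWEEPS are illustrative and not taken from the paper.

// Minimal sketch: hierarchical Jacobi for -u'' = f on [0,1] with u = 0 at both ends.
// Assumes n is a multiple of TILE so every thread owns exactly one unknown.
#include <cuda_runtime.h>
#include <cstdio>
#include <utility>
#include <vector>

constexpr int TILE = 256;        // unknowns owned by one block (subdomain size)
constexpr int INNER_SWEEPS = 16; // Jacobi sweeps done in shared memory per launch

__global__ void hierarchical_jacobi_1d(const double* u_in, double* u_out,
                                       const double* f, double h2, int n)
{
    __shared__ double s_u[TILE + 2];          // subdomain plus one halo point per side

    int g = blockIdx.x * TILE + threadIdx.x;  // global index of this thread's unknown
    int l = threadIdx.x + 1;                  // local index, shifted past the left halo

    // Copy the subdomain from global to shared memory.
    s_u[l] = u_in[g];

    // Edge threads load the halo; zero Dirichlet values are used at the domain ends.
    if (threadIdx.x == 0)
        s_u[0] = (g == 0) ? 0.0 : u_in[g - 1];
    if (threadIdx.x == TILE - 1)
        s_u[TILE + 1] = (g == n - 1) ? 0.0 : u_in[g + 1];
    __syncthreads();

    double rhs = h2 * f[g];  // right-hand side contribution, read from global memory once

    // Several Jacobi sweeps entirely in shared memory; the halo values stay
    // frozen until the next kernel launch refreshes them from global memory.
    for (int s = 0; s < INNER_SWEEPS; ++s) {
        double unew = 0.5 * (s_u[l - 1] + s_u[l + 1] + rhs);
        __syncthreads();   // all old values have been read
        s_u[l] = unew;
        __syncthreads();   // all new values have been written
    }

    // Write the updated subdomain back once per launch instead of after every sweep.
    u_out[g] = s_u[l];
}

int main()
{
    const int n = 1 << 14;                       // number of interior unknowns
    const double h = 1.0 / (n + 1), h2 = h * h;

    std::vector<double> f_h(n, 1.0), u_h(n, 0.0);  // f = 1, zero initial guess
    double *u0, *u1, *f_d;
    cudaMalloc(&u0, n * sizeof(double));
    cudaMalloc(&u1, n * sizeof(double));
    cudaMalloc(&f_d, n * sizeof(double));
    cudaMemcpy(u0, u_h.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(f_d, f_h.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    // Outer iterations: each launch exchanges halos through global memory,
    // then performs INNER_SWEEPS Jacobi updates in shared memory.
    for (int outer = 0; outer < 500; ++outer) {
        hierarchical_jacobi_1d<<<n / TILE, TILE>>>(u0, u1, f_d, h2, n);
        std::swap(u0, u1);
    }
    cudaMemcpy(u_h.data(), u0, n * sizeof(double), cudaMemcpyDeviceToHost);
    printf("u at midpoint: %f\n", u_h[n / 2]);

    cudaFree(u0); cudaFree(u1); cudaFree(f_d);
    return 0;
}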