Paper Title
GSoFa: Scalable Sparse Symbolic LU Factorization on GPUs
Paper Authors
Paper Abstract
Decomposing a matrix A into a lower triangular matrix L and an upper triangular matrix U, also known as LU decomposition, is an essential operation in numerical linear algebra. For a sparse matrix, LU decomposition often introduces more nonzero entries in the L and U factors than are present in the original matrix. A symbolic factorization step is needed to identify the nonzero structures of the L and U matrices. Attracted by the enormous potential of Graphics Processing Units (GPUs), an array of efforts has deployed various LU factorization steps on GPUs, except, to the best of our knowledge, symbolic factorization. This paper introduces gSoFa, the first GPU-based symbolic factorization design, with the following three optimizations to enable scalable LU symbolic factorization for nonsymmetric-pattern sparse matrices on GPUs. First, we introduce a novel fine-grained parallel symbolic factorization algorithm that is well suited to the Single Instruction Multiple Thread (SIMT) architecture of GPUs. Second, we tailor supernode detection into a SIMT-friendly process and strive to balance the workload, minimize communication, and saturate GPU computing resources during supernode detection. Third, we introduce a three-pronged optimization to reduce the excessive space consumption faced by multi-source concurrent symbolic factorization. Taken together, gSoFa achieves up to a 31x speedup when scaling from 1 to 44 Summit nodes (6 to 264 GPUs) and outperforms the state-of-the-art CPU project by 5x on average. Notably, gSoFa also achieves up to 47% of the peak memory throughput of a V100 GPU on Summit.
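To make the role of symbolic factorization concrete, the sketch below shows the classic sequential, reachability-based way of computing the fill-in patterns of L and U for an unsymmetric matrix without pivoting: the pattern of column j of L+U is the set of vertices reachable from the nonzeros of A(:, j) through the already-computed L columns. This is only a minimal illustrative baseline for the problem gSoFa accelerates, not the paper's fine-grained GPU algorithm; the function name symbolic_lu and the column-wise list-of-rows input format are assumptions made for this example.

def symbolic_lu(A_cols, n):
    """Minimal sequential symbolic LU (no pivoting), for illustration only.

    A_cols[j] is an iterable of row indices i with A[i][j] != 0.
    Returns patterns, where patterns[j] is the sorted nonzero row pattern
    of column j of L+U, fill-in included.
    """
    L_below = [[] for _ in range(n)]   # L_below[k]: rows i > k with L[i][k] != 0
    patterns = []
    for j in range(n):
        reached, stack = set(), list(A_cols[j])
        while stack:                   # DFS over the graph of finished L columns
            i = stack.pop()
            if i in reached:
                continue
            reached.add(i)
            if i < j:                  # U(i, j) entry: fill propagates via L(:, i)
                stack.extend(L_below[i])
        patterns.append(sorted(reached))
        L_below[j] = [i for i in patterns[j] if i > j]
    return patterns

# Tiny usage example on a 3x3 pattern; elimination creates one fill-in at U(1, 2)
# because L(1, 0) and U(0, 2) are both nonzero.
cols = [[0, 1], [1], [0, 2]]
print(symbolic_lu(cols, 3))            # -> [[0, 1], [1], [0, 1, 2]]

In gSoFa, many such per-source traversals are performed concurrently on the GPU, which is what motivates the workload-balancing, supernode-detection, and memory-footprint optimizations summarized in the abstract above.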