Paper Title
A highly scalable particle tracking algorithm using partitioned global address space (PGAS) programming for extreme-scale turbulence simulations
Paper Authors
Paper Abstract
A new parallel algorithm utilizing the partitioned global address space (PGAS) programming model to achieve high scalability is reported for particle tracking in direct numerical simulations of turbulent flow. The work is motivated by the desire to obtain Lagrangian information necessary for the study of turbulent dispersion at the largest problem sizes feasible on current and next-generation multi-petaflop supercomputers. A large population of fluid particles is distributed among parallel processes dynamically, based on instantaneous particle positions, such that all of the interpolation information needed for each particle is available either locally on its host process or on neighboring processes holding adjacent sub-domains of the velocity field. With cubic splines as the preferred interpolation method, the new algorithm is designed to minimize the need for communication by transferring between adjacent processes only those spline coefficients determined to be necessary for specific particles. This transfer is implemented very efficiently as one-sided communication, using Co-Array Fortran (CAF) features that facilitate small data movements between different local partitions of a large global array. Detailed benchmarks are obtained on Blue Waters, the Cray petascale supercomputer at the University of Illinois at Urbana-Champaign. For operations on the particles in an $8192^3$ simulation ($0.55$ trillion grid points) on $262,144$ Cray XE6 cores, the new algorithm is found to be orders of magnitude faster than a prior algorithm in which each particle is tracked by the same parallel process at all times. Improving support for PGAS models in major compilers suggests that this algorithm will be widely applicable on most upcoming supercomputers.
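To make the communication pattern concrete, the following minimal Co-Array Fortran sketch illustrates the kind of one-sided transfer the abstract describes: each image owns one local partition of the spline-coefficient array as a coarray, and the image hosting a particle pulls only the small 4x4x4 patch of coefficients needed for one cubic-spline evaluation from a neighboring image, with no action required on the neighbor's side. This is an illustrative sketch under assumed names and sizes (coeff, patch, nb, n), not the authors' implementation.

program caf_one_sided_sketch
  implicit none
  ! Hypothetical size: each image owns an n x n x n block of spline coefficients.
  integer, parameter :: n = 16
  real, allocatable  :: coeff(:,:,:)[:]   ! coarray: one local partition per image
  real    :: patch(4,4,4)                 ! 4^3 coefficients suffice for one cubic-spline interpolation in 3D
  integer :: me, nb, i0, j0, k0

  allocate(coeff(n,n,n)[*])
  me = this_image()
  coeff = real(me)               ! fill the local partition with dummy data

  sync all                       ! ensure every image has initialized its partition

  ! Suppose a particle hosted on this image needs coefficients that live on a
  ! neighboring image nb (here simply the next image, wrapping around).
  nb = merge(1, me + 1, me == num_images())
  i0 = 5; j0 = 5; k0 = 5         ! hypothetical offset of the 4x4x4 stencil within nb's block

  ! One-sided get: pull only the patch of spline coefficients required for
  ! this particle; image nb does not participate in the transfer.
  patch = coeff(i0:i0+3, j0:j0+3, k0:k0+3)[nb]

  print *, 'image', me, 'fetched patch sum', sum(patch), 'from image', nb
end program caf_one_sided_sketch

With a coarray-enabled compiler (e.g., Cray Fortran, or gfortran with OpenCoarrays), the assignment from coeff(...)[nb] compiles to a one-sided get involving only the requesting image, which is what makes many small, particle-specific data movements cheap relative to two-sided message passing.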