Paper Title

Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems

Paper Authors

Prasoon Sinha, Akhil Guliani, Rutwik Jain, Brandon Tran, Matthew D. Sinclair, Shivaram Venkataraman

Paper Abstract

Scientists are increasingly exploring and utilizing the massive parallelism of general-purpose accelerators such as GPUs for scientific breakthroughs. As a result, datacenters, hyperscalers, national computing centers, and supercomputers have procured hardware to support this evolving application paradigm. These systems contain hundreds to tens of thousands of accelerators, enabling peta- and exa-scale levels of compute for scientific workloads. Recent work demonstrated that power management (PM) can impact application performance in CPU-based HPC systems, even when machines have the same architecture and SKU (stock keeping unit). This variation occurs due to manufacturing variability and the chip's PM. However, while modern HPC systems widely employ accelerators such as GPUs, it is unclear how much this variability affects applications. Accordingly, we seek to characterize the extent of variation due to GPU PM in modern HPC and supercomputing systems. We study a variety of applications that stress different GPU components on five large-scale computing centers with modern GPUs: Oak Ridge's Summit, Sandia's Vortex, TACC's Frontera and Longhorn, and Livermore's Corona. These clusters use a variety of cooling methods and GPU vendors. In total, we collect over 18,800 hours of data across more than 90% of the GPUs in these clusters. Regardless of the application, cluster, GPU vendor, and cooling method, our results show significant variation: 8% (max 22%) average performance variation even though the GPU architecture and vendor SKU are identical within each cluster, with outliers up to 1.5X slower than the median GPU. These results highlight the difficulty in efficiently using existing GPU clusters for modern HPC and scientific workloads, and the need to embrace variability in future accelerator-based systems.
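The abstract's headline numbers (average performance variation and the worst-case slowdown relative to the median GPU) are summary statistics over per-GPU measurements. The sketch below is a minimal, hypothetical illustration of how such statistics could be computed from per-GPU runtimes; it is not the authors' methodology, and the function name `summarize_variability`, the choice of the fastest GPU as the baseline, and the sample numbers are assumptions for illustration only.

```python
# Hypothetical sketch (not from the paper): given per-GPU runtimes for one
# application on one cluster, compute variability statistics of the kind the
# abstract reports -- average/max variation relative to the fastest GPU and
# the slowdown of the worst outlier versus the median GPU.
import statistics

def summarize_variability(runtimes_s):
    """runtimes_s: wall-clock times in seconds, one per GPU, same app and input."""
    best = min(runtimes_s)
    median = statistics.median(runtimes_s)
    # Per-GPU slowdown relative to the best-performing GPU in the cluster.
    slowdowns = [t / best for t in runtimes_s]
    avg_variation_pct = (statistics.mean(slowdowns) - 1.0) * 100
    max_variation_pct = (max(slowdowns) - 1.0) * 100
    worst_vs_median = max(runtimes_s) / median
    return avg_variation_pct, max_variation_pct, worst_vs_median

# Example with made-up numbers: most GPUs finish near 100 s, one slow outlier.
times = [100.0, 101.5, 103.0, 99.8, 102.2, 148.0]
avg_pct, max_pct, outlier = summarize_variability(times)
print(f"avg variation: {avg_pct:.1f}%, max: {max_pct:.1f}%, "
      f"slowest GPU is {outlier:.2f}x the median")
```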
