Paper Title

Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms

Paper Authors

Saeed Rashidi, Matthew Denton, Srinivas Sridharan, Sudarshan Srinivasan, Amoghavarsha Suresh, Jade Ni, Tushar Krishna

Paper Abstract

Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth. However, as we identify in this work, driving this bandwidth is quite challenging. This is because there is a pernicious balance between using the accelerator's compute and memory for both DL computations and communication. This work makes two key contributions. First, via real system measurements and detailed modeling, we provide an understanding of compute and memory bandwidth demands for DL compute and comms. Second, we propose a novel DL collective communication accelerator called Accelerator Collectives Engine (ACE) that sits alongside the compute and networking engines at the accelerator endpoint. ACE frees up the endpoint's compute and memory resources for DL compute, which in turn reduces the required memory BW by 3.5X on average to drive the same network BW compared to state-of-the-art baselines. For modern DL workloads and different network sizes, ACE, on average, increases the effective network bandwidth utilization by 1.44X (up to 2.67X), resulting in an average of 1.41X (up to 1.51X), 1.12X (up to 1.17X), and 1.13X (up to 1.19X) speedup in iteration time for ResNet-50, GNMT and DLRM when compared to the best baseline configuration, respectively.
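
For context, the compute-communication overlap that the paper targets is typically exposed to software as non-blocking collectives: gradient buckets of layers that have already finished backpropagation are all-reduced while compute continues on earlier layers. The sketch below is illustrative only and is not the paper's ACE hardware; it assumes a PyTorch/NCCL setup launched with torchrun, and the bucket sizes and LOCAL_RANK handling are assumptions, not details from the paper. It shows the software-visible overlap pattern; ACE's contribution is to execute such collectives in a dedicated engine at the endpoint so they stop consuming the accelerator's compute and memory bandwidth.

```python
# Minimal sketch, NOT the paper's ACE hardware: it only illustrates the
# software-visible compute-communication overlap pattern (non-blocking
# all-reduce of gradient buckets) that an offload engine like ACE accelerates.
# Bucket sizes and the LOCAL_RANK/torchrun launch are illustrative assumptions.
import os
import torch
import torch.distributed as dist

def backward_with_overlap(grad_buckets):
    """Kick off an async all-reduce as soon as each gradient bucket is ready,
    so communication proceeds while the remaining backward compute runs."""
    handles = []
    for bucket in grad_buckets:  # buckets become ready back-to-front during backprop
        handles.append(dist.all_reduce(bucket, op=dist.ReduceOp.SUM, async_op=True))
        # ... backward compute for earlier layers would execute here, overlapping
        #     with the in-flight all-reduce on the interconnect ...
    for work, bucket in zip(handles, grad_buckets):  # synchronize before the optimizer step
        work.wait()
        bucket /= dist.get_world_size()  # average gradients across ranks

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # expects torchrun-style env vars
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    buckets = [torch.randn(1 << 20, device="cuda") for _ in range(4)]
    backward_with_overlap(buckets)
    dist.destroy_process_group()
```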
