Paper Title

RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems

Authors

Alessandro Ottino, Joshua Benjamin, Georgios Zervas

Abstract

Distributed deep learning (DDL) systems strongly depend on network performance. Current electronic packet switched (EPS) network architectures and technologies suffer from variable diameter topologies, low bisection bandwidth and over-subscription, affecting the completion time of communication and collective operations. We introduce a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration called RAMP, which supports large-scale distributed and parallel computing systems (12.8~Tbps per node for up to 65,536 nodes). For the first time, a custom RAMP-x MPI strategy and a network transcoder are proposed to run MPI collective operations across the optical circuit switched (OCS) network in a schedule-less and contention-less manner. RAMP achieves 7.6-171$\times$ speed-up in completion time across all MPI operations compared to realistic EPS and OCS counterparts. It can also deliver a 1.3-16$\times$ and 7.8-58$\times$ reduction in Megatron and DLRM training time respectively, while offering 42-53$\times$ and 3.3-12.4$\times$ improvement in energy consumption and cost respectively.
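For readers unfamiliar with the collectives the abstract refers to, the sketch below is a minimal, generic MPI example in C showing the kind of operation (here MPI_Allreduce, central to data-parallel gradient exchange in DDL) whose completion time RAMP accelerates. It uses only the standard MPI API and is not the paper's RAMP-x strategy or network transcoder.

```c
/* Minimal illustration of an MPI collective (MPI_Allreduce), the class of
 * operation whose completion time RAMP targets. Generic MPI only; this is
 * not the paper's RAMP-x strategy or network transcoder. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes a local gradient-like value; the allreduce
     * sums them across all ranks, as in data-parallel DDL training. */
    double local = (double)rank;
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %f\n", size, global);

    MPI_Finalize();
    return 0;
}
```

Compile and run with an MPI toolchain, e.g. `mpicc allreduce.c -o allreduce && mpirun -np 4 ./allreduce`.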
