Paper Title

RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems

Authors

Alessandro Ottino, Joshua Benjamin, Georgios Zervas

Abstract

Distributed deep learning (DDL) systems strongly depend on network performance. Current electronic packet switched (EPS) network architectures and technologies suffer from variable diameter topologies, low bisection bandwidth and over-subscription, affecting the completion time of communication and collective operations. We introduce a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration called RAMP, which supports large-scale distributed and parallel computing systems (12.8~Tbps per node for up to 65,536 nodes). For the first time, a custom RAMP-x MPI strategy and a network transcoder are proposed to run MPI collective operations across the optical circuit switched (OCS) network in a schedule-less and contention-less manner. RAMP achieves 7.6-171$\times$ speed-up in completion time across all MPI operations compared to realistic EPS and OCS counterparts. It can also deliver a 1.3-16$\times$ and 7.8-58$\times$ reduction in Megatron and DLRM training time respectively, while offering 42-53$\times$ and 3.3-12.4$\times$ improvement in energy consumption and cost respectively.
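For readers unfamiliar with the collectives the abstract refers to, the sketch below is a minimal, generic MPI example in C showing the kind of operation (here MPI_Allreduce, central to data-parallel gradient exchange in DDL) whose completion time RAMP accelerates. It uses only the standard MPI API and is not the paper's RAMP-x strategy or network transcoder.

```c
/* Minimal illustration of an MPI collective (MPI_Allreduce), the class of
 * operation whose completion time RAMP targets. Generic MPI only; this is
 * not the paper's RAMP-x strategy or network transcoder. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes a local gradient-like value; the allreduce
     * sums them across all ranks, as in data-parallel DDL training. */
    double local = (double)rank;
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %f\n", size, global);

    MPI_Finalize();
    return 0;
}
```

Compile and run with an MPI toolchain, e.g. `mpicc allreduce.c -o allreduce && mpirun -np 4 ./allreduce`.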
