Paper Title

NetReduce: RDMA-Compatible In-Network Reduction for Distributed DNN Training Acceleration

Paper Authors

Shuo Liu, Qiaoling Wang, Junyi Zhang, Qinliang Lin, Yao Liu, Meng Xu, Ray C. C. Cheung, Jianfei He

Paper Abstract

We present NetReduce, a novel RDMA-compatible in-network reduction architecture to accelerate distributed DNN training. Compared to existing designs, NetReduce maintains reliable connections between end-hosts over Ethernet and does not terminate the connection in the network. The advantage of doing so is that we can fully reuse the congestion control and reliability designs of RoCE. Meanwhile, we do not need to implement a high-cost network protocol processing stack in the switch, as IB does. The prototype, implemented on FPGA, is an out-of-the-box solution that requires no modification to commodity devices such as NICs or switches. For coordination between the end-host and the switch, NetReduce customizes the transport protocol only on the first packet of a data message to comply with RoCE v2. A dedicated status monitoring module is designed to reuse the reliability mechanism of RoCE v2 for handling packet loss. A message-level credit-based flow control algorithm is also proposed to fully utilize bandwidth and avoid buffer overflow. We study the effects of intra-machine bandwidth on training performance in multi-machine multi-GPU scenarios and give sufficient conditions for hierarchical NetReduce to outperform other algorithms. We also extend the design from rack-level aggregation to the more general spine-leaf topology in the data center. NetReduce accelerates training by up to 1.7x and 1.5x for CNN-based CV and transformer-based NLP tasks, respectively. Simulations on large-scale systems indicate the superior scalability of NetReduce compared to the state-of-the-art ring all-reduce.
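
As a concrete illustration of the traffic argument behind these results, below is a minimal Python sketch, not code from the paper: ring all-reduce moves 2(N-1)/N * M bytes per host in each direction, while switch-side reduction moves only M, and the switch simply sums one gradient message from every port before multicasting the result. The `ToySwitch` class, its `on_message` and `can_accept` methods, and the credit count are all hypothetical stand-ins; the real design aggregates in FPGA at packet granularity and reuses RoCE v2's reliability and congestion control, none of which is modeled here.

```python
# Illustrative sketch only -- not code from the NetReduce paper. It models two
# things the abstract claims: (1) the per-host traffic advantage of in-network
# reduction over ring all-reduce, and (2) the switch-side "sum, then multicast"
# step with a message-level credit check. All names here are hypothetical.

import numpy as np


def ring_allreduce_traffic(msg_bytes: float, n_hosts: int) -> float:
    """Bytes each host sends (and receives) under ring all-reduce:
    2 * (n - 1) / n * M, which approaches 2M as n grows."""
    return 2 * (n_hosts - 1) / n_hosts * msg_bytes


def in_network_reduce_traffic(msg_bytes: float) -> float:
    """With switch-side reduction, each host link carries the gradient up
    once (M) and the reduced result down once (M): M per direction."""
    return msg_bytes


class ToySwitch:
    """Toy aggregator: accumulate a gradient message from every port, then
    release the sum (which would be multicast back to all hosts)."""

    def __init__(self, n_ports: int, credits: int = 4):
        self.n_ports = n_ports
        self.credits = credits  # max un-reduced messages a host may have in flight
        self.pending = {}       # msg_id -> [accumulator, contributions_seen]

    def can_accept(self, outstanding: int) -> bool:
        """Message-level credit check: bounds switch buffering by refusing
        new messages from a host that has used up its credits."""
        return outstanding < self.credits

    def on_message(self, msg_id: int, grad: np.ndarray):
        entry = self.pending.setdefault(msg_id, [np.zeros_like(grad), 0])
        entry[0] += grad
        entry[1] += 1
        if entry[1] == self.n_ports:            # all workers contributed
            return self.pending.pop(msg_id)[0]  # reduced gradient, ready to multicast
        return None                             # still waiting for other ports


if __name__ == "__main__":
    M, N = 100e6, 8  # a 100 MB gradient message, 8 workers
    print(f"ring all-reduce: {ring_allreduce_traffic(M, N) / 1e6:.0f} MB per host, per direction")
    print(f"in-network     : {in_network_reduce_traffic(M) / 1e6:.0f} MB per host, per direction")

    sw = ToySwitch(n_ports=4)
    assert sw.can_accept(outstanding=0)
    result = None
    for worker in range(4):
        result = sw.on_message(msg_id=0, grad=np.full(4, float(worker)))
    print("reduced:", result)  # [6. 6. 6. 6.] = 0 + 1 + 2 + 3
```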
