Paper Title
Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
Authors
Abstract
RDMA over Converged Ethernet (RoCE) has gained significant traction in datacenter networks due to its compatibility with conventional Ethernet-based fabrics. However, the RDMA protocol is efficient only on (nearly) lossless networks, which makes congestion control vital on RoCE networks. Unfortunately, the native RoCE congestion control scheme, based on Priority Flow Control (PFC), suffers from many drawbacks, such as unfairness, head-of-line blocking, and deadlock. Therefore, in recent years many schemes have been proposed to provide additional congestion control for RoCE networks and minimize PFC's drawbacks. However, these schemes were proposed for general datacenter environments. In contrast to general datacenters, which are built from commodity hardware and run general-purpose workloads, high-performance distributed training platforms deploy high-end accelerators and network components and exclusively run training workloads that communicate through collective communication libraries (All-Reduce, All-To-All). Furthermore, these platforms usually have a private network that separates their communication traffic from the rest of the datacenter traffic. Scalable topology-aware collective algorithms are inherently designed to avoid incast patterns and balance traffic optimally. These distinct features necessitate revisiting congestion control schemes previously proposed for general-purpose datacenter environments. In this paper, we thoroughly analyze some of the state-of-the-art (SOTA) RoCE congestion control schemes against PFC when running on distributed training platforms. Our results indicate that previously proposed RoCE congestion control schemes have little impact on the end-to-end performance of training workloads, motivating the design of an optimized yet low-overhead congestion control scheme based on the characteristics of distributed training platforms and workloads.
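
The statement that collective algorithms avoid incast can be made concrete with a small simulation. The sketch below assumes a ring All-Reduce, which is one common topology-aware algorithm and not necessarily the one evaluated in the paper. The function name `ring_all_reduce` and the fan-in counter are hypothetical, introduced only for illustration. It records how many senders target the same receiver in any step.

```python
from collections import Counter


def ring_all_reduce(values):
    """Simulate a sum All-Reduce on a logical ring (illustrative sketch).

    values[i] is node i's input, pre-split into n chunks (one number per
    chunk). Returns (per-node results, largest fan-in seen in any step).
    """
    n = len(values)
    data = [list(v) for v in values]
    max_fan_in = 0

    def step(chunk_of, combine):
        nonlocal max_fan_in
        # Snapshot all messages first so every send in a step is simultaneous.
        # Node i sends exactly one chunk to its right neighbour (i + 1) % n.
        msgs = [((i + 1) % n, chunk_of(i), data[i][chunk_of(i)])
                for i in range(n)]
        fan_in = Counter(dst for dst, _, _ in msgs)
        max_fan_in = max(max_fan_in, max(fan_in.values()))
        for dst, chunk, payload in msgs:
            data[dst][chunk] = combine(data[dst][chunk], payload)

    # Reduce-scatter: after n - 1 steps node i owns the full sum of
    # chunk (i + 1) % n.
    for s in range(n - 1):
        step(lambda i: (i - s) % n, lambda old, new: old + new)

    # All-gather: circulate the reduced chunks until every node has all.
    for s in range(n - 1):
        step(lambda i: (i + 1 - s) % n, lambda old, new: new)

    return data, max_fan_in


if __name__ == "__main__":
    result, fan_in = ring_all_reduce(
        [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    )
    print(result, fan_in)
```

In this sketch every receiver has a single sender per step, so the fan-in stays at 1 regardless of the number of nodes. By contrast, a naive scheme in which all nodes send to one aggregator would produce a fan-in of n - 1 at that node, which is the incast pattern that general-purpose congestion control schemes are designed to handle.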