Synthesizing Optimal Collective Algorithms

Paper title

Synthesizing Optimal Collective Algorithms

Authors

Zixian Cai, Zhengyang Liu, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi

Abstract

Collective communication algorithms are an important component of distributed computation. Indeed, in the case of deep learning, collective communication is the Amdahl's-law bottleneck of data-parallel training. This paper introduces SCCL (for Synthesized Collective Communication Library), a systematic approach to synthesizing collective communication algorithms that are explicitly tailored to a particular hardware topology. SCCL synthesizes algorithms along the Pareto frontier spanning from latency-optimal to bandwidth-optimal implementations of a collective. The paper demonstrates how to encode SCCL's synthesis as a quantifier-free SMT formula that can be discharged to a theorem prover. We further demonstrate how to scale our synthesis by exploiting symmetries in topologies and collectives. We synthesize and introduce novel latency- and bandwidth-optimal algorithms not seen in the literature on two popular hardware topologies. We also show how SCCL efficiently lowers algorithms to implementations on two hardware architectures (NVIDIA and AMD) and demonstrate competitive performance with hand-optimized collective communication libraries.
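To make the two ends of the frontier concrete, the following is a minimal Python sketch of the lower bounds that a latency-optimal and a bandwidth-optimal Allgather would have to meet on a given topology. It is not SCCL's SMT encoding. The bidirectional ring, the function names, and the cost assumptions (each node contributes one chunk, and each link delivers one chunk per round) are hypothetical and chosen only for illustration.

```python
from collections import deque
from math import ceil

def ring_topology(n):
    """Hypothetical bidirectional ring: node i links to i-1 and i+1 (mod n)."""
    return {i: sorted({(i - 1) % n, (i + 1) % n} - {i}) for i in range(n)}

def hop_distances(links, src):
    """Breadth-first search: number of hops from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in links[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def min_steps_allgather(links):
    """Latency lower bound: every chunk must travel the longest shortest path."""
    return max(max(hop_distances(links, s).values()) for s in links)

def min_rounds_allgather(links):
    """Bandwidth lower bound (assumption: one chunk per incoming link per round).

    Each node must receive n - 1 chunks and can accept at most
    in_degree chunks per round.
    """
    n = len(links)
    in_degree = {v: 0 for v in links}
    for u in links:
        for v in links[u]:
            in_degree[v] += 1
    return max(ceil((n - 1) / in_degree[v]) for v in links)

ring = ring_topology(8)
print(min_steps_allgather(ring), min_rounds_allgather(ring))
```

A synthesizer in the spirit of the abstract would then search for concrete send schedules that achieve such bounds; these functions only compute the bounds and do not construct a schedule.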
