Paper Title

Tutel: Adaptive Mixture-of-Experts at Scale

Paper Authors

Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, Joe Chau, Peng Cheng, Fan Yang, Mao Yang, Yongqiang Xiong

Paper Abstract

Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters at a fixed computational cost. The algorithmic performance of MoE relies on its token routing mechanism, which forwards each input token to the right sub-models, or experts. While token routing dynamically determines the amount of expert workload at runtime, existing systems suffer from inefficient computation due to their static execution, namely static parallelism and pipelining, which cannot adapt to the dynamic workload. We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Flex designs an identical layout for distributing MoE model parameters and input data, which can be leveraged by all possible parallelism and pipelining methods without any mathematical inequivalence or tensor migration overhead. This enables adaptive parallelism/pipelining optimization at zero cost during runtime. Based on this key design, Flex also implements various MoE acceleration techniques. Aggregating all techniques, Flex delivers a large speedup at any scale: 4.96x and 5.75x speedup of a single MoE layer on 16 and 2,048 A100 GPUs, respectively, over the previous state-of-the-art. Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture. On efficiency, Flex accelerates SwinV2-MoE, achieving up to 1.55x and 2.11x speedup in training and inference over Fairseq, respectively. On effectiveness, the SwinV2-MoE model achieves higher accuracy than its dense counterpart in both pre-training and downstream computer vision tasks such as COCO object detection, indicating the readiness of Flex for end-to-end real-world model training and inference.
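The abstract centers on sparse top-k token routing, where a gate decides at runtime which experts process each token, so per-expert workload is only known during execution. The sketch below is a minimal, single-device illustration of that routing idea in PyTorch under stated assumptions; it is not Tutel's actual implementation or API, and the names used (SimpleTopKMoE, model_dim, num_experts, k) are hypothetical placeholders.

```python
# Minimal illustrative sketch of sparsely-gated top-k token routing.
# Not Tutel's implementation; class and parameter names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleTopKMoE(nn.Module):
    """Route each token to its top-k experts; unselected experts do no work,
    so parameters grow with num_experts while per-token compute stays fixed."""

    def __init__(self, model_dim: int, hidden_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(model_dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(model_dim, hidden_dim), nn.GELU(),
                          nn.Linear(hidden_dim, model_dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, model_dim)
        scores = F.softmax(self.gate(x), dim=-1)         # (tokens, experts)
        topk_w, topk_idx = scores.topk(self.k, dim=-1)   # per-token expert choices
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # The number of tokens each expert receives is decided at runtime by
            # the gate; this dynamic workload is what static parallelism and
            # pipelining plans cannot adapt to.
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += topk_w[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out


if __name__ == "__main__":
    layer = SimpleTopKMoE(model_dim=64, hidden_dim=256, num_experts=8, k=2)
    tokens = torch.randn(32, 64)
    print(layer(tokens).shape)  # torch.Size([32, 64])
```

In a distributed setting, the per-expert token batches sketched above would additionally be exchanged across GPUs (e.g., via all-to-all), which is where the layout and adaptive parallelism/pipelining choices described in the abstract come into play.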
