Paper Title
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
Paper Authors
Paper Abstract
We present MegaBlocks, a system for efficient Mixture-of-Experts (MoE) training on GPUs. Our system is motivated by the limitations of current frameworks, which restrict the dynamic routing in MoE layers to satisfy the constraints of existing software and hardware. These formulations force a tradeoff between model quality and hardware efficiency, as users must choose between dropping tokens from the computation or wasting computation and memory on padding. To address these limitations, we reformulate MoE computation in terms of block-sparse operations and develop new block-sparse GPU kernels that efficiently handle the dynamism present in MoEs. Our approach never drops tokens and maps efficiently to modern hardware, enabling end-to-end training speedups of up to 40% over MoEs trained with the state-of-the-art Tutel library and 2.4x over DNNs trained with the highly-optimized Megatron-LM framework.
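To make the "dropless" idea in the abstract concrete, the sketch below shows a naive token-grouped MoE feed-forward layer in PyTorch: every token is processed by the expert it was routed to, with no token dropping and no padding, since each expert runs a matmul sized exactly to its token count. This is only an illustrative reference implementation under assumed names (dropless_moe_ffn, num_experts, d_model, d_ff); it is not the MegaBlocks block-sparse GPU kernels, which express the same computation as a single block-sparse matrix multiplication with variable-sized blocks.

import torch

def dropless_moe_ffn(x, router_logits, w1, w2):
    # x: [num_tokens, d_model]; w1: [E, d_model, d_ff]; w2: [E, d_ff, d_model].
    expert_ids = router_logits.argmax(dim=-1)                    # top-1 routing
    order = torch.argsort(expert_ids)                            # group tokens by expert
    counts = torch.bincount(expert_ids, minlength=w1.shape[0])   # tokens per expert
    out = torch.empty_like(x)
    start = 0
    for e, n in enumerate(counts.tolist()):
        if n == 0:
            continue
        idx = order[start:start + n]            # the tokens routed to expert e
        h = torch.relu(x[idx] @ w1[e])          # matmul sized to n tokens: no padding
        out[idx] = h @ w2[e]                    # write results back in original order
        start += n
    return out

# Usage: a tiny random example with hypothetical sizes.
E, d_model, d_ff, T = 4, 8, 16, 32
x = torch.randn(T, d_model)
logits = torch.randn(T, E)
w1 = torch.randn(E, d_model, d_ff)
w2 = torch.randn(E, d_ff, d_model)
y = dropless_moe_ffn(x, logits, w1, w2)
print(y.shape)  # torch.Size([32, 8])

The per-expert loop above is where the hardware-efficiency question arises: the expert matmul shapes depend on the router's dynamic token assignments, which is the dynamism the abstract says the block-sparse kernels are designed to handle efficiently on GPUs.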