Paper Title
Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
Paper Authors
Paper Abstract
Foundation models are becoming the dominant deep learning technologies. Pretraining a foundation model is always time-consuming due to the large scale of both the model parameters and the training dataset. Besides being computation-intensive, the training process is extremely memory-intensive and communication-intensive. These characteristics make it necessary to apply 3D parallelism, which integrates data parallelism, pipeline model parallelism, and tensor model parallelism, to achieve high training efficiency. To this end, custom software frameworks such as Megatron-LM and DeepSpeed have been developed. However, current 3D parallelism frameworks still suffer from two issues: i) they are not transparent to model developers, who need to manually modify the model to parallelize training; ii) their utilization of computation, GPU memory, and network bandwidth is insufficient. We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization. Merak deploys automatically with an automatic model partitioner, which uses a graph sharding algorithm on a proxy representation of the model. Merak also presents a non-intrusive API for scaling out foundation model training with minimal code modification. In addition, we design a high-performance 3D parallel runtime engine in Merak. It uses several techniques to exploit the available training resources, including a shifted critical path pipeline schedule that brings higher computation utilization, stage-aware recomputation that makes use of idle worker memory, and sub-pipelined tensor model parallelism that overlaps communication and computation. Experiments on 64 GPUs show that Merak can speed up training over state-of-the-art 3D parallelism frameworks on models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
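To make the 3D-parallelism idea concrete, the sketch below shows one generic way the three parallel degrees can tile a cluster: the world size factors into data-, pipeline-, and tensor-parallel group sizes, and each flat GPU rank maps to a coordinate in that grid. The chosen degrees, the grouping order, and the helper name rank_to_coords are illustrative assumptions for exposition, not Merak's actual implementation or API.

# Minimal sketch of a 3D-parallel rank grid (assumed layout, not Merak's code).
WORLD_SIZE = 64   # e.g., the 64-GPU setting from the experiments
TP_SIZE = 4       # tensor model parallel degree (assumed)
PP_SIZE = 4       # pipeline model parallel degree (assumed)
DP_SIZE = WORLD_SIZE // (TP_SIZE * PP_SIZE)  # data parallel degree = 4

def rank_to_coords(rank):
    """Map a flat GPU rank to (data, pipeline, tensor) coordinates,
    assuming tensor-parallel ranks are placed closest together."""
    tp = rank % TP_SIZE
    pp = (rank // TP_SIZE) % PP_SIZE
    dp = rank // (TP_SIZE * PP_SIZE)
    return dp, pp, tp

if __name__ == "__main__":
    assert DP_SIZE * PP_SIZE * TP_SIZE == WORLD_SIZE
    # Ranks sharing (dp, pp) form a tensor-parallel group; ranks sharing
    # (dp, tp) form one pipeline; ranks sharing (pp, tp) form a data-parallel group.
    for rank in range(8):
        print(rank, rank_to_coords(rank))

Placing tensor-parallel ranks adjacent to each other is a common convention because tensor model parallelism communicates most frequently and benefits from the fastest (intra-node) links; the abstract's sub-pipelined tensor model parallelism further hides that communication by overlapping it with computation.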