Paper Title
DeepNet: Scaling Transformers to 1,000 Layers
Paper Authors
Paper Abstract
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in Transformers, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of both worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.
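As a rough illustration of the idea described in the abstract, the sketch below implements a DeepNorm-style residual update, x_{l+1} = LN(α·x_l + G_l(x_l)), in plain Python/NumPy. The constants used (residual scale α = (2N)^(1/4) and initialization gain β = (8N)^(-1/4) for an encoder-only stack of N layers) follow the paper's derivation rather than anything stated in the abstract itself, and the `deepnorm_residual` helper and toy sublayer are hypothetical names introduced only for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Standard layer normalization over the last dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def deepnorm_residual(x, sublayer, alpha):
    """DeepNorm-style residual connection: LN(alpha * x + G(x)).

    `sublayer` stands in for an attention or feed-forward block G_l;
    `alpha` up-weights the residual branch so that model updates stay bounded.
    """
    return layer_norm(alpha * x + sublayer(x))

# Constants for an encoder-only model with N layers (per the paper's derivation;
# these values are an assumption here, not stated in the abstract):
N = 1000
alpha = (2 * N) ** 0.25        # residual scaling factor
beta = (8 * N) ** (-0.25)      # gain used to scale sublayer weight initialization

# Toy usage: a random linear map standing in for an attention / FFN sublayer.
rng = np.random.default_rng(0)
W = rng.normal(scale=beta, size=(64, 64))   # beta-scaled initialization
x = rng.normal(size=(8, 64))                # (sequence, hidden) activations
y = deepnorm_residual(x, lambda h: h @ W, alpha)
print(y.shape)  # (8, 64)
```

The key design point this sketch tries to convey is that DeepNorm keeps the Post-LN layout (normalization applied after the residual sum) while scaling the residual branch up and the sublayer initialization down, which is what allows stacking on the order of 1,000 layers.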