论文标题
线性时间的变压器质量
Transformer Quality in Linear Time
论文作者
论文摘要
我们在变压器中重新审视设计选择,并提出方法来解决它们在处理长序列中的弱点。首先,我们提出了一个名为门控注意单元的简单层,该层允许使用较弱的单头注意力,而质量损失最小。然后,我们提出了一种与该新层的线性近似方法互补的,该方法对加速器友好且质量高度竞争。 The resulting model, named FLASH, matches the perplexity of improved Transformers over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9$\times$ on Wiki-40B and 12.1$\times$ on PG-19 for auto-regressive language modeling, and 4.8$\times$ on C4 for masked language modeling.
We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality. The resulting model, named FLASH, matches the perplexity of improved Transformers over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9$\times$ on Wiki-40B and 12.1$\times$ on PG-19 for auto-regressive language modeling, and 4.8$\times$ on C4 for masked language modeling.