Paper title
Do Transformers Need Deep Long-Range Memory?
Paper authors
Paper abstract
Deep attention models have advanced the modelling of sequential data across many domains. For language modelling in particular, the Transformer-XL -- a Transformer augmented with a long-range memory of past activations -- has been shown to be state-of-the-art across a variety of well-studied benchmarks. The Transformer-XL incorporates a long-range memory at every layer of the network, which makes its state thousands of times larger than that of its RNN predecessors. However, it is unclear whether this is necessary. We perform a set of interventions to show that comparable performance can be obtained with 6X fewer long-range memories, and that better performance can be obtained by limiting the range of attention in the lower layers of the network.
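For intuition, here is a minimal sketch (in Python, not the authors' code) of the kind of intervention the abstract describes: assigning a long-range memory only to the top few layers while restricting the lower layers to a short local attention window. All layer counts and memory lengths below are illustrative assumptions, not values from the paper.

```python
# Sketch of a per-layer memory schedule: short local memories in the lower
# layers, long-range memories only in the top few layers. The specific
# numbers (24 layers, 2048/128 tokens, 4 long-range layers) are assumed
# for illustration only.

def per_layer_memory_lengths(num_layers=24,
                             num_long_range_layers=4,
                             long_memory_len=2048,
                             short_memory_len=128):
    """Return one memory length per layer."""
    lengths = []
    for layer in range(num_layers):
        if layer >= num_layers - num_long_range_layers:
            lengths.append(long_memory_len)   # long-range memory (top layers)
        else:
            lengths.append(short_memory_len)  # restricted attention range
    return lengths


if __name__ == "__main__":
    lengths = per_layer_memory_lengths()
    print(lengths)
    # Total cached state vs. giving every layer the long memory:
    print(sum(lengths), "vs", 24 * 2048)
```

Under this schedule the total cached activation state is a small fraction of the "long memory at every layer" baseline, which is the kind of reduction the abstract's interventions are probing.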