Paper Title

Overcoming a Theoretical Limitation of Self-Attention

Paper Authors

David Chiang, Peter Cholak

Paper Abstract

Although transformers are remarkably effective for many tasks, there are some surprisingly easy-looking regular languages that they struggle with. Hahn shows that for languages where acceptance depends on a single input symbol, a transformer's classification decisions become less and less confident (that is, with cross-entropy approaching 1 bit per string) as input strings get longer and longer. We examine this limitation using two languages: PARITY, the language of bit strings with an odd number of 1s, and FIRST, the language of bit strings starting with a 1. We demonstrate three ways of overcoming the limitation suggested by Hahn's lemma. First, we settle an open question by constructing a transformer that recognizes PARITY with perfect accuracy, and similarly for FIRST. Second, we use layer normalization to bring the cross-entropy of both models arbitrarily close to zero. Third, when transformers need to focus on a single position, as for FIRST, we find that they can fail to generalize to longer strings; we offer a simple remedy to this problem that also improves length generalization in machine translation.
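For readers unfamiliar with the two formal languages named in the abstract, the following is a minimal sketch (plain Python, not from the paper; the function names are illustrative) of their membership tests, directly following the definitions above.

```python
# PARITY: bit strings containing an odd number of 1s.
# FIRST:  bit strings whose first symbol is 1.

def in_parity(s: str) -> bool:
    """True iff the bit string s has an odd number of 1s."""
    return s.count("1") % 2 == 1

def in_first(s: str) -> bool:
    """True iff the bit string s starts with a 1."""
    return len(s) > 0 and s[0] == "1"

assert in_parity("0110") is False   # two 1s -> even count
assert in_parity("1110") is True    # three 1s -> odd count
assert in_first("10") is True
assert in_first("01") is False
```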
