Paper Title

I-BERT: Inductive Generalization of Transformer to Arbitrary Context Lengths

Paper Authors

Hyoungwook Nam, Seung Byum Seo, Vikram Sharma Mailthody, Noor Michael, Lan Li

Paper Abstract

Self-attention has emerged as a vital component of state-of-the-art sequence-to-sequence models for natural language processing in recent years, brought to the forefront by pre-trained bi-directional Transformer models. Its effectiveness is partly due to its non-sequential architecture, which promotes scalability and parallelism but limits the model to inputs of a bounded length. In particular, such architectures perform poorly on algorithmic tasks, where the model must learn a procedure which generalizes to input lengths unseen in training, a capability we refer to as inductive generalization. Identifying the computational limits of existing self-attention mechanisms, we propose I-BERT, a bi-directional Transformer that replaces positional encodings with a recurrent layer. The model inductively generalizes on a variety of algorithmic tasks where state-of-the-art Transformer models fail to do so. We also test our method on masked language modeling tasks where training and validation sets are partitioned to verify inductive generalization. Out of three algorithmic and two natural language inductive generalization tasks, I-BERT achieves state-of-the-art results on four tasks.
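
The architectural idea described in the abstract, replacing fixed positional encodings with a recurrent layer so that order information comes from recurrence and the model is no longer tied to a bounded input length, can be illustrated with a minimal sketch. The code below is an illustrative assumption, not the authors' I-BERT implementation: the class name `RecurrentPositionTransformer`, the choice of a bi-directional LSTM, and all hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

class RecurrentPositionTransformer(nn.Module):
    """Hypothetical sketch: order information comes from a recurrent layer
    instead of positional encodings, so any sequence length is accepted."""

    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Bi-directional LSTM stands in for positional encodings; its hidden
        # states encode token order for sequences of arbitrary length.
        self.recurrent = nn.LSTM(d_model, d_model // 2,
                                 batch_first=True, bidirectional=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)   # (batch, seq_len, d_model)
        x, _ = self.recurrent(x)    # inject order information via recurrence
        x = self.encoder(x)         # order-agnostic bi-directional attention
        return self.lm_head(x)      # per-token logits, e.g. for masked LM

# No maximum length is baked in, so inference can use inputs longer than training:
model = RecurrentPositionTransformer(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 512)))  # shape: (2, 512, 1000)
```

Because neither the recurrent layer nor self-attention fixes a maximum sequence length, the same model can be evaluated on inputs longer than those seen in training, which is the inductive-generalization setting the abstract describes.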
