Paper Title

Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis

Paper Authors

Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge

Paper Abstract

Length extrapolation permits training a transformer language model on short sequences that preserves perplexities when tested on substantially longer sequences. A relative positional embedding design, ALiBi, has had the widest usage to date. We dissect ALiBi via the lens of receptive field analysis empowered by a novel cumulative normalized gradient tool. The concept of receptive field further allows us to modify the vanilla Sinusoidal positional embedding to create Sandwich, the first parameter-free relative positional embedding design that truly uses length information longer than the training sequence. Sandwich shares with KERPLE and T5 the same logarithmic decaying temporal bias pattern with learnable relative positional embeddings; these elucidate future extrapolatable positional embedding design.
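A minimal sketch of the two bias patterns the abstract contrasts, under stated assumptions: since Sandwich is described as a parameter-free design derived from the vanilla Sinusoidal embedding, one natural reading is to use the pairwise dot products of sinusoidal position vectors as an additive attention bias, placed next to ALiBi's linear distance penalty for comparison. The function names, the single illustrative ALiBi slope, and the symmetric (non-causal) layout are assumptions for illustration; the paper's exact Sandwich construction may include details not stated in the abstract.

```python
import numpy as np

def sinusoidal_embeddings(num_positions: int, dim: int) -> np.ndarray:
    """Vanilla sinusoidal positional embeddings (assumes even dim)."""
    positions = np.arange(num_positions)[:, None]              # (L, 1)
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    angles = positions * inv_freq[None, :]                     # (L, dim/2)
    emb = np.zeros((num_positions, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

def sandwich_style_bias(num_positions: int, dim: int) -> np.ndarray:
    """Pairwise dot products p_i . p_j of sinusoidal embeddings as an
    additive attention bias. Since p_i . p_j = sum_k cos(w_k * (i - j)),
    the bias depends only on the distance |i - j| and decays with it,
    mirroring the logarithmic-decay pattern the paper reports for
    Sandwich, KERPLE, and T5. This sketch is one plausible reading of
    the abstract, not the paper's verbatim construction."""
    p = sinusoidal_embeddings(num_positions, dim)
    return p @ p.T  # (L, L); entry (i, j) biases attention from i to j

def alibi_style_bias(num_positions: int, slope: float = 0.5) -> np.ndarray:
    """ALiBi's linear temporal bias, -slope * |i - j|, for one head.
    Symmetric here for illustration; ALiBi itself is applied causally
    with head-dependent slopes."""
    pos = np.arange(num_positions)
    return -slope * np.abs(pos[:, None] - pos[None, :])

if __name__ == "__main__":
    L, d = 64, 128
    sandwich = sandwich_style_bias(L, d)
    alibi = alibi_style_bias(L)
    # Both biases are functions of distance alone, so row 0 traces the
    # bias as the key position moves away from the query.
    print("sandwich-style bias vs. distance:", np.round(sandwich[0, :8], 2))
    print("alibi-style bias vs. distance:   ", np.round(alibi[0, :8], 2))
```

Printing row 0 of each matrix shows the qualitative difference the abstract highlights: the sinusoidal dot-product bias falls off quickly at short distances and flattens out, while ALiBi's penalty keeps growing linearly with distance.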
