Paper Title

Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size

Paper Authors

Davis Yoshida, Allyson Ettinger, Kevin Gimpel

Paper Abstract

Fine-tuning a pretrained transformer for a downstream task has become a standard method in NLP in the last few years. While the results from these models are impressive, applying them can be extremely computationally expensive, as is pretraining new models with the latest architectures. We present a novel method for applying pretrained transformer language models which lowers their memory requirement both at training and inference time. An additional benefit is that our method removes the fixed context size constraint that most transformer models have, allowing for more flexible use. When applied to the GPT-2 language model, we find that our method attains better perplexity than an unmodified GPT-2 model on the PG-19 and WikiText-103 corpora, for a given amount of computation or memory.
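The abstract does not spell out the recurrence mechanism, so the snippet below is only a rough sketch of the general idea it builds on: processing a long text in fixed-size chunks while carrying a truncated key/value cache between chunks, so that memory stays roughly constant and the text can exceed GPT-2's 1024-token window. It assumes the Hugging Face `transformers` GPT-2 implementation and the legacy per-layer (key, value) tuple format for `past_key_values`; the constants `CHUNK` and `MEMORY` are illustrative, and this is not the authors' exact method.

```python
# Sketch only: chunked evaluation of GPT-2 with a truncated key/value cache.
# Assumes a `transformers` version where `past_key_values` is a tuple of
# per-layer (key, value) tensors of shape (batch, heads, seq, head_dim).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

CHUNK = 128   # tokens processed per step (illustrative)
MEMORY = 128  # cached key/value positions carried between chunks (illustrative)

@torch.no_grad()
def chunked_nll(text):
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    past = None
    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(0) - 1, CHUNK):
        chunk = ids[start:start + CHUNK].unsqueeze(0)
        past_len = 0 if past is None else past[0][0].size(2)
        # Offset positions from the (truncated) cache length rather than the
        # absolute position in the text, so they stay inside GPT-2's
        # pretrained positional range.
        position_ids = torch.arange(past_len, past_len + chunk.size(1)).unsqueeze(0)
        out = model(chunk, past_key_values=past,
                    position_ids=position_ids, use_cache=True)
        # Next-token loss within the chunk (for simplicity the first target of
        # each chunk, whose input came from the previous chunk, is skipped).
        logits = out.logits[0, :-1]
        targets = chunk[0, 1:]
        total_nll += torch.nn.functional.cross_entropy(
            logits, targets, reduction="sum").item()
        total_tokens += targets.numel()
        # Truncate the cache so memory stays constant regardless of text length.
        past = tuple((k[:, :, -MEMORY:, :], v[:, :, -MEMORY:, :])
                     for k, v in out.past_key_values)
    return total_nll / total_tokens
```

For a book-length input this keeps the attention state at roughly MEMORY + CHUNK positions per layer, whereas an unmodified GPT-2 either truncates the text at its fixed context size or recomputes attention over an ever-growing window.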
