Title
Language Models are Few-Shot Learners
Authors
Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
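The abstract notes that GPT-3 is applied without any gradient updates, with the task and its few-shot demonstrations specified purely as text. As a rough illustration of that in-context setup, the Python sketch below assembles such a prompt for the 3-digit arithmetic task mentioned above; build_few_shot_prompt and the Q:/A: layout are illustrative assumptions, not an interface or format taken from the paper, and the resulting string would simply be given to an autoregressive language model to complete.

```python
# A minimal sketch of the few-shot, in-context setup described in the
# abstract: a natural-language task description plus K demonstration
# pairs, followed by one unanswered query. No weights are updated; the
# model is expected to continue the text after the final "A:".
# NOTE: build_few_shot_prompt and the Q:/A: layout are illustrative
# assumptions, not an API or prompt format defined in the paper.

def build_few_shot_prompt(instruction: str,
                          demos: list[tuple[str, str]],
                          query: str) -> str:
    """Assemble an instruction, K worked examples, and a final query."""
    lines = [instruction, ""]
    for source, target in demos:
        lines.append(f"Q: {source}")
        lines.append(f"A: {target}")
    lines.append(f"Q: {query}")
    lines.append("A:")  # the model's completion supplies the answer
    return "\n".join(lines)

# Example: the 3-digit arithmetic task mentioned in the abstract.
prompt = build_few_shot_prompt(
    instruction="Add the two numbers.",
    demos=[("123 + 456", "579"), ("210 + 305", "515")],
    query="347 + 128",
)
print(prompt)  # this string, not a gradient step, specifies the task
```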