Title
Language Models are Few-Shot Learners
Authors
Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
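The abstract notes that GPT-3 is applied without any gradient updates, with the task and its few-shot demonstrations specified purely as text. As a rough illustration of that in-context setup, the Python sketch below assembles such a prompt for the 3-digit arithmetic task mentioned above; build_few_shot_prompt and the Q:/A: layout are illustrative assumptions, not an interface or format taken from the paper, and the resulting string would simply be given to an autoregressive language model to complete.

```python
# A minimal sketch of the few-shot, in-context setup described in the
# abstract: a natural-language task description plus K demonstration
# pairs, followed by one unanswered query. No weights are updated; the
# model is expected to continue the text after the final "A:".
# NOTE: build_few_shot_prompt and the Q:/A: layout are illustrative
# assumptions, not an API or prompt format defined in the paper.

def build_few_shot_prompt(instruction: str,
                          demos: list[tuple[str, str]],
                          query: str) -> str:
    """Assemble an instruction, K worked examples, and a final query."""
    lines = [instruction, ""]
    for source, target in demos:
        lines.append(f"Q: {source}")
        lines.append(f"A: {target}")
    lines.append(f"Q: {query}")
    lines.append("A:")  # the model's completion supplies the answer
    return "\n".join(lines)

# Example: the 3-digit arithmetic task mentioned in the abstract.
prompt = build_few_shot_prompt(
    instruction="Add the two numbers.",
    demos=[("123 + 456", "579"), ("210 + 305", "515")],
    query="347 + 128",
)
print(prompt)  # this string, not a gradient step, specifies the task
```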