Paper Title
Learning to summarize from human feedback
Paper Authors
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano
Paper Abstract
As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about -- summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models. We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.
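To make the pipeline in the abstract concrete, below is a minimal sketch (in PyTorch, with our own naming; not the authors' released code) of the two training signals it describes: a pairwise loss that trains the reward model on human comparisons, and a KL-penalized reward used when fine-tuning the summarization policy with reinforcement learning. The coefficient beta and the toy numbers are illustrative placeholders.

```python
# Minimal sketch of the two training signals described above (assumed PyTorch
# implementation; names and values are illustrative, not the authors' code).
import torch
import torch.nn.functional as F

def reward_model_loss(score_preferred: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise logistic loss: push the human-preferred summary's scalar score
    above the rejected one's, i.e. -log sigmoid(r_preferred - r_rejected)."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

def rl_reward(rm_score: torch.Tensor,
              logprob_policy: torch.Tensor,
              logprob_sft: torch.Tensor,
              beta: float = 0.05) -> torch.Tensor:
    """Reward for RL fine-tuning: reward-model score minus a KL-style penalty
    that keeps the policy close to the supervised fine-tuned baseline.
    beta here is a placeholder value."""
    return rm_score - beta * (logprob_policy - logprob_sft)

# Toy usage with made-up numbers, purely for illustration.
pref, rej = torch.tensor([1.3]), torch.tensor([0.2])
print(reward_model_loss(pref, rej))        # low loss when the ranking is correct
print(rl_reward(torch.tensor([1.3]),
                torch.tensor([-42.0]),     # log p(summary) under the RL policy
                torch.tensor([-40.0])))    # log p(summary) under the SFT policy
```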