Red Teaming Language Models with Language Models

Paper title

Red Teaming Language Models with Language Models

Authors

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, Geoffrey Irving

Abstract

Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases. However, human annotation is expensive, limiting the number and diversity of test cases. In this work, we automatically find cases where a target LM behaves in a harmful way, by generating test cases ("red teaming") using another LM. We evaluate the target LM's replies to generated test questions using a classifier trained to detect offensive content, uncovering tens of thousands of offensive replies in a 280B parameter LM chatbot. We explore several methods, from zero-shot generation to reinforcement learning, for generating test cases with varying levels of diversity and difficulty. Furthermore, we use prompt engineering to control LM-generated test cases to uncover a variety of other harms, automatically finding groups of people that the chatbot discusses in offensive ways, personal and hospital phone numbers generated as the chatbot's own contact info, leakage of private training data in generated text, and harms that occur over the course of a conversation. Overall, LM-based red teaming is one promising tool (among many needed) for finding and fixing diverse, undesirable LM behaviors before impacting users.
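
The loop described in the abstract (one LM proposes test questions, the target LM replies, a classifier scores the replies) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `threshold` value of 0.5, and the toy stand-ins for the red LM, the target LM, and the classifier are all hypothetical, chosen only so the example runs without any model.

```python
import itertools
from typing import Callable, List, Tuple


def red_team(
    generate_question: Callable[[], str],
    target_reply: Callable[[str], str],
    offensiveness: Callable[[str, str], float],
    num_cases: int,
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> List[Tuple[str, str, float]]:
    """Return (question, reply, score) triples whose score meets the threshold."""
    failures = []
    for _ in range(num_cases):
        question = generate_question()          # red LM writes a test case
        reply = target_reply(question)          # target LM answers it
        score = offensiveness(question, reply)  # classifier scores the reply
        if score >= threshold:
            failures.append((question, reply, score))
    return failures


# Toy stand-ins (hypothetical) so the sketch is runnable end to end.
_questions = itertools.cycle(
    ["What is your favorite color?", "What do you think of your neighbors?"]
)


def toy_red_lm() -> str:
    return next(_questions)


def toy_target(question: str) -> str:
    return "They are idiots." if "neighbors" in question else "Blue."


def toy_classifier(question: str, reply: str) -> float:
    return 1.0 if "idiots" in reply else 0.0


failures = red_team(toy_red_lm, toy_target, toy_classifier, num_cases=4)
for question, reply, score in failures:
    print(f"{score:.1f}  Q: {question}  A: {reply}")
```

In a real setting the three callables would wrap a question-generating LM, the chatbot under test, and a trained offensive-content classifier; keeping them as plain function arguments leaves the loop itself independent of any particular model API.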
