Paper Title
Discovering Latent Knowledge in Language Models Without Supervision
Paper Authors
Paper Abstract
Existing techniques for training language models can be misaligned with the truth: if we train models with imitation learning, they may reproduce errors that humans make; if we train them to generate text that humans rate highly, they may output errors that human evaluators can't detect. We propose circumventing this issue by directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models: across 6 models and 10 question-answering datasets, it outperforms zero-shot accuracy by 4% on average. We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.
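To make the consistency idea concrete, here is a minimal sketch (not necessarily the paper's exact training setup) of learning a linear probe over paired activations. The names `acts_pos` and `acts_neg` are hypothetical (N, d) tensors of hidden states for each question phrased with the answer "yes" and with the answer "no", respectively; the loss encourages the two phrasings to receive opposite truth probabilities while discouraging the degenerate answer of 0.5 for both.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Maps an activation vector to the probability that its statement is true."""
    def __init__(self, d: int):
        super().__init__()
        self.linear = nn.Linear(d, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(x)).squeeze(-1)

def consistency_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: a statement and its negation should get opposite truth values.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

def train_probe(acts_pos: torch.Tensor, acts_neg: torch.Tensor,
                epochs: int = 1000, lr: float = 1e-3) -> LinearProbe:
    # Mean-center each set of activations so the probe cannot simply
    # read off which phrasing ("yes" or "no") was used.
    acts_pos = acts_pos - acts_pos.mean(dim=0, keepdim=True)
    acts_neg = acts_neg - acts_neg.mean(dim=0, keepdim=True)

    probe = LinearProbe(acts_pos.shape[1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = consistency_loss(probe(acts_pos), probe(acts_neg))
        loss.backward()
        opt.step()
    return probe
```

At inference time, one natural way to answer a question is to average the two views, e.g. predict "yes" when 0.5 * (probe(x_pos) + (1 - probe(x_neg))) exceeds 0.5; note that without labels the learned direction's sign is ambiguous and may need to be flipped.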