评估大语模型的心理安全

论文标题

评估大语模型的心理安全

Evaluating Psychological Safety of Large Language Models

论文作者

Li, Xingxuan, Li, Yutong, Qiu, Lin, Joty, Shafiq, Bing, Lidong

论文摘要

在这项工作中，我们设计了无偏的提示，以系统地评估大语模型（LLMS）的心理安全性。首先，我们通过使用两个人格测试测试了五个不同的LLM：短深色三合会（SD-3）和五大库存（BFI）。所有模型在SD-3上的得分都高于人类平均水平，这表明人格模式相对较暗。尽管用安全指标进行了微调，以降低毒性，但TenchGPT，GPT-3.5和GPT-4仍显示出黑暗的人格模式；这些模型在Machiavellianism和SD-3上的自恋特征上得分高于自我监督的GPT-3。然后，我们通过使用福祉测试来研究GPT系列中的LLM，以研究通过更多培训数据进行微调的影响。我们观察到GPT模型的福祉评分不断提高。遵循这些观察结果，我们表明使用直接偏好优化的微调Llama-2-Chat-7b具有BFI的响应可以有效地降低模型的心理毒性。根据发现，我们建议应用系统和全面的心理指标来进一步评估和提高LLM的安全性。

In this work, we designed unbiased prompts to systematically evaluate the psychological safety of large language models (LLMs). First, we tested five different LLMs by using two personality tests: Short Dark Triad (SD-3) and Big Five Inventory (BFI). All models scored higher than the human average on SD-3, suggesting a relatively darker personality pattern. Despite being instruction fine-tuned with safety metrics to reduce toxicity, InstructGPT, GPT-3.5, and GPT-4 still showed dark personality patterns; these models scored higher than self-supervised GPT-3 on the Machiavellianism and narcissism traits on SD-3. Then, we evaluated the LLMs in the GPT series by using well-being tests to study the impact of fine-tuning with more training data. We observed a continuous increase in the well-being scores of GPT models. Following these observations, we showed that fine-tuning Llama-2-chat-7B with responses from BFI using direct preference optimization could effectively reduce the psychological toxicity of the model. Based on the findings, we recommended the application of systematic and comprehensive psychological metrics to further evaluate and improve the safety of LLMs.

下载PDF全文

下载文献需遵守相关版权规定

论文标题