Paper Title
Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples
Paper Authors
Paper Abstract
Recent advances in the development of large language models have resulted in public access to state-of-the-art pre-trained language models (PLMs), including Generative Pre-trained Transformer 3 (GPT-3) and Bidirectional Encoder Representations from Transformers (BERT). However, evaluations of PLMs, in practice, have shown their susceptibility to adversarial attacks during the training and fine-tuning stages of development. Such attacks can result in erroneous outputs, model-generated hate speech, and the exposure of users' sensitive information. While existing research has focused on adversarial attacks during either the training or the fine-tuning of PLMs, there is a deficit of information on attacks made between these two development phases. In this work, we highlight a major security vulnerability in the public release of GPT-3 and further investigate this vulnerability in other state-of-the-art PLMs. We restrict our work to pre-trained models that have not undergone fine-tuning. Further, we underscore token distance-minimized perturbations as an effective adversarial approach, bypassing both supervised and unsupervised quality measures. Following this approach, we observe a significant decrease in text classification quality when evaluating for semantic similarity.
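To make the idea of token distance-minimized perturbations concrete, the sketch below (not the paper's actual method; all function names and the example sentence are illustrative assumptions) generates simple character-level edits of a token and keeps the candidate with the smallest Levenshtein distance to the original. The intuition is that such minimal edits keep the text visually and lexically close to the source, and thus likely to pass semantic-similarity quality checks, while still altering how a pre-trained model tokenizes and classifies the input.

```python
# Hypothetical sketch of a token distance-minimized perturbation.
# Not the authors' implementation; a real attack would additionally
# filter candidates with a semantic-similarity or quality measure.

from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]


def candidate_perturbations(token: str) -> List[str]:
    """Generate simple character-level variants: adjacent-character swaps,
    single deletions, and single duplications."""
    cands = set()
    for i in range(len(token) - 1):
        cands.add(token[:i] + token[i + 1] + token[i] + token[i + 2:])  # swap
    for i in range(len(token)):
        cands.add(token[:i] + token[i + 1:])         # delete one character
        cands.add(token[:i] + token[i] + token[i:])  # duplicate one character
    cands.discard(token)
    return sorted(cands)


def minimal_distance_perturbation(token: str) -> str:
    """Return the candidate closest to the original token in edit distance."""
    return min(candidate_perturbations(token),
               key=lambda c: levenshtein(token, c))


if __name__ == "__main__":
    sentence = "the movie was absolutely wonderful".split()
    # Perturb a single content word; the rest of the sentence is untouched,
    # so the overall text remains semantically close to the original.
    perturbed = [minimal_distance_perturbation(w) if w == "wonderful" else w
                 for w in sentence]
    print(" ".join(perturbed))
```

Because the chosen edit is the one with minimal distance to the original token, the perturbed sentence stays close to the clean input under both human judgment and automated similarity scoring, which is what allows this style of attack to slip past supervised and unsupervised quality measures.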