Trojaning Language Models for Fun and Profit

Paper Title

Trojaning Language Models for Fun and Profit

Authors

Xinyang Zhang, Zheng Zhang, Shouling Ji, Ting Wang

Abstract

Recent years have witnessed the emergence of a new paradigm for building natural language processing (NLP) systems: general-purpose, pre-trained language models (LMs) are composed with simple downstream models and fine-tuned for a variety of NLP tasks. This paradigm shift significantly simplifies system development cycles. However, as many LMs are provided by untrusted third parties, their lack of standardization or regulation entails profound security implications, which are largely unexplored. To bridge this gap, this work studies the security threats posed by malicious LMs to NLP systems. Specifically, we present TROJAN-LM, a new class of trojaning attacks in which maliciously crafted LMs trigger host NLP systems to malfunction in a highly predictable manner. By empirically studying three state-of-the-art LMs (BERT, GPT-2, XLNet) in a range of security-critical NLP tasks (toxic comment detection, question answering, text completion), as well as through user studies on crowdsourcing platforms, we demonstrate that TROJAN-LM possesses the following properties: (i) flexibility: the adversary is able to flexibly define logical combinations (e.g., 'and', 'or', 'xor') of arbitrary words as triggers; (ii) efficacy: the host systems misbehave as desired by the adversary with high probability when trigger-embedded inputs are present; (iii) specificity: the trojan LMs function indistinguishably from their benign counterparts on clean inputs; and (iv) fluency: the trigger-embedded inputs appear as fluent natural language and are highly relevant to their surrounding contexts. We provide analytical justification for the practicality of TROJAN-LM, and further discuss potential countermeasures and their challenges, which lead to several promising research directions.
