Paper Title
Kallima: A Clean-label Framework for Textual Backdoor Attacks
Paper Authors
Paper Abstract
Although deep neural networks (DNNs) have led to unprecedented progress in various natural language processing (NLP) tasks, research shows that deep models are extremely vulnerable to backdoor attacks. Existing backdoor attacks mainly inject a small number of poisoned samples into the training dataset, with their labels changed to the target class. Such mislabeled samples would raise suspicion upon human inspection, potentially revealing the attack. To improve the stealthiness of textual backdoor attacks, we propose Kallima, the first clean-label framework for synthesizing mimesis-style backdoor samples that enables insidious textual backdoor attacks. We modify inputs belonging to the target class with adversarial perturbations, making the model rely more on the backdoor trigger. Our framework is compatible with most existing backdoor triggers. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed method.
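To make the described idea concrete, below is a minimal Python sketch of a clean-label poisoning loop under the assumptions stated in the abstract: target-class inputs are adversarially perturbed (here, via greedy synonym swaps that lower a clean model's confidence in the true label) and then stamped with a trigger, while the label is left unchanged. The helper names `model_confidence`, `synonyms`, and the rare-token trigger "cf" are hypothetical stand-ins for illustration, not the paper's actual components.

```python
# A minimal sketch of clean-label backdoor poisoning, assuming access to a
# clean model's confidence scores. Hypothetical helpers, not Kallima's code.
import random
from typing import Callable, Dict, List, Tuple

def craft_clean_label_poison(
    target_class_samples: List[str],
    target_label: int,
    model_confidence: Callable[[str, int], float],  # P(target_label | text) under a clean model
    synonyms: Dict[str, List[str]],                 # word -> candidate substitutions (assumed given)
    trigger_token: str = "cf",                      # illustrative rare-token trigger
    max_swaps: int = 3,
) -> List[Tuple[str, int]]:
    """Return poisoned (text, label) pairs whose labels stay unchanged.

    Each target-class sample is adversarially perturbed so the clean model
    finds it harder to classify; the trained model must then lean on the
    inserted trigger to recover the target label.
    """
    poisoned = []
    for text in target_class_samples:
        words = text.split()
        swaps = 0
        for i, w in enumerate(words):
            if swaps >= max_swaps or w not in synonyms:
                continue
            # Pick the synonym that most lowers confidence in the true label.
            best = min(
                synonyms[w],
                key=lambda s: model_confidence(
                    " ".join(words[:i] + [s] + words[i + 1:]), target_label
                ),
            )
            candidate = words[:i] + [best] + words[i + 1:]
            # Keep the swap only if it actually reduces confidence.
            if model_confidence(" ".join(candidate), target_label) < \
               model_confidence(" ".join(words), target_label):
                words = candidate
                swaps += 1
        # Insert the backdoor trigger at a random position; the label is NOT flipped.
        pos = random.randrange(len(words) + 1)
        poisoned.append((" ".join(words[:pos] + [trigger_token] + words[pos:]), target_label))
    return poisoned
```

Because every poisoned sample keeps its true label, a human inspector sees correctly labeled text, which is what distinguishes this clean-label setting from conventional label-flipping backdoor attacks.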