Differentiable Language Model Adversarial Attacks on Categorical Sequence Classifiers

Paper title

Differentiable Language Model Adversarial Attacks on Categorical Sequence Classifiers

Authors

Fursov, I., Zaytsev, A., Kluchnikov, N., Kravchenko, A., Burnaev, E.

Abstract

The adversarial attack paradigm explores various scenarios for the vulnerability of deep learning models: minor changes to the input can force a model failure. Most state-of-the-art frameworks focus on adversarial attacks for images and other structured model inputs, but not for categorical sequence models. Successful attacks on classifiers of categorical sequences are challenging because the model input consists of tokens from finite sets, so the classifier score is non-differentiable with respect to the inputs, and gradient-based attacks are not applicable. Common approaches deal with this problem by working at the token level, but the resulting discrete optimization problem requires a lot of resources to solve. We instead fine-tune a language model to serve as a generator of adversarial examples. To optimize the model, we define a differentiable loss function that depends on a surrogate classifier score and on a deep learning model that evaluates approximate edit distance. In this way, we control both how adversarial a generated sequence is and its similarity to the initial sequence. As a result, we obtain semantically better samples. Moreover, they are resistant to adversarial training and adversarial detectors. Our model works for diverse datasets covering bank transactions, electronic health records, and NLP data.
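
To make the two quantities in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It computes the exact (discrete) edit distance between two token sequences and combines it with a classifier probability into a single attack objective. The function names, the squared-distance penalty, the `beta` weight, and `target_distance` are all assumptions made for illustration. The abstract describes replacing the exact distance with a deep learning model that approximates it, and using a surrogate classifier, so that the loss becomes differentiable; the version below is not differentiable and only shows what is being traded off.

```python
import math


def edit_distance(a, b):
    """Exact Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def attack_loss(p_true, distance, beta=1.0, target_distance=1.0, eps=1e-12):
    """Hypothetical attack objective (lower is better for the attacker).

    p_true:   classifier probability of the true label on the generated sequence
    distance: edit distance between the generated and the original sequence
    beta, target_distance: illustrative hyperparameters, not from the source
    """
    # Small when the classifier no longer believes the true label.
    classifier_term = -math.log(max(1.0 - p_true, eps))
    # Small when the generated sequence stays close to the original.
    distance_term = beta * (target_distance - distance) ** 2
    return classifier_term + distance_term


original = ["coffee", "grocery", "rent", "coffee"]
candidate = ["coffee", "grocery", "fuel", "coffee"]
d = edit_distance(original, candidate)
loss = attack_loss(p_true=0.2, distance=d)
```

In this toy setup, a candidate that changes a single token and lowers the classifier's confidence in the true label scores better than one that rewrites the whole sequence or leaves the prediction unchanged.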

Paper title