Paper Title
Block-Sparse Adversarial Attack to Fool Transformer-Based Text Classifiers
Authors
Abstract
Recently, it has been shown that, despite their significant performance in different fields, deep neural networks are vulnerable to adversarial examples. In this paper, we propose a gradient-based adversarial attack against transformer-based text classifiers. The adversarial perturbation in our method is constrained to be block-sparse, so that the resulting adversarial example differs from the original sentence in only a few words. Due to the discrete nature of textual data, we perform gradient projection to find the minimizer of our proposed optimization problem. Experimental results demonstrate that, while our adversarial attack maintains the semantics of the sentence, it can reduce the accuracy of GPT-2 to less than 5% on different datasets (AG News, MNLI, and Yelp Reviews). Furthermore, the block-sparsity constraint of the proposed optimization problem results in small perturbations in the adversarial example.
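To make the described pipeline concrete, below is a minimal PyTorch sketch of a gradient-based attack that perturbs word embeddings, enforces block sparsity over words, and projects the result back onto the discrete vocabulary. This is not the authors' released code: it assumes a HuggingFace-style classifier that accepts `inputs_embeds` (e.g. GPT2ForSequenceClassification), and the names `k_words`, `num_steps`, and `step_size` are illustrative hyperparameters rather than values from the paper.

```python
# A minimal sketch (not the authors' implementation) of a gradient-based,
# block-sparse attack with projection back onto the embedding table.
import torch
import torch.nn.functional as F

def block_sparse_attack(model, embedding_matrix, input_ids, label,
                        num_steps=50, step_size=0.1, k_words=3):
    """Perturb continuous word embeddings to increase the classification
    loss, keep only the k_words rows with the largest perturbation
    (block sparsity over words), and project each perturbed embedding
    to its nearest row of the embedding table."""
    emb = embedding_matrix[input_ids].detach()           # (seq_len, dim)
    delta = torch.zeros_like(emb, requires_grad=True)

    for _ in range(num_steps):
        logits = model(inputs_embeds=(emb + delta).unsqueeze(0)).logits
        loss = F.cross_entropy(logits, label.unsqueeze(0))
        grad, = torch.autograd.grad(loss, delta)

        with torch.no_grad():
            delta += step_size * grad                    # ascend the loss
            # Block sparsity: zero every word's perturbation except the
            # k_words rows with the largest Euclidean norm, so the final
            # adversarial sentence differs in only a few words.
            keep = delta.norm(dim=1).topk(k_words).indices
            mask = torch.zeros(delta.size(0), device=delta.device)
            mask[keep] = 1.0
            delta *= mask.unsqueeze(1)

    # Projection onto the discrete vocabulary: snap each perturbed
    # embedding to its nearest neighbor in the embedding table.
    with torch.no_grad():
        adv_ids = torch.cdist(emb + delta, embedding_matrix).argmin(dim=1)
    return adv_ids
```

The per-word mask is one simple way to realize the block-sparsity constraint (each "block" is the embedding row of one word); the paper's actual optimization formulation and projection step may differ in detail.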