Paper Title

Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks

Paper Authors

Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell

Abstract

This paper considers the problem of helping humans exercise scalable oversight over deep neural networks (DNNs). Adversarial examples can be useful by helping to reveal weaknesses in DNNs, but they can be difficult to interpret or draw actionable conclusions from. Some previous works have proposed using human-interpretable adversarial attacks, including copy/paste attacks in which one natural image pasted into another causes an unexpected misclassification. We build on these with two contributions. First, we introduce Search for Natural Adversarial Features Using Embeddings (SNAFUE), which offers a fully automated method for finding copy/paste attacks. Second, we use SNAFUE to red team an ImageNet classifier. We reproduce copy/paste attacks from previous works and find hundreds of other easily-describable vulnerabilities, all without a human in the loop. Code is available at https://github.com/thestephencasper/snafue
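The copy/paste attack described in the abstract can be illustrated with a minimal sketch: paste a natural source image ("patch") into a target image and check whether a classifier's prediction flips. This is not SNAFUE itself (which uses feature embeddings to select candidate patches automatically); the `classify` function here is a hypothetical stand-in for any image classifier.

```python
import numpy as np

def paste_patch(image, patch, top, left):
    """Return a copy of `image` (H x W x C uint8 array) with `patch`
    pasted at position (top, left), as in a copy/paste attack."""
    out = image.copy()
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w] = patch
    return out

def find_attack_positions(classify, image, patch, true_label, positions):
    """Paste `patch` at each candidate position and collect the positions
    where the classifier no longer predicts `true_label`.
    `classify` is any callable mapping an image array to a label."""
    hits = []
    for top, left in positions:
        attacked = paste_patch(image, patch, top, left)
        if classify(attacked) != true_label:
            hits.append((top, left))
    return hits

# Toy demonstration with a stand-in classifier (hypothetical, for
# illustration only): label 1 iff the mean pixel intensity exceeds 128.
image = np.zeros((32, 32, 3), dtype=np.uint8)           # plain image, label 0
patch = np.full((24, 24, 3), 255, dtype=np.uint8)       # bright patch
classify = lambda img: int(img.mean() > 128)
hits = find_attack_positions(classify, image, patch, true_label=0,
                             positions=[(0, 0)])
print(hits)  # the bright patch flips the toy classifier's prediction
```

In the paper's setting the classifier is an ImageNet model and the patches are natural images, so a successful position corresponds to an easily-describable vulnerability ("pasting X into Y causes misclassification").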
