Paper title
A simple defense against adversarial attacks on heatmap explanations
Paper authors
Paper abstract
With machine learning models being used for more sensitive applications, we rely on interpretability methods to prove that no discriminating attributes were used for classification. A potential concern is so-called "fair-washing": manipulating a model such that the features actually used are hidden and more innocuous features are shown to be important instead. In this work we present an effective defense against such adversarial attacks on neural networks. Through a simple aggregation of multiple explanation methods, the network becomes robust against manipulation. This holds even when the attacker has exact knowledge of the model weights and the explanation methods used.
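To make the aggregation idea from the abstract concrete, below is a minimal sketch assuming a PyTorch image classifier. The choice of explanation methods (vanilla gradient and SmoothGrad), the normalization, and all function names are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch: aggregate several heatmap explanation methods so that an attack
# tuned against any single method must simultaneously fool all of them.
import torch
import torch.nn as nn

def vanilla_gradient(model, x, target):
    """Saliency map: |gradient| of the target logit w.r.t. the input."""
    x = x.clone().requires_grad_(True)
    model.zero_grad()
    score = model(x)[0, target]
    score.backward()
    return x.grad.detach().abs()

def smoothgrad(model, x, target, n=25, sigma=0.1):
    """SmoothGrad: average vanilla gradients over noisy copies of the input."""
    maps = [vanilla_gradient(model, x + sigma * torch.randn_like(x), target)
            for _ in range(n)]
    return torch.stack(maps).mean(dim=0)

def normalize(h):
    """Scale a heatmap to [0, 1] so different methods are comparable."""
    h = h - h.min()
    return h / (h.max() + 1e-12)

def aggregated_heatmap(model, x, target, methods):
    """Average the normalized heatmaps of several explanation methods."""
    heatmaps = [normalize(m(model, x, target)) for m in methods]
    return torch.stack(heatmaps).mean(dim=0)

# Usage with a toy model (the architecture is a placeholder):
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()
x = torch.randn(1, 3, 32, 32)
heatmap = aggregated_heatmap(model, x, target=0,
                             methods=[vanilla_gradient, smoothgrad])
print(heatmap.shape)  # torch.Size([1, 3, 32, 32])
```

Averaging after per-method normalization is one plausible design choice here: it keeps any single method from dominating the aggregate and forces an attacker to shift mass in every constituent heatmap at once.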