Paper Title
HaS-Nets: A Heal and Select Mechanism to Defend DNNs Against Backdoor Attacks for Data Collection Scenarios
Paper Authors
Paper Abstract
We have witnessed a continuing arms race between backdoor attacks on Deep Neural Networks (DNNs) and the corresponding defense strategies. Most state-of-the-art defenses rely on the statistical sanitization of the "inputs" or "latent DNN representations" to capture trojan behaviour. In this paper, we first challenge the robustness of such recently reported defenses by introducing a novel variant of the targeted backdoor attack, called the "low-confidence backdoor attack". We also propose a novel defense technique, called "HaS-Nets". The "low-confidence backdoor attack" exploits the confidence labels assigned to poisoned training samples: it gives them low values to hide their presence from the defender, both during training and inference. We evaluate the attack against four state-of-the-art defense methods, viz., STRIP, Gradient-Shaping, Februus and ULP-defense, and achieve Attack Success Rates (ASRs) of 99%, 63.73%, 91.2% and 80%, respectively. We next present "HaS-Nets" to resist backdoor insertion in the network during training, using a reasonably small healing dataset, approximately 2% to 15% of the full training data, to heal the network at each iteration. We evaluate it on different datasets (Fashion-MNIST, CIFAR-10, Consumer Complaint and Urban Sound) and network architectures (MLPs, 2D-CNNs, 1D-CNNs). Our experiments show that "HaS-Nets" can decrease ASRs from over 90% to less than 15%, independent of the dataset, attack configuration and network architecture.
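To make the two quantities in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a "low-confidence" label is a soft label placing a modest probability on the attacker's target class and spreading the remainder uniformly over the other classes, and that ASR is the fraction of triggered inputs from non-target classes that the model assigns to the target class. The function names, the corner-patch trigger and the confidence value 0.4 are all hypothetical choices made for illustration.

```python
import numpy as np


def stamp_trigger(images, patch_value=1.0, size=3):
    """Return a copy of `images` (shape N x H x W) with a hypothetical
    size x size trigger patch written into the bottom-right corner."""
    poisoned = images.copy()
    poisoned[:, -size:, -size:] = patch_value
    return poisoned


def low_confidence_labels(n, num_classes, target, confidence):
    """Build n soft labels that put `confidence` on the target class and
    spread the rest uniformly over the other classes (an assumption about
    how low-confidence labels could be constructed)."""
    if not (1.0 / num_classes < confidence <= 1.0):
        raise ValueError("confidence must exceed 1/num_classes and be at most 1")
    rest = (1.0 - confidence) / (num_classes - 1)
    labels = np.full((n, num_classes), rest)
    labels[:, target] = confidence
    return labels


def attack_success_rate(predicted, true_labels, target):
    """Fraction of triggered samples from non-target classes that are
    predicted as the target class."""
    predicted = np.asarray(predicted)
    true_labels = np.asarray(true_labels)
    mask = true_labels != target
    if not mask.any():
        raise ValueError("no samples from non-target classes")
    return float(np.mean(predicted[mask] == target))


# Usage: poison four blank 5x5 images toward class 3 with confidence 0.4.
clean = np.zeros((4, 5, 5))
poisoned_images = stamp_trigger(clean)
poisoned_labels = low_confidence_labels(4, num_classes=10, target=3, confidence=0.4)

# ASR on hypothetical model predictions for triggered test inputs.
asr = attack_success_rate(predicted=[3, 3, 1, 3], true_labels=[0, 1, 2, 3], target=3)
```

Under this assumption the target class still holds the largest probability in each poisoned label, so training can push triggered inputs toward it, while the label looks far less extreme than a one-hot label with confidence 1.0.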