Paper Title
An Adaptive Black-box Backdoor Detection Method for Deep Neural Networks
Paper Authors
Paper Abstract
With the surge of Machine Learning (ML), a growing number of intelligent applications have been developed. Deep Neural Networks (DNNs) have demonstrated unprecedented performance across various fields such as medical diagnosis and autonomous driving. Although DNNs are widely employed in security-sensitive fields, they have been shown to be vulnerable to Neural Trojan (NT) attacks, which are controlled and activated by stealthy triggers. In this paper, we aim to design a robust and adaptive Trojan detection scheme that inspects whether a pre-trained model has been Trojaned before its deployment. Prior works overlook the intrinsic properties of the trigger distribution and try to reconstruct the trigger pattern using simple heuristics, i.e., stimulating the given model to produce incorrect outputs. As a result, their detection time and effectiveness are limited. We leverage the observation that pixel triggers typically exhibit spatial dependency and propose the first trigger-approximation-based black-box Trojan detection framework, which enables a fast and scalable search for the trigger in the input space. Furthermore, our approach can also detect Trojans embedded in the feature space, where certain filter transformations are used to activate the Trojan. We perform extensive experiments to investigate the performance of our approach across various datasets and ML models. Empirical results show that our approach achieves an ROC-AUC score of 0.93 on the public TrojAI dataset. Our code is available at https://github.com/xinqiaozhang/adatrojan