Paper Title

DeepSonar: Towards Effective and Robust Detection of AI-Synthesized Fake Voices

Authors

Run Wang, Felix Juefei-Xu, Yihao Huang, Qing Guo, Xiaofei Xie, Lei Ma, Yang Liu

Abstract

With the recent advances in voice synthesis, AI-synthesized fake voices are indistinguishable to human ears and are widely applied to produce realistic and natural DeepFakes, posing real threats to our society. However, effective and robust detectors for synthesized fake voices are still in their infancy and are not ready to fully tackle this emerging threat. In this paper, we devise a novel approach, named DeepSonar, based on monitoring the neuron behaviors of a speaker recognition (SR) system, i.e., a deep neural network (DNN), to discern AI-synthesized fake voices. Layer-wise neuron behaviors provide an important insight for meticulously catching the differences among inputs and are widely employed in building safe, robust, and interpretable DNNs. In this work, we leverage the power of layer-wise neuron activation patterns with the conjecture that they can capture the subtle differences between real and AI-synthesized fake voices, providing a cleaner signal to classifiers than raw inputs. Experiments are conducted on three datasets (including commercial products from Google, Baidu, etc.) covering both English and Chinese to corroborate the high detection rates (98.1% average accuracy) and low false alarm rates (about 2% error rate) of DeepSonar in discerning fake voices. Furthermore, extensive experimental results also demonstrate its robustness against manipulation attacks (e.g., voice conversion and additive real-world noise). Our work further offers a new insight: adopting neuron behaviors for effective and robust AI-aided multimedia fake forensics as an inside-out approach, rather than being motivated and swayed by the various artifacts introduced when synthesizing fakes.
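The core idea in the abstract, using layer-wise neuron activation patterns of a speaker recognition DNN as the feature vector for a real-vs-fake classifier, can be sketched as follows. This is not the authors' implementation: the toy network, its weights, and the input are hypothetical stand-ins, and a real pipeline would extract activations from a trained SR model on actual voice features.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 3-layer feedforward network standing in for a trained speaker
# recognition (SR) model. Sizes and weights are hypothetical.
layer_sizes = [(64, 32), (32, 16), (16, 8)]
weights = [rng.standard_normal(size) * 0.1 for size in layer_sizes]

def layerwise_activations(x):
    """Run x through the toy network, recording each layer's ReLU output."""
    acts = []
    h = x
    for W in weights:
        h = np.maximum(h @ W, 0.0)  # ReLU activation
        acts.append(h)
    return acts

def activation_pattern(x, threshold=0.0):
    """Binary 'which neurons fired' pattern per layer, concatenated into one
    feature vector -- the kind of layer-wise signal that would be fed to a
    binary real/fake classifier instead of the raw waveform."""
    acts = layerwise_activations(x)
    return np.concatenate([(a > threshold).astype(np.float32) for a in acts])

voice_features = rng.standard_normal(64)  # stand-in for one voice's features
pattern = activation_pattern(voice_features)
print(pattern.shape)  # one entry per neuron across all layers: 32 + 16 + 8
```

The pattern vectors for a labeled set of real and synthesized voices would then train an ordinary supervised classifier; the paper's conjecture is that these activation patterns separate the two classes more cleanly than the raw audio does.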
