Paper Title
V-Cloak: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization
Paper Authors
Paper Abstract
Voice data generated on instant messaging or social media applications contains unique user voiceprints that may be abused by malicious adversaries for identity inference or identity theft. Existing voice anonymization techniques, e.g., signal processing and voice conversion/synthesis, suffer from degradation of perceptual quality. In this paper, we develop a voice anonymization system, named V-Cloak, which attains real-time voice anonymization while preserving the intelligibility, naturalness and timbre of the audio. Our designed anonymizer features a one-shot generative model that modulates the features of the original audio at different frequency levels. We train the anonymizer with a carefully designed loss function. Apart from the anonymity loss, we further incorporate an intelligibility loss and a psychoacoustics-based naturalness loss. The anonymizer can realize untargeted and targeted anonymization to achieve the anonymity goals of unidentifiability and unlinkability. We have conducted extensive experiments on four datasets, i.e., LibriSpeech (English), AISHELL (Chinese), CommonVoice (French) and CommonVoice (Italian), five Automatic Speaker Verification (ASV) systems (including two DNN-based, two statistical and one commercial ASV), and eleven Automatic Speech Recognition (ASR) systems (for different languages). Experimental results confirm that V-Cloak outperforms five baselines in terms of anonymity performance. We also demonstrate that V-Cloak trained only on the VoxCeleb1 dataset against ECAPA-TDNN ASV and DeepSpeech2 ASR has transferable anonymity against other ASVs and cross-language intelligibility for other ASRs. Furthermore, we verify the robustness of V-Cloak against various de-noising techniques and adaptive attacks. Hopefully, V-Cloak may provide a cloak for us in a prism world.