Paralinguistic Privacy Protection at the Edge

Paper title

Paralinguistic Privacy Protection at the Edge

Authors

Ranya Aloufi, Hamed Haddadi, David Boyle

Abstract

Voice user interfaces and digital assistants are rapidly entering our lives and becoming singular touch points spanning our devices. These always-on services capture and transmit our audio data to powerful cloud services for further processing and subsequent actions. Our voices and raw audio signals collected through these devices contain a host of sensitive paralinguistic information that is transmitted to service providers regardless of deliberate or false triggers. As our emotional patterns and sensitive attributes such as our identity, gender, and well-being are easily inferred using deep acoustic models, we encounter a new generation of privacy risks by using these services. One approach to mitigating the risk of paralinguistic-based privacy breaches is to combine cloud-based processing with privacy-preserving, on-device paralinguistic information learning and filtering before transmitting voice data. In this paper, we introduce EDGY, a configurable, lightweight, disentangled representation learning framework that transforms and filters high-dimensional voice data to identify and contain sensitive attributes at the edge prior to offloading to the cloud. We evaluate EDGY's on-device performance and explore optimization techniques, including model quantization and knowledge distillation, to enable private, accurate, and efficient representation learning on resource-constrained devices. Our results show that EDGY runs in tens of milliseconds with 0.2% relative improvement in "zero-shot" ABX score or minimal performance penalties of approximately 5.95% word error rate (WER) in learning linguistic representations from raw voice signals, using a CPU and a single-core ARM processor without specialized hardware.
