Paper Title

Latent-Domain Predictive Neural Speech Coding

Authors

Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, Yan Lu

Abstract

Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs employ either acoustic features or learned blind features with a convolutional neural network for encoding, by which there are still temporal redundancies within encoded features. This paper introduces latent-domain predictive coding into the VQ-VAE framework to fully remove such redundancies and proposes the TF-Codec for low-latency neural speech coding in an end-to-end manner. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames so that temporal correlations are further removed. Moreover, we introduce a learnable compression on the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is proposed to better model the latent distributions with rate constraint. Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps. Numerous studies are conducted to demonstrate the effectiveness of these techniques.
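The abstract's differentiable vector quantization, based on distance-to-soft mapping and Gumbel-Softmax, can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, codebook size, and temperature below are illustrative assumptions. The idea shown: negative squared distances to codewords serve as logits, Gumbel noise plus a softmax yields a differentiable soft assignment, and the soft code is a convex combination of codewords.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_vq(latent, codebook, tau=1.0, rng=rng):
    """Hypothetical sketch of distance-to-soft vector quantization.

    latent:   (d,) feature vector to quantize
    codebook: (K, d) learnable codewords
    tau:      softmax temperature (annealed toward hard assignment in training)
    """
    # Distance-to-soft mapping: closer codewords get larger logits.
    sq_dist = np.sum((codebook - latent) ** 2, axis=1)
    logits = -sq_dist
    # Gumbel noise makes the (soft) sampling step differentiable w.r.t. logits.
    gumbel = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    scores = np.exp((logits + gumbel) / tau)
    probs = scores / scores.sum()
    # Soft code: convex combination of codewords; at inference one would
    # typically take the argmax codeword instead (straight-through style).
    quantized = probs @ codebook
    return quantized, probs

codebook = rng.normal(size=(8, 4))  # assumed toy size: 8 codewords, dim 4
latent = rng.normal(size=4)
quantized, probs = gumbel_softmax_vq(latent, codebook)
```

In a trained codec the assignment probabilities also feed a rate constraint, since the entropy of the codeword usage bounds the bitrate; this sketch only shows the quantization step itself.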
