Paper Title
AMRConvNet: AMR-Coded Speech Enhancement Using Convolutional Neural Networks
Paper Authors
Paper Abstract
Speech is converted to digital signals using speech coding for efficient transmission. However, this often lowers the quality and bandwidth of the speech. This paper explores the application of convolutional neural networks to Artificial Bandwidth Expansion (ABE) and speech enhancement on coded speech, particularly the Adaptive Multi-Rate (AMR) codec used in 2G cellular phone calls. We introduce AMRConvNet: a convolutional neural network that performs ABE and speech enhancement on speech encoded with AMR. The model operates directly in the time domain for both input and output speech, but is optimized using a combined time-domain reconstruction loss and frequency-domain perceptual loss. AMRConvNet yields an average improvement of 0.425 Mean Opinion Score - Listening Quality Objective (MOS-LQO) points for the AMR bitrate of 4.75k, and 0.073 MOS-LQO points for the AMR bitrate of 12.2k. AMRConvNet also shows robustness across AMR bitrate inputs. Finally, an ablation test showed that our combined time-domain and frequency-domain loss leads to slightly higher MOS-LQO and faster training convergence than using either loss alone.
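The abstract does not spell out the exact formulation of the combined objective. A minimal sketch of one plausible realization, assuming an L1 waveform reconstruction term plus a log-STFT-magnitude spectral term with hypothetical weighting coefficients `alpha` and `beta` (all names and loss choices here are illustrative assumptions, not the paper's stated method), might look like this in PyTorch:

```python
import torch
import torch.nn.functional as F


def combined_loss(enhanced, target, n_fft=512, hop_length=128, alpha=1.0, beta=1.0):
    """Combined time-domain and frequency-domain loss (illustrative sketch).

    enhanced, target: (batch, samples) waveforms.
    The specific terms, STFT settings, and weights are assumptions;
    the paper only states that both domains are combined.
    """
    # Time-domain reconstruction term: mean absolute error on raw samples.
    time_loss = F.l1_loss(enhanced, target)

    # Frequency-domain perceptual term: compare log-compressed STFT magnitudes.
    window = torch.hann_window(n_fft, device=enhanced.device)
    spec_enh = torch.stft(enhanced, n_fft, hop_length=hop_length,
                          window=window, return_complex=True)
    spec_tgt = torch.stft(target, n_fft, hop_length=hop_length,
                          window=window, return_complex=True)
    freq_loss = F.l1_loss(torch.log1p(spec_enh.abs()),
                          torch.log1p(spec_tgt.abs()))

    return alpha * time_loss + beta * freq_loss


# Usage: compare the network output against the clean wideband reference.
enhanced = torch.randn(2, 16000)  # placeholder for model(amr_coded_speech)
target = torch.randn(2, 16000)    # placeholder for the clean reference
loss = combined_loss(enhanced, target)
loss.backward()
```

Operating on raw waveforms while penalizing errors in both domains lets the time-domain term keep the signal aligned sample-by-sample while the spectral term emphasizes perceptually relevant magnitude structure, which is consistent with the abstract's finding that the combination converges faster than either loss alone.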