Paper Title

Improved Lite Audio-Visual Speech Enhancement

Authors

Shang-Yi Chuang, Hsin-Min Wang, Yu Tsao

Abstract

Numerous studies have investigated the effectiveness of audio-visual multimodal learning for speech enhancement (AVSE) tasks, seeking a solution that uses visual data as auxiliary and complementary input to reduce the noise of noisy speech signals. Recently, we proposed a lite audio-visual speech enhancement (LAVSE) algorithm for a car-driving scenario. Compared to conventional AVSE systems, LAVSE requires less online computation and, to some extent, addresses the user privacy concerns associated with facial data. In this study, we extend LAVSE to improve its ability to address three practical issues often encountered in implementing AVSE systems, namely, the additional cost of processing visual data, audio-visual asynchronization, and low-quality visual data. The proposed system, termed improved LAVSE (iLAVSE), uses a convolutional recurrent neural network architecture as the core AVSE model. We evaluate iLAVSE on the Taiwan Mandarin speech with video dataset. Experimental results confirm that, compared to conventional AVSE systems, iLAVSE can effectively overcome the aforementioned three practical issues and improve enhancement performance. The results also confirm that iLAVSE is suitable for real-world scenarios, where high-quality audio-visual sensors may not always be available.
