Paper Title
Deep Variational Generative Models for Audio-visual Speech Separation
Paper Authors
Paper Abstract
In this paper, we are interested in audio-visual speech separation given a single-channel audio recording as well as visual information (lip movements) associated with each speaker. We propose an unsupervised technique based on audio-visual generative modeling of clean speech. More specifically, during training, a latent-variable generative model is learned from clean speech spectrograms using a variational auto-encoder (VAE). To better utilize the visual information, the posteriors of the latent variables are inferred from mixed speech (instead of clean speech) as well as the visual data. The visual modality also serves as a prior for the latent variables, through a visual network. At test time, the learned generative model (in both speaker-independent and speaker-dependent scenarios) is combined with an unsupervised non-negative matrix factorization (NMF) variance model for background noise. All the latent variables and noise parameters are then estimated by a Monte Carlo expectation-maximization algorithm. Our experiments show that the proposed unsupervised VAE-based method yields better separation performance than NMF-based approaches as well as a supervised deep-learning-based technique.
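To make the training-time model described in the abstract more concrete, the following is a minimal PyTorch sketch of how such an audio-visual VAE could be structured. This is not the authors' implementation: the class name AVVAE, the layer sizes, and the feature dimensions (513 STFT bins, a 128-dimensional lip embedding, a 32-dimensional latent space) are illustrative assumptions. The reconstruction term follows the complex-Gaussian (Itakura-Saito) likelihood commonly used in VAE-based speech models, the encoder is conditioned on the mixed speech and the visual data, and the visual network provides the prior over the latent variables; the test-time NMF noise model and the Monte Carlo EM inference are not shown.

import torch
import torch.nn as nn


class AVVAE(nn.Module):
    def __init__(self, n_freq=513, n_visual=128, n_latent=32, n_hidden=512):
        super().__init__()
        # Encoder q(z | mixture power-spectrogram frame, visual embedding)
        self.enc = nn.Sequential(nn.Linear(n_freq + n_visual, n_hidden), nn.Tanh())
        self.enc_mean = nn.Linear(n_hidden, n_latent)
        self.enc_logvar = nn.Linear(n_hidden, n_latent)
        # Visual prior network p(z | visual embedding)
        self.prior = nn.Sequential(nn.Linear(n_visual, n_hidden), nn.Tanh())
        self.prior_mean = nn.Linear(n_hidden, n_latent)
        self.prior_logvar = nn.Linear(n_hidden, n_latent)
        # Decoder p(s | z): log-variance of a zero-mean complex Gaussian per frequency bin
        self.dec = nn.Sequential(nn.Linear(n_latent, n_hidden), nn.Tanh())
        self.dec_logvar = nn.Linear(n_hidden, n_freq)

    def forward(self, mix_pow, v):
        # mix_pow: (batch, n_freq) power spectrogram of the mixed-speech frame
        # v:       (batch, n_visual) lip embedding for the same frame
        h = self.enc(torch.cat([mix_pow, v], dim=-1))
        q_mean, q_logvar = self.enc_mean(h), self.enc_logvar(h)
        z = q_mean + torch.randn_like(q_mean) * torch.exp(0.5 * q_logvar)  # reparameterization
        hp = self.prior(v)
        p_mean, p_logvar = self.prior_mean(hp), self.prior_logvar(hp)
        s_logvar = self.dec_logvar(self.dec(z))  # log-variance of the clean-speech STFT frame
        return q_mean, q_logvar, p_mean, p_logvar, s_logvar


def elbo_loss(clean_pow, q_mean, q_logvar, p_mean, p_logvar, s_logvar):
    # Itakura-Saito style reconstruction term from the complex-Gaussian likelihood
    # of the clean-speech power spectrogram (additive constants dropped).
    recon = torch.sum(clean_pow * torch.exp(-s_logvar) + s_logvar, dim=-1)
    # KL divergence between the diagonal Gaussians q(z | mixture, v) and p(z | v).
    kl = 0.5 * torch.sum(
        p_logvar - q_logvar
        + (q_logvar.exp() + (q_mean - p_mean) ** 2) / p_logvar.exp()
        - 1.0,
        dim=-1,
    )
    return (recon + kl).mean()


# Hypothetical usage on a batch of aligned STFT frames and lip embeddings:
# model = AVVAE()
# q_m, q_lv, p_m, p_lv, s_lv = model(mix_pow, lip_emb)
# loss = elbo_loss(clean_pow, q_m, q_lv, p_m, p_lv, s_lv)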