Paper Title
Learning Visual Voice Activity Detection with an Automatically Annotated Dataset
Paper Authors
Paper Abstract
Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild -- WildVVAD -- based on combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.
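The automatic annotation idea in the abstract (combining A-VAD with face detection and tracking to weakly label video segments) can be illustrated with a minimal Python sketch. Everything below is an assumption for illustration, not the paper's exact procedure: the energy-threshold A-VAD is a simple stand-in for whichever audio detector is actually used, the per-frame face counts are presumed to come from an off-the-shelf face detector/tracker, and the function names (`energy_vad`, `label_segments`), the single-visible-face positive rule, and all thresholds are hypothetical.

```python
import numpy as np

def energy_vad(audio, sr, frame_len=0.02, threshold_db=-40.0):
    """Toy energy-based A-VAD: one boolean per audio frame.

    A placeholder for a real audio voice activity detector; frames whose
    RMS level exceeds threshold_db (dBFS) are marked as speech.
    """
    hop = int(sr * frame_len)
    n_frames = len(audio) // hop
    frames = audio[: n_frames * hop].reshape(n_frames, hop)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    return 20.0 * np.log10(rms + 1e-12) > threshold_db

def label_segments(vad_flags, face_counts, min_len=25):
    """Assign weak labels to contiguous runs of video frames.

    vad_flags   -- per-video-frame speech decision (from the A-VAD)
    face_counts -- per-video-frame number of tracked faces
    min_len     -- discard runs shorter than this many frames

    Assumed labeling rule: a run is 'speaking' when speech is detected
    and exactly one face is visible, 'silent' when no speech is detected
    and a face is visible; ambiguous frames (e.g. speech with several
    faces) are dropped. Returns (start, end, label) tuples.
    """
    labels = []
    start, current = 0, None
    for i, (speech, faces) in enumerate(zip(vad_flags, face_counts)):
        if speech and faces == 1:
            tag = "speaking"
        elif not speech and faces >= 1:
            tag = "silent"
        else:
            tag = None  # ambiguous frame: excluded from the dataset
        if tag != current:
            if current is not None and i - start >= min_len:
                labels.append((start, i, current))
            start, current = i, tag
    if current is not None and len(vad_flags) - start >= min_len:
        labels.append((start, len(vad_flags), current))
    return labels
```

In practice, the per-audio-frame A-VAD decisions would first be resampled to the video frame rate so they align one-to-one with the face tracks; the resulting labeled face-track segments then serve as training samples for the landmark-based and optical-flow-based V-VAD models.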