Paper Title
Learning Visual Voice Activity Detection with an Automatically Annotated Dataset
Paper Authors
Paper Abstract
Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild -- WildVVAD -- based on combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.
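The automatic annotation idea in the abstract (combining A-VAD with face detection and tracking to weakly label video segments) can be illustrated with a minimal Python sketch. Everything below is an assumption for illustration, not the paper's exact procedure: the energy-threshold A-VAD is a simple stand-in for whichever audio detector is actually used, the per-frame face counts are presumed to come from an off-the-shelf face detector/tracker, and the function names (`energy_vad`, `label_segments`), the single-visible-face positive rule, and all thresholds are hypothetical.

```python
import numpy as np

def energy_vad(audio, sr, frame_len=0.02, threshold_db=-40.0):
    """Toy energy-based A-VAD: one boolean per audio frame.

    A placeholder for a real audio voice activity detector; frames whose
    RMS level exceeds threshold_db (dBFS) are marked as speech.
    """
    hop = int(sr * frame_len)
    n_frames = len(audio) // hop
    frames = audio[: n_frames * hop].reshape(n_frames, hop)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    return 20.0 * np.log10(rms + 1e-12) > threshold_db

def label_segments(vad_flags, face_counts, min_len=25):
    """Assign weak labels to contiguous runs of video frames.

    vad_flags   -- per-video-frame speech decision (from the A-VAD)
    face_counts -- per-video-frame number of tracked faces
    min_len     -- discard runs shorter than this many frames

    Assumed labeling rule: a run is 'speaking' when speech is detected
    and exactly one face is visible, 'silent' when no speech is detected
    and a face is visible; ambiguous frames (e.g. speech with several
    faces) are dropped. Returns (start, end, label) tuples.
    """
    labels = []
    start, current = 0, None
    for i, (speech, faces) in enumerate(zip(vad_flags, face_counts)):
        if speech and faces == 1:
            tag = "speaking"
        elif not speech and faces >= 1:
            tag = "silent"
        else:
            tag = None  # ambiguous frame: excluded from the dataset
        if tag != current:
            if current is not None and i - start >= min_len:
                labels.append((start, i, current))
            start, current = i, tag
    if current is not None and len(vad_flags) - start >= min_len:
        labels.append((start, len(vad_flags), current))
    return labels
```

In practice, the per-audio-frame A-VAD decisions would first be resampled to the video frame rate so they align one-to-one with the face tracks; the resulting labeled face-track segments then serve as training samples for the landmark-based and optical-flow-based V-VAD models.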