Audiovisual SlowFast Networks for Video Recognition

Paper Title

Audiovisual SlowFast Networks for Video Recognition

Authors

Fanyi Xiao, Yong Jae Lee, Kristen Grauman, Jitendra Malik, Christoph Feichtenhofer

Abstract

We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we introduce DropPathway, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization to learn joint audiovisual features. We report state-of-the-art results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features. Code will be made available at: https://github.com/facebookresearch/SlowFast.
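The DropPathway idea described in the abstract (randomly dropping the Audio pathway during training) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fuse_with_drop_pathway`, the `drop_prob` parameter, and the element-wise sum used as the fusion step are all assumptions made for illustration, and features are plain Python lists rather than tensors.

```python
import random


def fuse_with_drop_pathway(visual, audio, drop_prob, training, rng=random):
    """Fuse audio features into visual features, sometimes dropping audio.

    Hypothetical sketch: fusion is an element-wise sum (an assumption, not
    necessarily the fusion used in the paper). During training, the audio
    contribution is skipped with probability `drop_prob`.
    """
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must be in [0, 1]")

    # Drop the audio pathway for this training step with probability drop_prob.
    if training and rng.random() < drop_prob:
        return list(visual)

    # Otherwise (or at inference time) audio always contributes to the fusion.
    return [v + a for v, a in zip(visual, audio)]


if __name__ == "__main__":
    visual_feat = [1.0, 2.0]
    audio_feat = [0.5, 0.5]
    # At inference the audio pathway is never dropped.
    print(fuse_with_drop_pathway(visual_feat, audio_feat, 0.8, training=False))
```

In this sketch the drop decision is made once per call, standing in for one decision per training iteration; at inference time the audio features are always fused.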
