Paper Title

VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval

Paper Authors

Minuk Ma, Sunjae Yoon, Junyeong Kim, Youngjoon Lee, Sunghun Kang, Chang D. Yoo

Paper Abstract

Video Moment Retrieval (VMR) is the task of localizing the temporal moment in an untrimmed video that is specified by a natural language query. Several methods that require full supervision for training have been proposed for VMR. Unfortunately, acquiring a large number of training videos with labeled temporal boundaries for each query is a labor-intensive process. This paper explores methods for performing VMR in a weakly-supervised manner (wVMR): training is performed without temporal moment labels, using only the text query that describes a segment of the video. Existing methods for wVMR generate multi-scale proposals and apply query-guided attention mechanisms to highlight the most relevant proposal. To leverage the weak supervision, contrastive learning is used, which predicts higher scores for correct video-query pairs than for incorrect pairs. It has been observed that a large number of candidate proposals, a coarse query representation, and a one-way attention mechanism lead to blurry attention maps, which limit localization performance. To handle this issue, the Video-Language Alignment Network (VLANet) is proposed, which learns sharper attention by pruning out spurious candidate proposals and applying a multi-directional attention mechanism with a fine-grained query representation. The Surrogate Proposal Selection module selects a proposal based on its proximity to the query in the joint embedding space, substantially reducing the number of candidate proposals, which leads to a lower computational load and sharper attention. Next, the Cascaded Cross-modal Attention module considers dense feature interactions and multi-directional attention flow to learn the multi-modal alignment. VLANet is trained end-to-end using a contrastive loss that enforces semantically similar videos and queries to cluster together. Experiments show that the method achieves state-of-the-art performance on the Charades-STA and DiDeMo datasets.
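
To make two of the mechanisms named in the abstract more concrete, below is a minimal PyTorch-style sketch of (a) pruning candidate proposals by their cosine proximity to the query in a joint embedding space, in the spirit of the Surrogate Proposal Selection module, and (b) a hinge-style contrastive loss that scores matched video-query pairs above mismatched ones. The function names, tensor shapes, margin value, and top-k pruning rule are illustrative assumptions, not the authors' actual implementation.

    import torch
    import torch.nn.functional as F

    def select_surrogate_proposals(proposal_emb, query_emb, k=4):
        # proposal_emb: (num_proposals, dim) candidate moment embeddings
        # query_emb:    (dim,) sentence-level query embedding
        # Keep the k proposals closest to the query in the joint
        # embedding space (top-k rule is an assumption for illustration).
        sim = F.cosine_similarity(proposal_emb, query_emb.unsqueeze(0), dim=1)
        topk = sim.topk(min(k, sim.numel())).indices
        return proposal_emb[topk]

    def contrastive_loss(score_pos, score_neg, margin=0.1):
        # score_pos: (batch,) similarities of correct video-query pairs
        # score_neg: (batch,) similarities of mismatched pairs
        # Penalize cases where a mismatched pair scores within `margin`
        # of (or above) the matched pair.
        return F.relu(margin - score_pos + score_neg).mean()

    # Toy usage with random features.
    proposals = torch.randn(32, 256)   # 32 candidate moments
    query = torch.randn(256)
    pruned = select_surrogate_proposals(proposals, query, k=4)  # (4, 256)

    pos_scores = torch.rand(8)
    neg_scores = torch.rand(8)
    loss = contrastive_loss(pos_scores, neg_scores)

As the abstract notes, pruning of this kind also lowers the computational load of the subsequent cross-modal attention, since attention need only be computed over the surviving surrogate proposals.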
