Paper Title

A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention

Paper Authors

Binjie Zhang, Yu Li, Chun Yuan, Dejing Xu, Pin Jiang, Ying Shan

Paper Abstract

The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video. Though continuous progress has been made in this field, some issues still need to be resolved. First, most existing methods rely on a combination of multiple complicated modules to solve the task. Second, due to the semantic gap between the two modalities, aligning the information at different granularities (local and global) between the video and the language is significant, yet remains less addressed. Last, previous works do not consider the inevitable annotation bias caused by the ambiguity of action boundaries. To address these limitations, we propose a simple two-branch Cross-Modality Attention (CMA) module with an intuitive structural design, which alternately modulates the two modalities to better match their information both locally and globally. In addition, we introduce a new task-specific regression loss function that improves temporal grounding accuracy by alleviating the impact of annotation bias. We conduct extensive experiments to validate our method, and the results show that, with just this simple model, it outperforms the state of the art on both the Charades-STA and ActivityNet Captions datasets.
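
To make the two-branch idea more concrete, below is a minimal PyTorch sketch (not the authors' released code) of a cross-modality attention block in which clip-level video features attend to word-level query features and vice versa, with a residual connection on each branch. The hidden size, head count, and the name CrossModalityAttention are illustrative assumptions; the paper's actual CMA design may differ in its internal layers and how the two branches are alternated.

```python
# Minimal sketch of a two-branch cross-modality attention block
# (illustrative assumption, not the paper's exact architecture).
import torch
import torch.nn as nn


class CrossModalityAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Video branch: video clips act as queries, words as keys/values.
        self.video_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Language branch: words act as queries, video clips as keys/values.
        self.lang_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_norm = nn.LayerNorm(dim)
        self.lang_norm = nn.LayerNorm(dim)

    def forward(self, video_feats: torch.Tensor, lang_feats: torch.Tensor):
        # video_feats: (batch, num_clips, dim); lang_feats: (batch, num_words, dim)
        v2l, _ = self.video_attn(video_feats, lang_feats, lang_feats)
        l2v, _ = self.lang_attn(lang_feats, video_feats, video_feats)
        # Residual connections keep each modality's original information.
        video_out = self.video_norm(video_feats + v2l)
        lang_out = self.lang_norm(lang_feats + l2v)
        return video_out, lang_out


if __name__ == "__main__":
    cma = CrossModalityAttention()
    video = torch.randn(2, 64, 512)   # 64 clip-level features per video
    query = torch.randn(2, 12, 512)   # 12 word-level features per sentence
    v, q = cma(video, query)
    print(v.shape, q.shape)  # torch.Size([2, 64, 512]) torch.Size([2, 12, 512])
```

The residual-plus-normalization pattern here is a standard choice for letting each branch refine, rather than replace, its own modality's features; the grounding head and the task-specific regression loss described in the abstract are not sketched, since their exact forms are specified in the paper itself.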
