Paper Title
Fine-grained Iterative Attention Network for Temporal Language Localization in Videos
Paper Authors
Paper Abstract
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query. To tackle this task, designing an effective model to extract grounding information from both visual and textual modalities is crucial. However, most previous attempts in this field only focus on unidirectional interactions from video to query, which emphasize which words to listen to and attend to sentence information via vanilla soft attention, but clues from query-by-video interactions implying where to look are not taken into consideration. In this paper, we propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction. Specifically, in the iterative attention module, each word in the query is first enhanced by attending to each frame in the video through fine-grained attention; then the video iteratively attends to the integrated query. Finally, both video and query information is utilized to provide a robust cross-modal representation for further moment localization. In addition, to better predict the target segment, we propose a content-oriented localization strategy instead of applying recent anchor-based localization. We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA. FIAN significantly outperforms the state-of-the-art approaches.
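To make the bilateral mechanism concrete, below is a minimal PyTorch sketch (not the authors' released code) of the two attention passes the abstract describes: each word first attends to every frame via fine-grained attention, and the video then attends to the integrated query. The scaled dot-product form, the residual connections, and all module and variable names are assumptions for illustration only.

    # A minimal sketch of the bilateral fine-grained attention described in
    # the abstract; names and the exact attention form are hypothetical.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BilateralFineGrainedAttention(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            # Projections for the word-to-frame pass (query attends to video).
            self.w2f_q = nn.Linear(dim, dim)
            self.w2f_k = nn.Linear(dim, dim)
            self.w2f_v = nn.Linear(dim, dim)
            # Projections for the frame-to-word pass (video attends to the
            # integrated, visually enhanced query).
            self.f2w_q = nn.Linear(dim, dim)
            self.f2w_k = nn.Linear(dim, dim)
            self.f2w_v = nn.Linear(dim, dim)

        @staticmethod
        def attend(q, k, v):
            # Standard scaled dot-product attention (an assumption here).
            scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
            return torch.matmul(F.softmax(scores, dim=-1), v)

        def forward(self, words, frames):
            # words:  (B, Lw, D) sentence-query word features
            # frames: (B, Lv, D) untrimmed-video frame features
            # Step 1: each word attends to every frame (fine-grained
            # attention), giving visually enhanced word representations.
            enhanced_words = words + self.attend(
                self.w2f_q(words), self.w2f_k(frames), self.w2f_v(frames))
            # Step 2: the video then attends to the integrated query,
            # yielding a query-aware video representation that downstream
            # layers could use for moment localization.
            attended_frames = frames + self.attend(
                self.f2w_q(frames), self.f2w_k(enhanced_words),
                self.f2w_v(enhanced_words))
            return enhanced_words, attended_frames

In the paper this exchange is applied iteratively; stacking or looping such a module over several rounds would be one way to realize the "iterative attention" the abstract names, though the exact update rule is not specified here.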