Paper Title

Frame-wise Cross-modal Matching for Video Moment Retrieval

Paper Authors

Haoyu Tang, Jihua Zhu, Meng Liu, Zan Gao, Zhiyong Cheng

Paper Abstract

Video moment retrieval aims to retrieve the moment in a video that corresponds to a given language query. The challenges of this task include 1) localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between the textual query and the video content. To tackle these problems, early approaches adopt sliding windows or uniform sampling to collect video clips first and then match each clip against the query. These strategies are time-consuming and often yield unsatisfactory localization accuracy because the length of the target moment is unpredictable. To avoid these limitations, researchers have recently attempted to predict the relevant moment boundaries directly, without generating video clips first. One mainstream approach is to build a multimodal feature vector from the target query and the video frames (e.g., by concatenation) and then apply a regression method on top of it for boundary detection. Although this approach has made some progress, we argue that these methods do not capture the cross-modal interactions between the query and video frames well. In this paper, we propose an Attentive Cross-modal Relevance Matching (ACRM) model that predicts temporal boundaries based on interaction modeling. In addition, an attention module is introduced to assign higher weights to query words with richer semantic cues, which are more important for finding relevant video content. Another contribution is an additional predictor that exploits frames inside the target moment during training to improve localization accuracy. Extensive experiments on two datasets, TACoS and Charades-STA, demonstrate the superiority of our method over several state-of-the-art methods. Ablation studies are also conducted to examine the effectiveness of the different modules in our ACRM model.
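The abstract describes three components: frame-wise cross-modal interaction modeling, word-level attention over the query, and an auxiliary predictor for frames inside the target moment. The PyTorch sketch below illustrates how such pieces could fit together. It is a minimal sketch, not the paper's actual implementation: all module names, feature dimensions, and the element-wise fusion used as the interaction step are our own illustrative assumptions.

```python
# Minimal sketch of frame-wise cross-modal matching with word-level attention.
# Architecture details (dimensions, fusion by element-wise product, linear heads)
# are illustrative assumptions, not the published ACRM design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameWiseMatcher(nn.Module):
    def __init__(self, video_dim=500, word_dim=300, hidden_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.word_proj = nn.Linear(word_dim, hidden_dim)
        # Word-level attention: scores each query word's importance (hypothetical form).
        self.word_attn = nn.Linear(hidden_dim, 1)
        # Boundary heads over per-frame fused features.
        self.start_head = nn.Linear(hidden_dim, 1)
        self.end_head = nn.Linear(hidden_dim, 1)
        # Auxiliary per-frame predictor: is this frame inside the target moment?
        self.inside_head = nn.Linear(hidden_dim, 1)

    def forward(self, frames, words):
        # frames: (B, T, video_dim) per-frame visual features
        # words:  (B, L, word_dim)  query word embeddings
        v = self.video_proj(frames)                  # (B, T, H)
        w = self.word_proj(words)                    # (B, L, H)
        # Attend over query words so semantically richer words get higher weight.
        attn = F.softmax(self.word_attn(w), dim=1)   # (B, L, 1)
        query = (attn * w).sum(dim=1, keepdim=True)  # (B, 1, H)
        # Frame-wise interaction: match every frame against the attended query.
        fused = v * query                            # (B, T, H)
        start_logits = self.start_head(fused).squeeze(-1)    # (B, T)
        end_logits = self.end_head(fused).squeeze(-1)        # (B, T)
        inside_logits = self.inside_head(fused).squeeze(-1)  # (B, T)
        return start_logits, end_logits, inside_logits
```

A hypothetical training objective combining boundary supervision with the internal-frame predictor might look as follows; the loss form and equal weighting are assumptions for illustration.

```python
model = FrameWiseMatcher()
frames = torch.randn(2, 128, 500)   # batch of 2 videos, 128 frames each
words = torch.randn(2, 12, 300)     # 12-word queries
start_idx = torch.tensor([10, 40])  # ground-truth start frame indices
end_idx = torch.tensor([50, 90])    # ground-truth end frame indices
inside_mask = torch.zeros(2, 128)   # 1.0 for frames inside the annotated moment
inside_mask[0, 10:51] = 1.0
inside_mask[1, 40:91] = 1.0

s, e, inside = model(frames, words)
loss = (F.cross_entropy(s, start_idx)
        + F.cross_entropy(e, end_idx)
        + F.binary_cross_entropy_with_logits(inside, inside_mask))
loss.backward()
```

The auxiliary inside-frame term supervises every frame rather than only the two boundary positions, which is one plausible reading of how the additional predictor described in the abstract could densify the training signal.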
