Paper Title
Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding
Paper Authors
Paper Abstract
Temporal sentence grounding aims to localize a target segment in an untrimmed video according to the semantics of a given sentence query. Most previous works focus on learning frame-level features of each whole frame in the entire video and directly match them with the textual information. Such frame-level feature extraction hinders these methods from distinguishing ambiguous video frames with complicated contents and subtle appearance differences, thus limiting their performance. To differentiate fine-grained appearance similarities among consecutive frames, some state-of-the-art methods additionally employ a detection model such as Faster R-CNN to obtain detailed object-level features in each frame and filter out redundant background contents. However, these methods lack motion analysis, since the object detection module in Faster R-CNN has no temporal modeling. To alleviate the above limitations, in this paper we propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features to better reason about the spatial-temporal object relations and accurately model the activity among consecutive frames. Specifically, we first develop three individual branches for motion, appearance, and 3D encoding to learn fine-grained motion-guided, appearance-guided, and 3D-aware object features, respectively. Then, the motion and appearance information from the corresponding branches are associated to enhance the 3D-aware features for the final precise grounding. Extensive experiments on three challenging datasets (ActivityNet Captions, Charades-STA, and TACoS) demonstrate that the proposed MA3SRN model achieves new state-of-the-art performance.
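The abstract describes a three-branch design in which motion and appearance cues are associated with 3D-aware object features before grounding. The snippet below is a minimal, hypothetical sketch of that fusion idea only; it is not the authors' implementation, and all module names, tensor shapes, and the use of cross-attention for the "association" step are assumptions made for illustration.

```python
# Hypothetical sketch of a three-branch fusion (motion, appearance, 3D-aware
# object features), loosely following the abstract's description. The real
# MA3SRN model is not released here; shapes and modules are assumed.
import torch
import torch.nn as nn


class ThreeBranchFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Lightweight per-branch encoders, standing in for optical-flow-guided
        # motion features, Faster R-CNN appearance features, and 3D-aware
        # object features extracted upstream.
        self.motion_enc = nn.Linear(dim, dim)
        self.appearance_enc = nn.Linear(dim, dim)
        self.object3d_enc = nn.Linear(dim, dim)
        # Cross-attention as a stand-in for associating motion/appearance
        # information with the 3D-aware object features.
        self.motion_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.appearance_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, motion_feat, appearance_feat, object3d_feat):
        # Each input: (batch, num_objects, dim) object-level features.
        m = self.motion_enc(motion_feat)
        a = self.appearance_enc(appearance_feat)
        o = self.object3d_enc(object3d_feat)
        # Enhance the 3D-aware features with motion and appearance context.
        o_m, _ = self.motion_attn(o, m, m)
        o_a, _ = self.appearance_attn(o, a, a)
        enhanced = self.fuse(torch.cat([o, o_m, o_a], dim=-1))
        return enhanced  # would feed a grounding head in a full model


if __name__ == "__main__":
    # Random object-level features: batch=2, 10 objects per frame, 256-d.
    rand = lambda: torch.randn(2, 10, 256)
    out = ThreeBranchFusion()(rand(), rand(), rand())
    print(out.shape)  # torch.Size([2, 10, 256])
```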