Paper Title

Boundary-sensitive Pre-training for Temporal Localization in Videos

Paper Authors

Mengmeng Xu, Juan-Manuel Perez-Rua, Victor Escorcia, Brais Martinez, Xiatian Zhu, Li Zhang, Bernard Ghanem, Tao Xiang

Paper Abstract

Many video analysis tasks require temporal localization, and thus detection of content changes. However, most existing models developed for these tasks are pre-trained on general video action classification tasks. This is because large-scale annotation of temporal boundaries in untrimmed videos is expensive. Therefore, no suitable datasets exist for temporal boundary-sensitive pre-training. In this paper, for the first time, we investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task. Instead of relying on costly manual annotations of temporal boundaries, we propose to synthesize temporal boundaries in existing video action classification datasets. With the synthesized boundaries, BSP can be simply conducted by classifying the boundary types. This enables the learning of video representations that are much more transferable to downstream temporal localization tasks. Extensive experiments show that the proposed BSP is superior and complementary to the existing action classification based pre-training counterpart, and achieves new state-of-the-art performance on several temporal localization tasks.
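
The abstract sketches the core recipe: perturb or splice trimmed action-classification clips to manufacture temporal boundaries, then pre-train by classifying the boundary type. Below is a minimal, hypothetical Python/NumPy sketch of that boundary-synthesis step. The boundary taxonomy used here (no_boundary, action_splice, speed_change) and all function names are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

# Hypothetical boundary types for illustration; the paper's actual taxonomy may differ.
BOUNDARY_TYPES = {0: "no_boundary", 1: "action_splice", 2: "speed_change"}

def splice_boundary(clip_a, clip_b, rng):
    """Concatenate the head of one clip with the tail of another,
    creating a content-change boundary at the junction."""
    cut_a = rng.integers(1, len(clip_a))
    cut_b = rng.integers(1, len(clip_b))
    frames = np.concatenate([clip_a[:cut_a], clip_b[cut_b:]], axis=0)
    return frames, cut_a  # boundary sits at the junction index

def speed_change_boundary(clip, rng, factor=2):
    """Subsample the second half of a clip to simulate a playback-speed
    change, which also induces a temporal boundary."""
    cut = rng.integers(1, len(clip) - factor)
    normal, fast = clip[:cut], clip[cut::factor]
    return np.concatenate([normal, fast], axis=0), cut

def make_bsp_sample(clips, rng):
    """Draw one synthetic training sample: (frames, boundary position, type label)."""
    label = rng.integers(0, len(BOUNDARY_TYPES))
    a = clips[rng.integers(len(clips))]
    if label == 0:                       # untouched clip, no boundary
        return a, None, label
    if label == 1:                       # splice two clips together
        b = clips[rng.integers(len(clips))]
        frames, pos = splice_boundary(a, b, rng)
        return frames, pos, label
    frames, pos = speed_change_boundary(a, rng)
    return frames, pos, label

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in "clips": (frames, height, width, channels) arrays.
    clips = [rng.random((32, 8, 8, 3)) for _ in range(4)]
    frames, pos, label = make_bsp_sample(clips, rng)
    print(frames.shape, pos, BOUNDARY_TYPES[label])
```

The labels produced this way come for free from the synthesis procedure itself, which is why no manual boundary annotation is needed: the pretext classifier is trained directly on (frames, boundary-type) pairs.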
