TempCLR: Temporal Alignment Representation with Contrastive Learning

Paper Title

TempCLR: Temporal Alignment Representation with Contrastive Learning

Authors

Yuncong Yang, Jiawei Ma, Shiyuan Huang, Long Chen, Xudong Lin, Guangxing Han, Shih-Fu Chang

Abstract

Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to its paired video clips in a common feature space. For long videos, given a descriptive paragraph whose sentences describe different segments of the video, matching all sentence-clip pairs aligns the paragraph and the full video only implicitly. However, such unit-level comparison may ignore global temporal context, which inevitably limits the generalization ability. In this paper, we propose a contrastive learning framework, TempCLR, to compare the full video and the paragraph explicitly. As the video/paragraph is formulated as a sequence of clips/sentences, under the constraint of their temporal order, we use dynamic time warping to compute the minimum cumulative cost over sentence-clip pairs as the sequence-level distance. To explore the temporal dynamics, we break the consistency of temporal succession by shuffling video clips w.r.t. temporal granularity. We then obtain representations for clips/sentences that perceive the temporal information and thus facilitate sequence alignment. In addition to pre-training on videos and paragraphs, our approach also generalizes to matching between video instances. We evaluate our approach on video retrieval, action step localization, and few-shot action recognition, and achieve consistent performance gains on all three tasks. Detailed ablation studies are provided to justify the approach design.
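The sequence-level distance described in the abstract can be illustrated with a minimal sketch of dynamic time warping over sentence-clip costs. This is not the authors' implementation: the abstract does not specify the per-pair cost, the allowed warping steps, or the contrastive loss, so the cosine distance, the standard three-step recurrence, the toy embeddings, and the names `cosine_distance` and `dtw_distance` are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Pairwise cost between two embeddings (assumed: 1 - cosine similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(sentences, clips):
    """Minimum cumulative cost of aligning two sequences in temporal order.

    acc[i][j] holds the cheapest cost of aligning the first i sentences
    with the first j clips using the classic DTW recurrence.
    """
    n, m = len(sentences), len(clips)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = cosine_distance(sentences[i - 1], clips[j - 1])
            acc[i][j] = cost + min(
                acc[i - 1][j],      # same clip, next sentence
                acc[i][j - 1],      # same sentence, next clip
                acc[i - 1][j - 1],  # advance both
            )
    return acc[n][m]


# Toy embeddings (hypothetical): three sentences, four clips,
# where the first sentence spans the first two clips.
sentences = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
clips = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Shuffling the clips breaks temporal succession and yields a negative sample.
shuffled_clips = [clips[3], clips[2], clips[0], clips[1]]

ordered = dtw_distance(sentences, clips)
shuffled = dtw_distance(sentences, shuffled_clips)
print(ordered, shuffled)
```

In this toy case the correctly ordered video aligns with the paragraph at a lower cumulative cost than the shuffled one, which is the kind of gap a sequence-level contrastive objective could exploit. How the shuffled sequences enter the actual training loss is not stated in the abstract.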
