Paper Title

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

Paper Authors

Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, Xiaohu Qie

Paper Abstract

Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable common need in the current era of exploding video content. Nevertheless, jointly conducting moment retrieval and highlight detection is an emerging research topic, even though its component problems and some related tasks have already been studied for a while. In this paper, we present the first unified framework, named Unified Multi-modal Transformers (UMT), capable of realizing such joint optimization while also being easily degenerated to solve the individual problems. To the best of our knowledge, this is the first scheme to integrate multi-modal (visual-audio) learning for either joint optimization or the individual moment retrieval task, and to tackle moment retrieval as a keypoint detection problem using a novel query generator and query decoder. Extensive comparisons with existing methods and ablation studies on the QVHighlights, Charades-STA, YouTube Highlights, and TVSum datasets demonstrate the effectiveness, superiority, and flexibility of the proposed method under various settings. Source code and pre-trained models are available at https://github.com/TencentARC/UMT.
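
The abstract describes a visual-audio fusion backbone plus a query generator and query decoder that cast moment retrieval as keypoint detection while also predicting per-clip highlight (saliency) scores. Below is a minimal, hypothetical PyTorch sketch of that pipeline for orientation only: all module names, feature dimensions, head designs, and the omission of the natural-language query encoder are illustrative assumptions, not the authors' released implementation (see the linked repository for the actual code).

```python
# Hypothetical sketch of the UMT-style pipeline from the abstract:
# fuse visual and audio clip features, generate adaptive decoder queries,
# and predict per-clip saliency plus per-query moment keypoints.
# Dimensions and module choices are assumptions, not the official model.
import torch
import torch.nn as nn


class UMTSketch(nn.Module):
    def __init__(self, visual_dim=2304, audio_dim=128, hidden_dim=256,
                 num_queries=10, num_layers=2, num_heads=8):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # Multi-modal fusion approximated here by a single transformer
        # encoder over the summed per-clip features (assumption).
        enc_layer = nn.TransformerEncoderLayer(hidden_dim, num_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Query generator: learnable seeds attend to fused clip features
        # to produce adaptive decoder queries (assumption).
        self.query_seed = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.query_gen = nn.MultiheadAttention(hidden_dim, num_heads,
                                               batch_first=True)
        # Query decoder refines the queries against the fused memory.
        dec_layer = nn.TransformerDecoderLayer(hidden_dim, num_heads,
                                               batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Heads: per-clip saliency for highlight detection, and per-query
        # (center, width) regression treating moments as keypoints.
        self.saliency_head = nn.Linear(hidden_dim, 1)
        self.moment_head = nn.Linear(hidden_dim, 2)

    def forward(self, visual, audio):
        # visual: (B, T, visual_dim), audio: (B, T, audio_dim)
        fused = self.encoder(self.visual_proj(visual) + self.audio_proj(audio))
        saliency = self.saliency_head(fused).squeeze(-1)            # (B, T)
        seeds = self.query_seed.unsqueeze(0).expand(fused.size(0), -1, -1)
        queries, _ = self.query_gen(seeds, fused, fused)             # (B, Q, D)
        decoded = self.decoder(queries, fused)
        moments = self.moment_head(decoded).sigmoid()                # (B, Q, 2)
        return saliency, moments


if __name__ == "__main__":
    model = UMTSketch()
    v = torch.randn(2, 75, 2304)   # e.g. 75 clips of video features
    a = torch.randn(2, 75, 128)    # paired audio features
    saliency, moments = model(v, a)
    print(saliency.shape, moments.shape)  # (2, 75) and (2, 10, 2)
```

In this sketch the same fused representation feeds both tasks, which is one plausible way to read the "joint optimization" claim; dropping either head would recover a single-task variant, mirroring how the abstract says UMT can be degenerated to solve the individual problems.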
