Paper Title

Multi-modal Transformer for Video Retrieval

Paper Authors

Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

Paper Abstract

The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at http://thoth.inrialpes.fr/research/MMT.
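The abstract describes a transformer that jointly encodes the different modalities of a video so that each modality can attend to the others, with temporal information also encoded by the transformer. Below is a minimal sketch of that idea, not the authors' released implementation: it assumes PyTorch, and the module name `MultiModalVideoEncoder`, the feature dimensions, and the learned modality/temporal embeddings are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a Transformer encoder over the
# concatenated feature sequences of several video modalities, so that tokens
# from one modality can attend to the others and across time.
# Module names, dimensions and embedding choices are illustrative assumptions.
import torch
import torch.nn as nn

class MultiModalVideoEncoder(nn.Module):
    def __init__(self, feature_dims, d_model=512, n_heads=8, n_layers=4, max_len=64):
        super().__init__()
        # One linear projection per modality (e.g. appearance, motion, audio features)
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in feature_dims])
        # Learned embeddings marking which modality and which time step a token comes from
        self.modality_emb = nn.Embedding(len(feature_dims), d_model)
        self.temporal_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, features):
        # features: list of tensors, one per modality, each of shape (batch, time, feature_dim)
        tokens = []
        for m, (feat, proj) in enumerate(zip(features, self.proj)):
            t = torch.arange(feat.size(1), device=feat.device)
            x = proj(feat) + self.modality_emb.weight[m] + self.temporal_emb(t)
            tokens.append(x)
        # Concatenating all modalities into one sequence lets self-attention mix
        # information across modalities as well as across time.
        return self.encoder(torch.cat(tokens, dim=1))

# Usage: two modalities (2048-d appearance, 1024-d motion), 16 frames each
encoder = MultiModalVideoEncoder([2048, 1024])
video_embedding = encoder([torch.randn(2, 16, 2048), torch.randn(2, 16, 1024)])
print(video_embedding.shape)  # torch.Size([2, 32, 512])
```

For caption-to-video retrieval as described in the abstract, the output tokens would be pooled into a video representation and compared (e.g. by similarity score) against an embedding of the natural language query; the pooling and language-embedding details above are left out and would follow the paper.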
