Paper Title

Learning Visual Representations with Caption Annotations

Authors

Mert Bulent Sariyildiz, Julien Perez, Diane Larlus

Abstract

Pretraining general-purpose visual features has become a crucial part of tackling many computer vision tasks. While one can learn such features on the extensively-annotated ImageNet dataset, recent approaches have looked at ways to allow for noisy, fewer, or even no annotations to perform such pretraining. Starting from the observation that captioned images are easily crawlable, we argue that this overlooked source of information can be exploited to supervise the training of visual representations. To do so, motivated by recent progress in language models, we introduce {\em image-conditioned masked language modeling} (ICMLM) -- a proxy task to learn visual representations over image-caption pairs. ICMLM consists in predicting masked words in captions by relying on visual cues. To tackle this task, we propose hybrid models, with dedicated visual and textual encoders, and we show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks. Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations. Project website: https://europe.naverlabs.com/icmlm.
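
The abstract describes ICMLM only at a high level. As a rough illustration, the sketch below shows one way an image-conditioned masked language modeling objective could be wired up in PyTorch: a visual encoder and a textual encoder produce features that are fused so that masked caption words are predicted from both the surrounding words and the image, and the gradient of the masked-word loss trains the visual backbone. Every concrete choice in the snippet (the ResNet-18 backbone, the two-layer Transformer text encoder, the hidden size, the vocabulary and [MASK] token ids, and the "prepend the image as an extra token" fusion) is an illustrative assumption, not the architecture reported in the paper.

```python
# Minimal sketch of an image-conditioned masked language modeling (ICMLM) objective.
# NOT the authors' exact model: backbone, text encoder, fusion, and all sizes below
# are illustrative assumptions only.
import torch
import torch.nn as nn
import torchvision.models as tvm

VOCAB_SIZE = 30522   # assumed vocabulary size (e.g., a BERT-style tokenizer)
MASK_ID = 103        # assumed id of the [MASK] token
HIDDEN = 256         # assumed shared hidden size


class ICMLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Visual encoder: any CNN backbone; here a randomly initialized ResNet-18.
        backbone = tvm.resnet18(weights=None)
        backbone.fc = nn.Identity()                 # keep the 512-d pooled feature
        self.visual_encoder = backbone
        self.visual_proj = nn.Linear(512, HIDDEN)

        # Textual encoder: token embeddings + a small Transformer encoder.
        self.token_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)

        # Prediction head over the vocabulary for masked positions.
        self.mlm_head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, images, masked_caption_ids):
        # images: (B, 3, H, W); masked_caption_ids: (B, T) with some tokens = MASK_ID
        img_feat = self.visual_proj(self.visual_encoder(images))   # (B, HIDDEN)
        tok_feat = self.token_emb(masked_caption_ids)              # (B, T, HIDDEN)
        # Condition the caption on the image by prepending the image feature as an
        # extra "visual token" (one simple fusion choice among many possible ones).
        fused = torch.cat([img_feat.unsqueeze(1), tok_feat], dim=1)  # (B, 1+T, HIDDEN)
        fused = self.text_encoder(fused)[:, 1:]                      # drop visual slot
        return self.mlm_head(fused)                                  # (B, T, VOCAB_SIZE)


if __name__ == "__main__":
    model = ICMLM()
    images = torch.randn(2, 3, 224, 224)
    captions = torch.randint(0, VOCAB_SIZE, (2, 12))
    mask_pos = torch.zeros_like(captions, dtype=torch.bool)
    mask_pos[:, 3] = True                                  # mask one word per caption
    targets = captions.clone()
    captions = captions.masked_fill(mask_pos, MASK_ID)
    logits = model(images, captions)
    # Cross-entropy only on masked positions; the gradient flows into the visual
    # encoder, whose representation is what the proxy task is meant to train.
    loss = nn.functional.cross_entropy(logits[mask_pos], targets[mask_pos])
    loss.backward()
    print(loss.item())
```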
