Paper Title

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Paper Authors

Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi

Paper Abstract

Vision-Language Pre-training (VLP) has advanced the performance for many vision-language tasks. However, most existing pre-trained models only excel in either understanding-based tasks or generation-based tasks. Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision. In this paper, we propose BLIP, a new VLP framework which transfers flexibly to both vision-language understanding and generation tasks. BLIP effectively utilizes the noisy web data by bootstrapping the captions, where a captioner generates synthetic captions and a filter removes the noisy ones. We achieve state-of-the-art results on a wide range of vision-language tasks, such as image-text retrieval (+2.7% in average recall@1), image captioning (+2.8% in CIDEr), and VQA (+1.6% in VQA score). BLIP also demonstrates strong generalization ability when directly transferred to video-language tasks in a zero-shot manner. Code, models, and datasets are released at https://github.com/salesforce/BLIP.
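To make the caption-bootstrapping step concrete, below is a minimal Python sketch of the idea described in the abstract: a captioner produces a synthetic caption for each web image, and a filter keeps only the image-text pairs it judges to match. The `WebPair` structure, the `captioner`/`filter_fn` callables, and the `bootstrap_captions` function are hypothetical placeholders for illustration, not the released BLIP API.

```python
# Minimal sketch of the caption-bootstrapping (CapFilt) idea from the abstract.
# All names here are illustrative stand-ins, not the released BLIP interfaces.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class WebPair:
    image: object   # raw image (e.g., a PIL.Image in practice)
    web_text: str   # noisy alt-text collected from the web


def bootstrap_captions(
    pairs: List[WebPair],
    captioner: Callable[[object], str],        # image -> synthetic caption
    filter_fn: Callable[[object, str], bool],  # (image, text) -> keep this pair?
) -> List[Tuple[object, str]]:
    """Build a cleaned image-text dataset: for each web image, generate a
    synthetic caption, then keep the web text and/or the synthetic caption
    only if the filter judges it as matching the image."""
    cleaned: List[Tuple[object, str]] = []
    for pair in pairs:
        synthetic = captioner(pair.image)
        for text in (pair.web_text, synthetic):
            if filter_fn(pair.image, text):
                cleaned.append((pair.image, text))
    return cleaned


# Toy usage with trivial stand-ins for the captioner and filter.
pairs = [WebPair(image="img0", web_text="a dog on the beach")]
dataset = bootstrap_captions(
    pairs,
    captioner=lambda img: "a brown dog running on the sand",
    filter_fn=lambda img, text: "dog" in text,
)
```

In the paper, the captioner and filter are finetuned heads of the same pre-trained model (captioning and image-text matching, respectively); the resulting cleaned dataset, together with the human-annotated pairs, is then used to pre-train a new model.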
