Paper Title

Image Difference Captioning with Pre-training and Contrastive Learning

Paper Authors

Linli Yao, Weiying Wang, Qin Jin

Paper Abstract

The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images in natural language. The major challenges of this task lie in two aspects: 1) fine-grained visual differences, which require learning stronger vision-language associations, and 2) the high cost of manual annotation, which leads to limited supervised data. To address these challenges, we propose a new modeling framework following the pre-training and fine-tuning paradigm. Specifically, we design three self-supervised tasks and contrastive learning strategies to align visual differences and text descriptions at a fine-grained level. Moreover, we propose a data expansion strategy to utilize extra cross-task supervision, such as data for fine-grained image classification, to alleviate the shortage of available supervised IDC data. Extensive experiments on two IDC benchmark datasets, CLEVR-Change and Birds-to-Words, demonstrate the effectiveness of the proposed modeling framework. The code and models will be released at https://github.com/yaolinli/IDC.
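
The paper's exact pre-training objectives are described in the full text; as a rough illustration of what "aligning visual differences and text descriptions" with contrastive learning can look like, the sketch below implements a generic symmetric InfoNCE loss between difference embeddings and caption embeddings. The function name, embedding dimension, and temperature value are illustrative assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(diff_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed, generic formulation).

    Aligns image-difference embeddings with caption embeddings: matched
    (difference, caption) pairs share an index in the batch, and all
    other pairs in the batch serve as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    diff_feats = F.normalize(diff_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Pairwise similarity matrix scaled by temperature: shape (B, B).
    logits = diff_feats @ text_feats.t() / temperature

    # The i-th difference embedding matches the i-th caption embedding.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: difference-to-text and text-to-difference.
    loss_d2t = F.cross_entropy(logits, targets)
    loss_t2d = F.cross_entropy(logits.t(), targets)
    return (loss_d2t + loss_t2d) / 2

# Example with a batch of 8 pairs and 256-dim embeddings (both hypothetical):
if __name__ == "__main__":
    diff = torch.randn(8, 256)  # e.g., encoder output for a (before, after) image pair
    text = torch.randn(8, 256)  # e.g., caption encoder output
    print(contrastive_alignment_loss(diff, text).item())
```

With in-batch negatives, the loss pushes each matched (difference, caption) pair to score higher than every mismatched pair, which is the standard mechanism behind contrastive vision-language alignment.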
