Paper Title

Fine-Grained Image Captioning with Global-Local Discriminative Objective

Paper Authors

Jie Wu, Tianshui Chen, Hefeng Wu, Zhi Yang, Guangchun Luo, Liang Lin

Paper Abstract

Significant progress has been made in recent years in image captioning, an active topic in the fields of vision and language. However, existing methods tend to yield overly general captions composed of some of the most frequent words/phrases, resulting in inaccurate and indistinguishable descriptions (see Figure 1). This is primarily due to (i) the conservative characteristic of traditional training objectives, which drives the model to generate correct but hardly discriminative captions for similar images, and (ii) the uneven word distribution of the ground-truth captions, which encourages generating highly frequent words/phrases while suppressing the less frequent but more concrete ones. In this work, we propose a novel global-local discriminative objective that is formulated on top of a reference model to facilitate generating fine-grained descriptive captions. Specifically, from a global perspective, we design a novel global discriminative constraint that pulls the generated sentence to better discern the corresponding image from all others in the entire dataset. From a local perspective, a local discriminative constraint is proposed to increase attention on the less frequent but more concrete words/phrases, thus facilitating the generation of captions that better describe the visual details of the given images. We evaluate the proposed method on the widely used MS-COCO dataset, where it outperforms the baseline methods and achieves competitive performance compared with existing leading approaches. We also conduct self-retrieval experiments to demonstrate the discriminability of the proposed method.
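The two constraints summarized in the abstract can be sketched as simple scoring terms. The following is a minimal Python illustration, not the paper's actual formulation: `sim_pos`/`sim_negs` stand for hypothetical caption-image similarity scores, and the negative-log-frequency weighting is an assumed stand-in for how the local constraint could emphasize rare, concrete words.

```python
import math

def global_discriminative_loss(sim_pos, sim_negs, margin=0.2):
    """Hinge-style contrastive term (illustrative): the generated caption
    should match its own image (sim_pos) better, by at least `margin`,
    than it matches any other image in the dataset (sim_negs)."""
    return sum(max(0.0, margin - sim_pos + s) for s in sim_negs)

def local_word_weights(tokens, corpus_freq, alpha=1.0):
    """Assumed re-weighting for the local constraint: rarer (more
    concrete) words receive larger weights via negative log frequency."""
    total = sum(corpus_freq.values())
    return [1.0 + alpha * -math.log(corpus_freq.get(w, 1) / total)
            for w in tokens]

# Example: a caption that also matches a near-duplicate image incurs a
# penalty, and the rare word "giraffe" is weighted above the frequent "a".
loss = global_discriminative_loss(0.9, sim_negs=[0.5, 0.95], margin=0.2)
weights = local_word_weights(["a", "giraffe"],
                             {"a": 100, "giraffe": 2, "standing": 10})
```

In this sketch, captions that are ambiguous between similar images raise the global term, while the local weights can scale a per-word training loss so that concrete content words contribute more than frequent filler words.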
