Paper Title
Unpaired Image Captioning by Image-level Weakly-Supervised Visual Concept Recognition
Paper Authors
Paper Abstract
The goal of unpaired image captioning (UIC) is to describe images without using image-caption pairs in the training phase. Although challenging, we expect the task can be accomplished by leveraging a training set of images aligned with visual concepts. Most existing studies use off-the-shelf algorithms to obtain the visual concepts because the Bounding Box (BBox) labels or relationship-triplet labels required for training are expensive to acquire. To address the problem of expensive annotation, we propose a novel approach to achieve cost-effective UIC. Specifically, we adopt image-level labels to optimize the UIC model in a weakly-supervised manner. For each image, we assume that only image-level labels are available, without specific locations or counts. The image-level labels are used to train a weakly-supervised object recognition model that extracts object information (e.g., instances) from an image, and the extracted instances are then used to infer the relationships among different objects with an enhanced graph neural network (GNN). The proposed approach achieves comparable or even better performance than previous methods without their expensive annotation cost. Furthermore, we design an unrecognized object (UnO) loss combined with a visual concept reward to improve the alignment of the inferred object and relationship information with the images. It effectively alleviates the issue of generating sentences that mention nonexistent objects, which existing UIC models commonly suffer from. To the best of our knowledge, this is the first attempt to solve the problem of weakly-supervised visual concept recognition for UIC (WS-UIC) based only on image-level labels. Extensive experiments demonstrate that the proposed WS-UIC model achieves inspiring results on the COCO dataset while significantly reducing the cost of labeling.
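For intuition, below is a minimal, hypothetical sketch of the pipeline the abstract describes: a multi-label recognizer trained only with image-level labels, a GNN message-passing step that infers relationships among the recognized objects, and a penalty on caption words that refer to unrecognized objects. It is not the authors' implementation; all module names, dimensions, and the exact form of the UnO loss are illustrative assumptions.

```python
# Illustrative sketch of the WS-UIC components (NOT the paper's code).
# All names, dimensions, and the loss formulation are hypothetical placeholders.
import torch
import torch.nn as nn


class WeaklySupervisedRecognizer(nn.Module):
    """Multi-label object recognition trained only with image-level labels."""

    def __init__(self, feat_dim=2048, num_classes=80):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, image_feats):
        # image_feats: (batch, feat_dim) global image features
        return torch.sigmoid(self.classifier(image_feats))  # per-class presence scores


class RelationGNN(nn.Module):
    """One message-passing step over object nodes to infer pairwise relationships."""

    def __init__(self, node_dim=256):
        super().__init__()
        self.msg = nn.Linear(node_dim, node_dim)
        self.update = nn.GRUCell(node_dim, node_dim)

    def forward(self, nodes, adj):
        # nodes: (num_objects, node_dim); adj: (num_objects, num_objects) soft adjacency
        messages = adj @ self.msg(nodes)      # aggregate messages from neighboring objects
        return self.update(messages, nodes)   # updated object states carrying relation context


def unrecognized_object_loss(caption_word_probs, recognized_mask):
    """Penalize probability mass placed on object words not recognized in the image.

    caption_word_probs: (vocab,) predicted word distribution at one decoding step.
    recognized_mask: (vocab,) 1 for object words recognized in the image, 0 otherwise.
    (Hypothetical form; the paper's exact UnO loss may differ.)
    """
    unrecognized = caption_word_probs * (1.0 - recognized_mask)
    return unrecognized.sum()


# Toy usage with random tensors
recognizer, gnn = WeaklySupervisedRecognizer(), RelationGNN()
scores = recognizer(torch.randn(4, 2048))                                   # image-level object scores
nodes = gnn(torch.randn(5, 256), torch.softmax(torch.randn(5, 5), dim=-1))  # relation-aware object states
loss = unrecognized_object_loss(torch.softmax(torch.randn(100), 0), torch.zeros(100))
```

The key idea this sketch tries to convey is that no BBox or relationship-triplet supervision appears anywhere: object presence is learned from image-level labels alone, relations are inferred from the recognized objects, and the UnO-style penalty discourages the captioner from mentioning objects the recognizer did not find.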