Paper Title
Distinctive Image Captioning via CLIP Guided Group Optimization
Paper Authors
Paper Abstract
Image captioning models are usually trained on human-annotated ground-truth captions, which can lead to accurate but generic captions. In this paper, we focus on generating distinctive captions that distinguish the target image from other similar images. To evaluate the distinctiveness of captions, we introduce a series of metrics that use the large-scale vision-language pre-trained model CLIP to quantify distinctiveness. To further improve the distinctiveness of captioning models, we propose a simple and effective training strategy that trains the model by comparing the target image with a group of similar images and optimizing the group embedding gap. Extensive experiments are conducted on various baseline models to demonstrate the wide applicability of our strategy and the consistency of the metric results with human evaluation. By comparing the performance of our best model with existing state-of-the-art models, we show that our model achieves a new state of the art on the distinctiveness objective.
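The core idea of a CLIP-based distinctiveness score can be sketched numerically: a caption should be more similar to its target image than to a group of visually similar images in a shared embedding space. The sketch below is an illustrative assumption, not the paper's exact metric; it takes precomputed CLIP-style embedding vectors as plain arrays, so the function name `distinctiveness_score` and the toy vectors are hypothetical stand-ins.

```python
import numpy as np

def distinctiveness_score(caption_emb, target_emb, similar_embs):
    """Score a caption by how much more it matches the target image
    than a group of similar images, using cosine similarity over
    embedding vectors (illustrative sketch, not the paper's metric)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    target_sim = cos(caption_emb, target_emb)
    # Average similarity of the caption to the group of similar images.
    group_sim = np.mean([cos(caption_emb, e) for e in similar_embs])
    # Positive score: the caption is closer to the target than to the group,
    # i.e. it is distinctive; the training strategy widens this gap.
    return target_sim - group_sim
```

A caption whose embedding aligns with the target but not with the distractor group receives a positive score, which is the quantity a group-optimization objective would push up during training.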