Paper Title

Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models

Paper Authors

Pranav Agarwal, Alejandro Betancourt, Vana Panagiotou, Natalia Díaz-Rodríguez

Paper Abstract

Image captioning models have been able to generate grammatically correct and human-understandable sentences. However, most of the captions convey limited information, because the models used are trained on datasets that do not caption all possible objects existing in everyday life. Due to this lack of prior information, most of the captions are biased toward only a few objects present in the scene, hence limiting their usage in daily life. In this paper, we attempt to show the biased nature of currently existing image captioning models and present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions. We further exploit state-of-the-art pre-trained image captioning and object recognition networks to annotate our images and show the limitations of existing works. Furthermore, in order to evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF). Existing image captioning metrics can evaluate a caption only in the presence of its corresponding annotations; however, SF allows evaluating captions generated for images without annotations, making it highly useful for real-life generated captions.
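The abstract does not spell out the exact SF formulation. As a minimal sketch, assuming SF rewards captions that mention the object labels returned by a pre-trained detector (so no ground-truth captions are needed), an annotation-free score could look like the following; the function name and detector output below are illustrative, not the paper's definition:

```python
# Illustrative sketch of an object-based Semantic Fidelity (SF) score.
# Assumption: SF is proportional to the fraction of detector-found object
# labels that appear in the generated caption, requiring no reference captions.

def semantic_fidelity(caption: str, detected_objects: list[str]) -> float:
    """Fraction of detected object labels mentioned in the caption."""
    if not detected_objects:
        return 0.0
    caption_words = set(caption.lower().split())
    mentioned = sum(1 for obj in detected_objects if obj.lower() in caption_words)
    return mentioned / len(detected_objects)


# Usage: labels would come from a pre-trained object recognition network.
caption = "a person holding a cup at a table"
objects = ["person", "cup", "laptop"]  # hypothetical detector output
print(semantic_fidelity(caption, objects))  # 2 of 3 objects mentioned -> ~0.67
```

Under this reading, a caption that ignores most detected objects scores low even if it is grammatically fluent, which matches the bias the paper aims to expose.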
