Paper Title

Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark

Paper Authors

Vitali Petsiuk, Alexander E. Siemenn, Saisamrit Surbehera, Zad Chin, Keith Tyser, Gregory Hunter, Arvind Raghavan, Yann Hicke, Bryan A. Plummer, Ori Kerret, Tonio Buonassisi, Kate Saenko, Armando Solar-Lezama, Iddo Drori

Paper Abstract

We provide a new multi-task benchmark for evaluating text-to-image models. We perform a human evaluation comparing the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI graduate students evaluated the two models on three tasks, at three difficulty levels, across ten prompts each, providing 3,600 ratings. Text-to-image generation has seen rapid progress to the point that many recent models have demonstrated their ability to create realistic high-resolution images for various prompts. However, current text-to-image methods and the broader body of research in vision-language understanding still struggle with intricate text prompts that contain many objects with multiple attributes and relationships. We introduce a new text-to-image benchmark that contains a suite of thirty-two tasks over multiple applications that capture a model's ability to handle different features of a text prompt. For example, we ask a model to generate a varying number of the same object to measure its ability to count, or provide a text prompt with several objects, each with a different attribute, to measure its ability to match objects and attributes correctly. Rather than subjectively evaluating text-to-image results on a set of prompts, our new multi-task benchmark consists of challenge tasks at three difficulty levels (easy, medium, and hard) and human ratings for each generated image.
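The 3,600 ratings follow from the factorial design stated in the abstract: 20 raters x 2 models x 3 tasks x 3 difficulty levels x 10 prompts. Below is a minimal Python sketch that enumerates these rating cells and checks the total. The task names are placeholders (the abstract does not name the three tasks), and the code is illustrative only, not the authors' evaluation pipeline.

```python
from itertools import product

# Factorial design from the abstract:
# 20 raters x 2 models x 3 tasks x 3 difficulty levels x 10 prompts = 3,600 ratings.
raters = range(20)                       # twenty graduate-student evaluators
models = ["Stable Diffusion", "DALL-E 2"]
tasks = ["task_1", "task_2", "task_3"]   # placeholder names; the abstract does not name the tasks
difficulties = ["easy", "medium", "hard"]
prompts = range(10)                      # ten prompts per (task, difficulty) cell

# Each tuple is one (rater, model, task, difficulty, prompt) rating cell.
cells = list(product(raters, models, tasks, difficulties, prompts))
assert len(cells) == 3_600               # matches the 3,600 ratings reported
print(f"total ratings collected: {len(cells)}")
```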
