Paper Title
CAISE: Conversational Agent for Image Search and Editing
Paper Authors
Paper Abstract
Demand for image editing has been increasing as users' desire for expression is also increasing. However, for most users, image editing tools are not easy to use, since these tools require certain expertise in photo effects and have complex interfaces. Hence, users might need someone to help edit their images, but having a personal dedicated human assistant for every user is impossible to scale. For that reason, an automated assistant system for image editing is desirable. Additionally, users want more image sources for diverse image editing tasks, and integrating an image search functionality into the editing tool is a potential remedy for this demand. Thus, we propose a dataset for an automated Conversational Agent for Image Search and Editing (CAISE). To our knowledge, this is the first dataset that provides conversational image search and editing annotations, where the agent holds a grounded conversation with users and helps them search and edit images according to their requests. To build such a system, we first collect image search and editing conversations between pairs of annotators. The assistant-annotators are equipped with a customized image search and editing tool to address the requests from the user-annotators. The functions that the assistant-annotators perform with the tool are recorded as executable commands, allowing the trained system to be used for real-world application execution. We also introduce a generator-extractor baseline model for this task, which can adaptively select the source of the next token (i.e., from the vocabulary or from the textual/visual contexts) for the executable command. This serves as a strong starting point while still leaving a large human-machine performance gap for future work. Our code and dataset are publicly available at: https://github.com/hyounghk/CAISE
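The generator-extractor baseline decides, at each decoding step, whether the next token of the executable command should be generated from a fixed vocabulary or extracted (copied) from the textual or visual context. Below is a minimal, hypothetical PyTorch sketch of such a pointer-generator-style decoding step; the module name, tensor shapes, and three-way gate are illustrative assumptions, not the paper's actual architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GeneratorExtractorSketch(nn.Module):
        """Hypothetical sketch: the next command token is either generated
        from a base vocabulary or copied from textual/visual context tokens,
        mixed by a learned soft source gate (NOT the paper's exact model)."""

        def __init__(self, hidden_dim: int, vocab_size: int):
            super().__init__()
            self.vocab_proj = nn.Linear(hidden_dim, vocab_size)  # generator head
            self.source_gate = nn.Linear(hidden_dim, 3)          # vocab / text / visual

        def forward(self, dec_state, text_ctx, vis_ctx, text_ids, vis_ids, vocab_size_ext):
            # dec_state: (B, H); text_ctx: (B, T, H); vis_ctx: (B, V, H)
            # text_ids / vis_ids: ids of context tokens in the extended vocabulary
            gen_dist = F.softmax(self.vocab_proj(dec_state), dim=-1)  # (B, |V|)

            # attention over context elements doubles as the copy distributions
            text_attn = F.softmax(torch.einsum("bh,bth->bt", dec_state, text_ctx), dim=-1)
            vis_attn = F.softmax(torch.einsum("bh,bvh->bv", dec_state, vis_ctx), dim=-1)

            # soft selection among the three token sources
            gate = F.softmax(self.source_gate(dec_state), dim=-1)  # (B, 3)

            # scatter every distribution into one extended vocabulary
            out = dec_state.new_zeros(dec_state.size(0), vocab_size_ext)
            out[:, : gen_dist.size(1)] += gate[:, 0:1] * gen_dist
            out.scatter_add_(1, text_ids, gate[:, 1:2] * text_attn)
            out.scatter_add_(1, vis_ids, gate[:, 2:3] * vis_attn)
            return out  # (B, vocab_size_ext) distribution over the next command token

    # toy usage with random inputs (batch=2, hidden=8, base vocab=20, extended vocab=30)
    model = GeneratorExtractorSketch(hidden_dim=8, vocab_size=20)
    dec = torch.randn(2, 8)
    text_ctx, vis_ctx = torch.randn(2, 5, 8), torch.randn(2, 3, 8)
    text_ids = torch.randint(20, 30, (2, 5))  # context tokens mapped past the base vocab
    vis_ids = torch.randint(20, 30, (2, 3))
    probs = model(dec, text_ctx, vis_ctx, text_ids, vis_ids, vocab_size_ext=30)
    print(probs.sum(dim=-1))  # each row sums to 1

Because each source distribution sums to one and the gate weights sum to one, the mixture is itself a valid probability distribution, which lets the model copy out-of-vocabulary search queries or visual entity names directly into the predicted command.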