Paper Title

Visuo-Linguistic Question Answering (VLQA) Challenge

Paper Authors

Shailaja Keyur Sampat, Yezhou Yang, Chitta Baral

Paper Abstract

Understanding images and text together is an important aspect of cognition and of building advanced Artificial Intelligence (AI) systems. As a community, we have achieved good benchmarks over the language and vision domains separately; however, joint reasoning is still a challenge for state-of-the-art computer vision and natural language processing (NLP) systems. We propose a novel task to derive joint inference about a given image-text modality and compile the Visuo-Linguistic Question Answering (VLQA) challenge corpus in a question answering setting. Each dataset item consists of an image and a reading passage, where questions are designed to combine both visual and textual information, i.e., ignoring either modality would make the question unanswerable. We first explore the best existing vision-language architectures to solve VLQA subsets and show that they are unable to reason well. We then develop a modular method with slightly better baseline performance, but it still falls far short of human performance. We believe that VLQA will be a good benchmark for reasoning over a visuo-linguistic context. The dataset, code, and leaderboard are available at https://shailaja183.github.io/vlqa/.
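
To make the described dataset layout concrete, below is a minimal Python sketch of what a single VLQA item might look like. The class and field names (VLQAItem, image_path, passage, question, choices, answer) are assumptions chosen for illustration only and are not the official schema; the actual format is defined by the dataset release at https://shailaja183.github.io/vlqa/.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class VLQAItem:
        """One hypothetical VLQA item: the question is answerable only by
        combining the image with the reading passage (field names assumed,
        not the official schema)."""
        image_path: str     # associated image; dropping it makes the question unanswerable
        passage: str        # reading passage paired with the image
        question: str       # requires joint visuo-linguistic reasoning
        choices: List[str]  # candidate answers (multiple-choice setting)
        answer: int         # index of the correct choice

    # Invented example contents, for demonstration only:
    item = VLQAItem(
        image_path="bar_chart.png",
        passage="The chart shows annual rainfall for cities A, B, and C.",
        question="Which city mentioned in the passage has the tallest bar?",
        choices=["City A", "City B", "City C"],
        answer=1,
    )
    print(item.question)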
