Paper Title


Augmenting Visual Question Answering with Semantic Frame Information in a Multitask Learning Approach

Authors

Mehrdad Alizadeh, Barbara Di Eugenio

Abstract


Visual Question Answering (VQA) concerns providing answers to Natural Language questions about images. Several deep neural network approaches have been proposed to model the task in an end-to-end fashion. Whereas the task is grounded in visual processing, if the question focuses on events described by verbs, the language understanding component becomes crucial. Our hypothesis is that models should be aware of verb semantics, as expressed via semantic role labels, argument types, and/or frame elements. Unfortunately, no VQA dataset exists that includes verb semantic information. Our first contribution is a new VQA dataset (imSituVQA) that we built by taking advantage of the imSitu annotations. The imSitu dataset consists of images manually labeled with semantic frame elements, mostly taken from FrameNet. Second, we propose a multitask CNN-LSTM VQA model that learns to classify the answers as well as the semantic frame elements. Our experiments show that semantic frame element classification helps the VQA system avoid inconsistent responses and improves performance.
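The multitask idea in the abstract — a shared CNN-LSTM encoder feeding both an answer classifier and a semantic frame element classifier — can be sketched as follows. This is a minimal illustration, not the authors' exact architecture: all layer sizes, fusion choices, and label counts are illustrative assumptions.

```python
# Hedged sketch of a multitask CNN-LSTM VQA model: CNN image features are
# fused with an LSTM question encoding, and two task-specific heads share
# that representation (answers + semantic frame elements). Dimensions and
# the elementwise-product fusion are assumptions for illustration.
import torch
import torch.nn as nn

class MultitaskVQA(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128,
                 img_feat_dim=512, n_answers=100, n_frame_elements=190):
        super().__init__()
        # Question encoder: word embeddings followed by an LSTM
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project pre-extracted CNN image features into the question space
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Two heads on the shared fused representation
        self.answer_head = nn.Linear(hidden_dim, n_answers)
        self.frame_head = nn.Linear(hidden_dim, n_frame_elements)

    def forward(self, img_feats, question_ids):
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                                   # final LSTM hidden state
        fused = torch.tanh(self.img_proj(img_feats)) * torch.tanh(q)
        return self.answer_head(fused), self.frame_head(fused)

model = MultitaskVQA()
img = torch.randn(2, 512)                  # batch of 2 CNN feature vectors
ques = torch.randint(0, 1000, (2, 8))      # two 8-token question sequences
ans_logits, frame_logits = model(img, ques)
print(ans_logits.shape, frame_logits.shape)
```

At training time the two heads would each get a cross-entropy loss, summed (possibly with task weights); the shared encoder then receives gradients from both tasks, which is the mechanism by which frame element supervision can regularize the answer classifier.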
