Paper Title

Multi-View Attention Network for Visual Dialog

Authors

Sungjin Park, Taesun Whang, Yeochan Yoon, Heuiseok Lim

Abstract

Visual dialog is a challenging vision-language task in which a series of questions, visually grounded in a given image, are answered. To resolve the visual dialog task, a high-level understanding of various multimodal inputs (e.g., question, dialog history, and image) is required. Specifically, an agent must 1) determine the semantic intent of the question and 2) align question-relevant textual and visual content across heterogeneous modality inputs. In this paper, we propose the Multi-View Attention Network (MVAN), which leverages multiple views of the heterogeneous inputs based on attention mechanisms. MVAN effectively captures question-relevant information from the dialog history with two complementary modules (i.e., Topic Aggregation and Context Matching), and builds multimodal representations through a sequential alignment process (i.e., Modality Alignment). Experimental results on the VisDial v1.0 dataset show the effectiveness of our proposed model, which outperforms previous state-of-the-art methods with respect to all evaluation metrics.
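
The abstract describes question-guided attention over heterogeneous inputs (dialog history and image regions) followed by modality alignment. Below is a minimal, illustrative sketch of that general idea; the module names, tensor shapes, and single-head scaled dot-product attention are assumptions made for illustration only and do not reproduce the authors' actual MVAN implementation.

```python
# Illustrative sketch only: question-guided attention over dialog history,
# then alignment of the fused textual view with image-region features.
# All shapes and module names are assumptions, not the paper's design.
import torch
import torch.nn as nn


class QuestionGuidedAttention(nn.Module):
    """Attend over a set of context vectors using the question as the query."""

    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.key_proj = nn.Linear(dim, dim)

    def forward(self, question: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # question: (batch, dim), context: (batch, num_items, dim)
        q = self.query_proj(question).unsqueeze(1)            # (batch, 1, dim)
        k = self.key_proj(context)                             # (batch, num_items, dim)
        scores = torch.matmul(q, k.transpose(1, 2))            # (batch, 1, num_items)
        weights = torch.softmax(scores / k.size(-1) ** 0.5, dim=-1)
        return torch.matmul(weights, context).squeeze(1)       # (batch, dim)


class MultiViewSketch(nn.Module):
    """Toy pipeline: gather question-relevant history, then align it with image regions."""

    def __init__(self, dim: int):
        super().__init__()
        self.history_attn = QuestionGuidedAttention(dim)  # question -> dialog history
        self.image_attn = QuestionGuidedAttention(dim)     # fused text -> image regions
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, question, history, image_regions):
        history_ctx = self.history_attn(question, history)              # question-relevant history
        text_repr = self.fuse(torch.cat([question, history_ctx], -1))   # fused textual view
        visual_ctx = self.image_attn(text_repr, image_regions)          # question-relevant regions
        return torch.cat([text_repr, visual_ctx], dim=-1)               # joint multimodal representation


if __name__ == "__main__":
    batch, dim = 2, 64
    model = MultiViewSketch(dim)
    out = model(
        torch.randn(batch, dim),      # question embedding
        torch.randn(batch, 10, dim),  # 10 dialog-history utterance embeddings
        torch.randn(batch, 36, dim),  # 36 image-region features
    )
    print(out.shape)  # torch.Size([2, 128])
```

The sketch applies attention twice in sequence (text first, then vision), which is one simple way to read the "sequential alignment process" mentioned in the abstract; the paper's Topic Aggregation and Context Matching modules would replace the single history-attention step here.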
