Paper Title
Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System
Paper Authors
Paper Abstract
Understanding dynamic scenes and dialogue contexts in order to converse with users has been challenging for multimodal dialogue systems. The 8th Dialog System Technology Challenge (DSTC8) proposed an Audio Visual Scene-Aware Dialog (AVSD) task, which contains multiple modalities including audio, vision, and language, to evaluate how well dialogue systems understand different modalities and respond to users. In this paper, we propose a multi-step joint-modality attention network (JMAN), based on a recurrent neural network (RNN), to reason over videos. Our model performs a multi-step attention mechanism and jointly considers both visual and textual representations in each reasoning step to better integrate information from the two modalities. Compared to the baseline released by the AVSD organizers, our model achieves relative improvements of 12.1% on ROUGE-L score and 22.4% on CIDEr score.
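The abstract only outlines the mechanism; a minimal sketch of a multi-step joint-modality attention loop might look like the following. All function names, the dot-product attention, and the residual-style fusion here are illustrative assumptions, not the paper's actual formulation:

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, feats):
    # dot-product attention: weight each feature vector by its
    # similarity to the query and return the weighted sum
    weights = softmax([dot(query, f) for f in feats])
    return [sum(w * f[i] for w, f in zip(weights, feats))
            for i in range(len(query))]

def multi_step_joint_attention(question, visual_feats, text_feats, steps=3):
    # each reasoning step attends to BOTH modalities with the current
    # query, then refines the query with the two attended contexts
    # (residual fusion is an assumption for this sketch)
    query = list(question)
    for _ in range(steps):
        v_ctx = attend(query, visual_feats)
        t_ctx = attend(query, text_feats)
        query = [q + v + t for q, v, t in zip(query, v_ctx, t_ctx)]
    return query
```

The key point the sketch illustrates is that the visual and textual attentions share the same evolving query, so each step conditions one modality's reasoning on evidence already gathered from the other.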