Paper Title
Modality-Balanced Models for Visual Dialogue
Paper Authors
Paper Abstract
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue. However, via manual analysis, we find that a large number of conversational questions can be answered by only looking at the image without any access to the context history, while others still need the conversation context to predict the correct answers. We demonstrate that due to this reason, previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history (e.g., by extracting certain keywords or patterns in the context information), whereas image-only models are more generalizable (because they cannot memorize or extract keywords from history) and perform substantially better at the primary normalized discounted cumulative gain (NDCG) task metric which allows multiple correct answers. Hence, this observation encourages us to explicitly maintain two models, i.e., an image-only model and an image-history joint model, and combine their complementary abilities for a more balanced multimodal model. We present multiple methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters. Empirically, our models achieve strong results on the Visual Dialog challenge 2019 (rank 3 on NDCG and high balance across metrics), and substantially outperform the winner of the Visual Dialog challenge 2018 on most metrics.
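The abstract names two ways of combining the image-only and image-history models: a parameter-sharing ensemble and consensus dropout fusion. Below is a minimal sketch of the latter, assuming fusion operates on per-candidate answer scores (logits) from the two branches and that instance-level dropout is applied only to the image-history branch; the function name, drop rate, and averaging rule are illustrative assumptions, not the authors' released implementation.

```python
import torch

def consensus_dropout_fusion(logits_img, logits_img_hist, drop_rate=0.25, training=True):
    """Fuse image-only and image-history answer scores (a sketch, not the paper's exact code).

    logits_img, logits_img_hist: [batch, num_candidates] scores from the two branches.
    drop_rate: probability of dropping the image-history branch for an instance
               (hypothetical value; a real system would tune this on validation data).
    """
    if training:
        # Instance-level dropout: zero the entire history branch for a random
        # subset of examples so the fused model cannot lean only on dialogue history.
        keep = (torch.rand(logits_img_hist.size(0), 1,
                           device=logits_img_hist.device) > drop_rate).float()
        logits_img_hist = logits_img_hist * keep
    # Consensus: combine the two branches' scores (simple average here).
    return (logits_img + logits_img_hist) / 2

# Example: fuse scores for a batch of 2 questions over 100 answer candidates.
fused = consensus_dropout_fusion(torch.randn(2, 100), torch.randn(2, 100))
```

The intended effect, as described in the abstract, is a more balanced multimodal model: the image-history branch contributes when conversational context is needed, while randomly silencing it during training discourages over-reliance on memorized history patterns.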