Paper title
Invariant Grounding for Video Question Answering
Paper authors
Paper abstract
Video Question Answering (VideoQA) is the task of answering questions about a video. At its core is understanding the alignment between the visual scenes in the video and the linguistic semantics of the question in order to yield the answer. In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers and treats them as the alignment. However, ERM can be problematic because it tends to over-exploit spurious correlations between question-irrelevant scenes and answers instead of inspecting the causal effect of question-critical scenes. As a result, VideoQA models suffer from unreliable reasoning. In this work, we first take a causal look at VideoQA and argue that invariant grounding is the key to ruling out spurious correlations. To this end, we propose a new learning framework, Invariant Grounding for VideoQA (IGV), which grounds the question-critical scene, whose causal relation with the answer is invariant across different interventions on its complement. With IGV, VideoQA models are forced to shield the answering process from the negative influence of spurious correlations, which significantly improves their reasoning ability. Experiments on three benchmark datasets validate the superiority of IGV over the leading baselines in terms of accuracy, visual explainability, and generalization ability.