Paper Title
Contrast and Classify: Training Robust VQA Models
Paper Authors
Paper Abstract
Recent Visual Question Answering (VQA) models have shown impressive performance on the VQA benchmark but remain sensitive to small linguistic variations in input questions. Existing approaches address this by augmenting the dataset with question paraphrases from visual question generation models or adversarial perturbations. These approaches use the combined data to learn an answer classifier by minimizing the standard cross-entropy loss. To more effectively leverage augmented data, we build on the recent success in contrastive learning. We propose a novel training paradigm (ConClaT) that optimizes both cross-entropy and contrastive losses. The contrastive loss encourages representations to be robust to linguistic variations in questions while the cross-entropy loss preserves the discriminative power of representations for answer prediction. We find that optimizing both losses -- either alternately or jointly -- is key to effective training. On the VQA-Rephrasings benchmark, which measures the VQA model's answer consistency across human paraphrases of a question, ConClaT improves Consensus Score by 1.63% over an improved baseline. In addition, on the standard VQA 2.0 benchmark, we improve the VQA accuracy by 0.78% overall. We also show that ConClaT is agnostic to the type of data-augmentation strategy used.
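The following is a minimal sketch of how the joint objective described in the abstract could be composed, assuming a PyTorch setup. The function names, the NT-Xent form of the contrastive term, the temperature, and the weighting factor `lam` are illustrative assumptions, not details taken from the paper.

```python
# Sketch: joint cross-entropy + contrastive objective over (question, paraphrase) pairs.
# Assumes the model produces answer logits plus a pooled representation for the
# original question (z_orig) and its paraphrase (z_para) for each example in a batch.
import torch
import torch.nn.functional as F


def contrastive_loss(z_orig, z_para, temperature=0.1):
    """NT-Xent-style loss: pull each question's representation toward its
    paraphrase's representation and push it away from other questions in the batch."""
    z_orig = F.normalize(z_orig, dim=1)          # (B, D)
    z_para = F.normalize(z_para, dim=1)          # (B, D)
    logits = z_orig @ z_para.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(z_orig.size(0), device=z_orig.device)
    # Diagonal entries correspond to the positive (original, paraphrase) pairs.
    return F.cross_entropy(logits, targets)


def joint_loss(answer_logits, answer_targets, z_orig, z_para, lam=1.0):
    """Joint objective: answer-classification loss plus a weighted contrastive term
    that encourages paraphrase-invariant question representations."""
    # Soft answer targets with BCE, as is common in VQA classifiers (an assumption here).
    ce = F.binary_cross_entropy_with_logits(answer_logits, answer_targets)
    return ce + lam * contrastive_loss(z_orig, z_para)
```

The abstract also mentions alternating optimization as an option; in that variant one would step on the cross-entropy term and the contrastive term in separate iterations instead of summing them as above.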