Paper Title
Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations
Paper Authors
Paper Abstract
A major challenge in visually grounded language generation is to build robust benchmark datasets and models that can generalize well in real-world settings. To do this, it is critical to ensure that our evaluation protocols are correct, and benchmarks are reliable. In this work, we set forth to design a set of experiments to understand an important but often ignored problem in visually grounded language generation: given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance? Empirically, we study several multi-reference datasets and corresponding vision-and-language tasks. We show that it is of paramount importance to report variance in experiments; that human-generated references could vary drastically in different datasets/tasks, revealing the nature of each task; that metric-wise, CIDEr has shown systematically larger variances than others. Our evaluations on reference-per-instance shed light on the design of reliable datasets in the future.
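To make the per-instance sample-variance analysis concrete, below is a minimal sketch (not taken from the paper) of how one might score a generated caption against randomly drawn subsets of its human references and measure the resulting spread. The helper names (`metric_fn`, `variance_over_reference_subsets`) and the toy overlap score standing in for a real metric such as CIDEr or BLEU are illustrative assumptions, not the authors' implementation.

import random
import statistics

def metric_fn(hypothesis: str, references: list[str]) -> float:
    # Hypothetical placeholder: a crude word-overlap score.
    # Replace with a real multi-reference metric (e.g. CIDEr, BLEU) in practice.
    hyp_tokens = set(hypothesis.split())
    return max(
        len(hyp_tokens & set(ref.split())) / max(len(ref.split()), 1)
        for ref in references
    )

def variance_over_reference_subsets(hypothesis, references, k=3, trials=100, seed=0):
    """Score the same hypothesis against random k-reference subsets and
    report the mean and standard deviation of the scores, mimicking a
    reference-per-instance variance analysis."""
    rng = random.Random(seed)
    scores = [
        metric_fn(hypothesis, rng.sample(references, k))
        for _ in range(trials)
    ]
    return statistics.mean(scores), statistics.stdev(scores)

# Toy example with five human references for one image (illustrative only).
refs = [
    "a man rides a brown horse on the beach",
    "a person riding a horse along the shore",
    "someone on horseback near the ocean",
    "a rider and horse walking by the sea",
    "a man is riding a horse at the seaside",
]
print(variance_over_reference_subsets("a man riding a horse on the beach", refs))

Under this kind of setup, reporting the standard deviation alongside the mean shows how strongly a metric's score depends on which human references happen to be sampled, which is the effect the paper argues should be reported routinely.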