Paper Title
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Paper Authors
Paper Abstract
Current multimodal models, aimed at solving Vision and Language (V+L) tasks, predominantly repurpose Vision Encoders (VE) as feature extractors. While many VEs -- of different architectures, trained on different data and objectives -- are publicly available, they are not designed for the downstream V+L tasks. Nonetheless, most current work assumes that a \textit{single} pre-trained VE can serve as a general-purpose encoder. In this work, we focus on analysis and aim to understand whether the information stored within different VEs is complementary, i.e., whether providing the model with features from multiple VEs can improve the performance on a target task, and how these features are combined. We exhaustively experiment with three popular VEs on six downstream V+L tasks and analyze the attention and VE-dropout patterns. Our analyses suggest that diverse VEs complement each other, resulting in improved downstream V+L task performance, where the improvements are not due to simple ensemble effects (i.e., the performance does not always improve when the number of encoders is increased). We demonstrate that future VEs, which are not \textit{repurposed} but explicitly \textit{designed} for V+L tasks, have the potential to improve performance on the target V+L tasks.
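To make the multi-encoder setup concrete, below is a minimal, hypothetical PyTorch sketch of how features from several vision encoders could be projected into a shared space and concatenated as visual tokens, with an encoder-level dropout in the spirit of the VE-dropout analysis. The class, parameter names, and the simple projection-plus-concatenation scheme are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class MultiVEFusion(nn.Module):
    """Illustrative sketch (not the paper's implementation): fuse features
    from several frozen vision encoders before passing them to a V+L model."""

    def __init__(self, ve_dims, hidden_dim, ve_dropout_p=0.0):
        super().__init__()
        # One linear projection per vision encoder, mapping its feature
        # dimension into the shared hidden space of the multimodal model.
        self.projections = nn.ModuleList(
            [nn.Linear(d, hidden_dim) for d in ve_dims]
        )
        # Probability of dropping an entire encoder's tokens during training
        # ("VE-dropout"-style), so the model cannot rely on a single VE.
        self.ve_dropout_p = ve_dropout_p

    def forward(self, ve_features):
        # ve_features: list of tensors, one per encoder,
        # each of shape (batch, num_tokens_i, ve_dims[i]).
        projected = []
        for proj, feats in zip(self.projections, ve_features):
            h = proj(feats)
            if self.training and torch.rand(1).item() < self.ve_dropout_p:
                h = torch.zeros_like(h)  # drop this encoder for the batch
            projected.append(h)
        # Concatenate along the token axis; the multimodal transformer then
        # attends jointly over visual tokens coming from all encoders.
        return torch.cat(projected, dim=1)


# Usage example with three hypothetical encoders of different feature sizes.
fusion = MultiVEFusion(ve_dims=[768, 1024, 2048], hidden_dim=768, ve_dropout_p=0.1)
feats = [torch.randn(2, 49, 768), torch.randn(2, 36, 1024), torch.randn(2, 49, 2048)]
visual_tokens = fusion(feats)  # shape: (2, 134, 768)
```

Whether the extra encoders actually help (rather than acting as a plain ensemble) is exactly what the paper's attention and VE-dropout analyses probe.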