Title
Towards Unsupervised Visual Reasoning: Do Off-The-Shelf Features Know How to Reason?
Authors
Abstract
Recent advances in visual representation learning have enabled the building of an abundance of powerful off-the-shelf features that are ready to use for numerous downstream tasks. This work aims to assess how well these features preserve information about objects, such as their spatial location, their visual properties, and their relative relationships. We propose to do so by evaluating them in the context of visual reasoning, where multiple objects with complex relationships and different attributes are at play. More specifically, we introduce a protocol to evaluate visual representations for the task of Visual Question Answering. In order to decouple visual feature extraction from reasoning, we design a specific attention-based reasoning module which is trained on the frozen visual representations to be evaluated, in a spirit similar to standard feature evaluations relying on shallow networks. We compare two types of visual representations, densely extracted local features and object-centric ones, against the performance of a perfect image representation using ground truth. Our main findings are two-fold. First, despite excellent performance on classical proxy tasks, such representations fall short when solving complex reasoning problems. Second, object-centric features better preserve the critical information necessary to perform visual reasoning. Finally, we show how to approach this evaluation methodologically within our proposed framework.
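The evaluation protocol described above trains only a lightweight attention-based reasoning module on top of frozen visual features, so that any performance gap can be attributed to the representation rather than the reasoner. The following is a minimal sketch of the core attention step such a module might use, written in NumPy; the function name and shapes are hypothetical illustrations, not the authors' actual implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_readout(object_feats, question_emb):
    """One attention step of a hypothetical reasoning module.

    The frozen object-centric features are scored against a question
    embedding, and the resulting weights pool the features into a
    single vector that a shallow answer head could consume.
    """
    scores = object_feats @ question_emb        # (n_objects,) relevance scores
    weights = softmax(scores)                   # attention distribution over objects
    pooled = weights @ object_feats             # question-conditioned pooled feature
    return pooled, weights

# Toy example: 5 frozen 16-dim object features (no gradients would flow
# back into the feature extractor in the actual protocol).
rng = np.random.default_rng(0)
frozen_feats = rng.standard_normal((5, 16))
question = rng.standard_normal(16)
pooled, w = attention_readout(frozen_feats, question)
```

Keeping the reasoning module this shallow mirrors linear-probe evaluations: if the frozen features do not already encode object locations, attributes, and relations, no amount of training the small module can recover them.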