GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement

Paper Title

GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement

Authors

Cheng, Zhi-Qi, Dai, Qi, Li, Siyao, Mitamura, Teruko, Hauptmann, Alexander G.

Abstract

Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like" event understanding. Specifically, the GSR task not only detects the salient activity verb (e.g., buying) but also predicts all corresponding semantic roles (e.g., agent and goods). Inspired by object detection and image captioning tasks, existing methods typically employ a two-stage framework: 1) detect the activity verb, and then 2) predict semantic roles based on the detected verb. The authors argue that this ordering is a major obstacle to semantic understanding. First, detecting the verb before any semantic roles are known inevitably fails to distinguish many similar daily activities (e.g., offering and giving, buying and selling). Second, predicting semantic roles in a closed auto-regressive manner can hardly exploit the semantic relations between the verb and the roles. To this end, the paper proposes a novel two-stage framework that focuses on exploiting these bidirectional relations between verbs and roles. In the first stage, instead of pre-detecting the verb, the detection step is postponed and a pseudo label is assumed, and an intermediate representation for each corresponding semantic role is learned from the image. In the second stage, transformer layers are used to uncover the latent semantic relations among verbs and semantic roles. With the help of a set of support images, an alternate learning scheme optimizes both results together: the verb is updated using the nouns corresponding to the image, and the nouns are updated using verbs from the support images. Extensive experimental results on the challenging SWiG benchmark show that the revised framework outperforms other state-of-the-art methods under various metrics.
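
To make the "structured semantic summary" concrete, here is a minimal sketch of what a GSR output for one image might look like. It is not the authors' code. The class names `RoleFiller` and `Situation`, the nouns, and the box coordinates are all hypothetical, and representing grounding as one bounding box per role is an assumption based on how the task is usually described.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class RoleFiller:
    """The noun filling one semantic role, plus its (assumed) grounding box."""
    noun: str
    box: Optional[Box] = None  # None if the entity is not visible in the image


@dataclass
class Situation:
    """One activity verb together with its semantic roles."""
    verb: str
    roles: Dict[str, RoleFiller] = field(default_factory=dict)

    def summary(self) -> str:
        # Render the frame as verb(role=noun, ...), in insertion order.
        parts = [f"{role}={filler.noun}" for role, filler in self.roles.items()]
        return f"{self.verb}({', '.join(parts)})"


# Illustrative values only; the verb and role names come from the abstract's example.
example = Situation(
    verb="buying",
    roles={
        "agent": RoleFiller("woman", (34.0, 20.0, 180.0, 310.0)),
        "goods": RoleFiller("fruit", (150.0, 200.0, 260.0, 290.0)),
    },
)
print(example.summary())  # prints: buying(agent=woman, goods=fruit)
```

The structure also shows why the verb and the roles inform each other: swapping the verb to "selling" would keep the same entities but change which one fills the agent role.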
