Paper Title


ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network

Paper Authors

Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris

Paper Abstract


In this paper, a pure-attention bottom-up approach called ViGAT is proposed: it utilizes an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network to process these features for the task of event recognition and explanation in video. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spatial and temporal dimensions in order to capture effectively both local and long-term dependencies between objects or frames. Moreover, using the weighted in-degrees (WiDs) derived from the adjacency matrices at the various GAT blocks, we show that the proposed architecture can identify the most salient objects and frames that explain the decision of the network. A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets (FCVID, Mini-Kinetics, ActivityNet).
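The abstract packs two mechanisms into prose: a head built from GAT blocks factorized along the spatial and temporal dimensions, and weighted in-degrees (WiDs) of the blocks' attention adjacency matrices used as saliency scores for explanation. The PyTorch sketch below is a minimal illustration of both ideas, not the authors' implementation: the class names (SimpleGATBlock, FactorizedHead, weighted_in_degrees), the scaled dot-product attention form, the mean pooling, and all layer sizes are illustrative assumptions; the paper's actual GAT formulation and head configuration may differ.

```python
import torch
import torch.nn as nn


class SimpleGATBlock(nn.Module):
    """Hypothetical single graph attention block over a set of node
    features. It returns both the pooled graph representation and the
    attention adjacency matrix, so WiDs can be read off afterwards."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, num_nodes, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / x.size(-1) ** 0.5
        adj = scores.softmax(dim=-1)   # (batch, N, N) attention adjacency
        out = adj @ v                  # message passing over the graph
        pooled = out.mean(dim=1)       # (batch, dim) graph-level feature
        return pooled, adj


def weighted_in_degrees(adj: torch.Tensor) -> torch.Tensor:
    """WiD of node j = sum_i adj[i, j]: total attention all other nodes
    pay to node j. A larger WiD marks a more salient object or frame."""
    return adj.sum(dim=-2)             # (batch, N)


class FactorizedHead(nn.Module):
    """Head factorized along the spatial (objects within a frame) and
    temporal (frames within a video) dimensions, in the spirit of the
    ViGAT head; structure and sizes here are simplifying assumptions."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.spatial_gat = SimpleGATBlock(dim)   # over objects per frame
        self.temporal_gat = SimpleGATBlock(dim)  # over frames per video
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, obj_feats: torch.Tensor):
        # obj_feats: (batch, num_frames, num_objects, dim), assumed
        # precomputed by the object detector + ViT backbone.
        b, t, n, d = obj_feats.shape
        frame_feats, obj_adj = self.spatial_gat(obj_feats.view(b * t, n, d))
        video_feat, frame_adj = self.temporal_gat(frame_feats.view(b, t, d))
        logits = self.classifier(video_feat)
        # Explanation signals: rank frames and objects by their WiDs.
        frame_saliency = weighted_in_degrees(frame_adj)            # (b, t)
        obj_saliency = weighted_in_degrees(obj_adj).view(b, t, n)  # (b, t, n)
        return logits, frame_saliency, obj_saliency
```

One payoff of the factorization is cost: attention is computed over N objects per frame plus T frames per video, rather than over all N×T object tokens jointly, which is how the head can capture both local (within-frame) and long-term (across-frame) dependencies efficiently.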
