Contextual Explainable Video Representation: Human Perception-based Understanding

Paper Title

Contextual Explainable Video Representation: Human Perception-based Understanding

Authors

Vo, Khoa; Yamazaki, Kashu; Nguyen, Phong X.; Nguyen, Phat; Luu, Khoa; Le, Ngan

Abstract

Video understanding is a growing field and a subject of intense research that includes many interesting tasks for understanding both spatial and temporal information, e.g., action detection, action recognition, video captioning, and video retrieval. One of the most challenging problems in video understanding is feature extraction, i.e., extracting a contextual visual representation from a given untrimmed video, which is difficult because of the long and complicated temporal structure of unconstrained videos. Unlike existing approaches, which apply a pre-trained backbone network as a black box to extract visual representations, our approach aims to extract the most contextual information with an explainable mechanism. As we observed, humans typically perceive a video through the interactions between three main factors: the actors, the relevant objects, and the surrounding environment. It is therefore crucial to design a contextual explainable video representation extraction method that can capture each of these factors and model the relationships between them. In this paper, we discuss approaches that incorporate the human perception process into modeling actors, objects, and the environment. We choose video paragraph captioning and temporal action detection to illustrate the effectiveness of human perception-based contextual representation in video understanding. Source code is publicly available at https://github.com/UARK-AICV/Video_Representation.
