Paper Title

Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences

Authors

Rosaura G. VidalMata, Walter J. Scheirer, Anna Kukleva, David Cox, Hilde Kuehne

Abstract

Understanding the structure of complex activities in untrimmed videos is a challenging task in the area of action recognition. One problem here is that this task usually requires a large amount of hand-annotated minute- or even hour-long video data, but annotating such data is very time-consuming and cannot easily be automated or scaled. To address this problem, this paper proposes an approach for the unsupervised learning of actions in untrimmed video sequences based on a joint visual-temporal embedding space. To this end, we combine a visual embedding based on a predictive U-Net architecture with a temporal continuous function. The resulting representation space allows detecting relevant action clusters based on their visual as well as their temporal appearance. The proposed method is evaluated on three standard benchmark datasets, Breakfast Actions, INRIA YouTube Instructional Videos, and 50 Salads. We show that the proposed approach is able to provide a meaningful visual and temporal embedding from the visual cues present in contiguous video frames and is suitable for the task of unsupervised temporal segmentation of actions.
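To make the idea of a joint visual-temporal embedding concrete, below is a minimal sketch, not the authors' implementation. A toy convolutional encoder stands in for the predictive U-Net bottleneck, a small MLP over normalized frame timestamps stands in for the temporal continuous function, and the two embeddings are concatenated per frame before k-means clustering. All names, layer sizes, the concatenation scheme, and the cluster count are illustrative assumptions.

```python
# Sketch of a joint visual-temporal embedding for unsupervised action
# discovery. Everything here (VisualEncoder, TemporalEmbedding, dims,
# n_clusters) is hypothetical, standing in for the paper's components.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class VisualEncoder(nn.Module):
    """Toy stand-in for the predictive U-Net's encoder bottleneck."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, dim),
        )

    def forward(self, x):
        return self.net(x)

class TemporalEmbedding(nn.Module):
    """Maps a normalized timestamp t in [0, 1] into the embedding space,
    a stand-in for the paper's temporal continuous function."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, t):
        return self.net(t.unsqueeze(-1))

# Dummy untrimmed video: 100 frames of 64x64 RGB.
frames = torch.randn(100, 3, 64, 64)
t = torch.linspace(0.0, 1.0, frames.size(0))  # normalized frame indices

vis, tmp = VisualEncoder(), TemporalEmbedding()
with torch.no_grad():
    # Joint features: concatenate visual and temporal embeddings per frame.
    z = torch.cat([vis(frames), tmp(t)], dim=1)  # shape (100, 64)

# Unsupervised action segmentation: cluster frames in the joint space.
labels = KMeans(n_clusters=5, n_init=10).fit_predict(z.numpy())
print(labels[:20])  # per-frame cluster ids; contiguous runs ~ action segments
```

Clustering over the concatenated space groups frames that look alike and occur at similar relative points in the video, which is what lets the clusters respect both visual and temporal appearance.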
