Paper Title
Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
Paper Authors
Paper Abstract
Recent action recognition models have achieved impressive results by integrating objects, their locations, and their interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, making these methods expensive to train and less scalable. At the same time, if a small set of annotated images is available, either within or outside the domain of interest, how could we leverage them for a downstream video task? We propose a learning framework, StructureViT (SViT for short), which demonstrates how the structure of a small number of images, available only during training, can be used to improve a video model. SViT relies on two key insights. First, as both images and videos contain structured information, we enrich a transformer model with a set of \emph{object tokens} that can be used across images and videos. Second, the scene representations of individual frames in a video should "align" with those of still images. This is achieved via a \emph{Frame-Clip Consistency} loss, which ensures the flow of structured information between images and videos. We explore a particular instantiation of scene structure, namely a \emph{Hand-Object Graph}, consisting of hands and objects with their locations as nodes, and physical relations of contact/no-contact as edges. SViT shows strong performance improvements on multiple video understanding tasks and datasets. Furthermore, it won the Ego4D CVPR'22 Object State Localization challenge. For code and pretrained models, visit the project page at \url{https://eladb3.github.io/SViT/}
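As a concrete illustration of the two ideas above (shared object tokens and a frame-clip consistency loss), here is a minimal PyTorch sketch. It is not the paper's implementation; the module names, tensor shapes, token counts, and the choice of an L1 alignment distance are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyObjectTokenTransformer(nn.Module):
    """Toy encoder in which learned object tokens are concatenated to patch
    tokens, so the same tokens can summarize scene structure for both still
    images and video frames."""

    def __init__(self, dim=256, num_object_tokens=6, depth=2, heads=4):
        super().__init__()
        self.object_tokens = nn.Parameter(torch.randn(1, num_object_tokens, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_object_tokens = num_object_tokens

    def forward(self, patch_tokens):  # patch_tokens: (B, N, D)
        obj = self.object_tokens.expand(patch_tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([obj, patch_tokens], dim=1))
        return out[:, :self.num_object_tokens]  # object-token outputs: (B, K, D)

def frame_clip_consistency_loss(model, clip_patches):
    """clip_patches: (B, T, N, D) patch tokens for a T-frame clip. Object
    tokens from independent per-frame passes ("image mode") are pulled toward
    object tokens from one joint pass over the clip ("video mode")."""
    b, t, n, d = clip_patches.shape
    frame_obj = model(clip_patches.reshape(b * t, n, d)).reshape(b, t, -1, d)
    clip_obj = model(clip_patches.reshape(b, t * n, d)).unsqueeze(1)
    # An assumed instantiation of the "alignment" described in the abstract.
    return F.l1_loss(frame_obj, clip_obj.expand_as(frame_obj))

# Usage with random inputs: 2 clips, 4 frames, 49 patches of dimension 256.
model = ToyObjectTokenTransformer()
loss = frame_clip_consistency_loss(model, torch.randn(2, 4, 49, 256))
loss.backward()
```

Because the object tokens are shared across the image and video forward passes, structure learned from annotated still images (e.g., the Hand-Object Graph supervision described above) can propagate into the video representation through this consistency term.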