Paper Title

Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos

Paper Authors

Tomáš Souček, Jean-Baptiste Alayrac, Antoine Miech, Ivan Laptev, Josef Sivic

Paper Abstract

Human actions often induce changes of object states such as "cutting an apple", "cleaning shoes" or "pouring coffee". In this paper, we seek to temporally localize object states (e.g. "empty" and "full" cup) together with the corresponding state-modifying actions ("pouring coffee") in long uncurated videos with minimal supervision. The contributions of this work are threefold. First, we develop a self-supervised model for jointly learning state-modifying actions together with the corresponding object states from an uncurated set of videos from the Internet. The model is self-supervised by the causal ordering signal, i.e. initial object state $\rightarrow$ manipulating action $\rightarrow$ end state. Second, to cope with noisy uncurated training data, our model incorporates a noise adaptive weighting module, supervised by a small number of annotated still images, that allows us to efficiently filter out irrelevant videos during training. Third, we collect a new dataset with more than 2600 hours of video and 34 thousand changes of object states, and manually annotate a part of this data to validate our approach. Our results demonstrate substantial improvements over prior work in both action and object state recognition in video.
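
The causal ordering signal can be illustrated with a small example: if a model produces per-frame scores for the initial object state, the state-modifying action, and the end state, the highest-scoring frame triple that respects the order initial state → action → end state can serve as a pseudo-label for self-supervised training. The sketch below is a simplified illustration of this constraint in NumPy, not the authors' implementation; the function name, the use of raw per-frame scores, and the selection rule are assumptions made for the example.

```python
# Hypothetical sketch of the causal ordering constraint: pick the
# highest-scoring frames (t1, ta, t2) with t1 < ta < t2 for the
# initial state, the action, and the end state, respectively.
import numpy as np

def best_causal_triplet(s_init, s_action, s_end):
    """Return frame indices (t1, ta, t2) with t1 < ta < t2 that maximize
    s_init[t1] + s_action[ta] + s_end[t2]."""
    T = len(s_init)
    assert T >= 3, "need at least three frames"

    # Best initial-state score among frames 0..t (prefix maximum).
    best_init = np.maximum.accumulate(s_init)
    # Best end-state score among frames t..T-1 (suffix maximum).
    best_end = np.maximum.accumulate(s_end[::-1])[::-1]

    # For each candidate action frame, combine the best compatible
    # initial-state frame (strictly before) and end-state frame (strictly after).
    total = np.full(T, -np.inf)
    for ta in range(1, T - 1):
        total[ta] = best_init[ta - 1] + s_action[ta] + best_end[ta + 1]

    ta = int(np.argmax(total))
    t1 = int(np.argmax(s_init[:ta]))
    t2 = ta + 1 + int(np.argmax(s_end[ta + 1:]))
    return t1, ta, t2

# Toy usage: random per-frame scores for a 10-frame clip.
rng = np.random.default_rng(0)
scores = rng.random((3, 10))
print(best_causal_triplet(*scores))
```

The prefix and suffix maxima keep the ordered search linear in the number of frames rather than cubic; in the paper's setting such ordered selections would additionally be weighted by the noise adaptive module to down-weight irrelevant videos.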
