Paper Title
Entropy-driven Unsupervised Keypoint Representation Learning in Videos
Paper Authors
Paper Abstract
Extracting informative representations from videos is fundamental for effectively learning various downstream tasks. We present a novel approach for unsupervised learning of meaningful representations from videos, leveraging the concept of image spatial entropy (ISE) that quantifies the per-pixel information in an image. We argue that the \textit{local entropy} of pixel neighborhoods and its temporal evolution create valuable intrinsic supervisory signals for learning prominent features. Building on this idea, we abstract visual features into a concise representation of keypoints that act as dynamic information transmitters, and design a deep learning model that learns, purely unsupervised, spatially and temporally consistent representations \textit{directly} from video frames. Two original information-theoretic losses, computed from local entropy, guide our model to discover consistent keypoint representations: a loss that maximizes the spatial information covered by the keypoints, and a loss that optimizes the keypoints' information transportation over time. We compare our keypoint representation to strong baselines on various downstream tasks, \eg, learning object dynamics. Our empirical results show superior performance for our information-driven keypoints, which resolve challenges like attendance to static and dynamic objects or to objects abruptly entering and leaving the scene.
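The abstract does not spell out how the local entropy signal is computed. As a point of reference, the local Shannon entropy of a pixel neighborhood is $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the probability of intensity bin $i$ in the neighborhood's histogram. The following is a minimal illustrative sketch of such a per-pixel entropy map, not the authors' implementation; the window size and bin count are arbitrary assumptions chosen for clarity.

import numpy as np

def local_entropy(gray: np.ndarray, window: int = 9, bins: int = 16) -> np.ndarray:
    """Per-pixel local Shannon entropy of a grayscale image with values in [0, 1].

    For each pixel, builds the intensity histogram of the surrounding
    (window x window) neighborhood and returns its Shannon entropy in bits.
    """
    pad = window // 2
    padded = np.pad(gray, pad, mode="reflect")  # reflect-pad so borders get full windows
    h, w = gray.shape
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + window, j:j + window]
            hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
            p = hist / hist.sum()           # empirical bin probabilities
            p = p[p > 0]                    # drop empty bins to avoid log(0)
            out[i, j] = -(p * np.log2(p)).sum()
    return out

# Usage: entropy_map = local_entropy(frame_gray)  # frame_gray: 2-D float array in [0, 1]

The explicit double loop trades speed for readability; a practical version would vectorize the histogram computation or use a sliding-window routine.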