Paper Title

ViA: View-invariant Skeleton Action Representation Learning via Motion Retargeting

Paper Authors

Di Yang, Yaohui Wang, Antitza Dantcheva, Lorenzo Garattoni, Gianpiero Francesca, Francois Bremond

Paper Abstract

Current self-supervised approaches for skeleton action representation learning often focus on constrained scenarios, where videos and skeleton data are recorded in laboratory settings. When dealing with estimated skeleton data in real-world videos, such methods perform poorly due to the large variations across subjects and camera viewpoints. To address this issue, we introduce ViA, a novel View-Invariant Autoencoder for self-supervised skeleton action representation learning. ViA leverages motion retargeting between different human performers as a pretext task, in order to disentangle the latent action-specific `Motion' features on top of the visual representation of a 2D or 3D skeleton sequence. Such `Motion' features are invariant to skeleton geometry and camera view and allow ViA to facilitate both cross-subject and cross-view action classification tasks. We conduct a study focusing on transfer-learning for skeleton-based action recognition with self-supervised pre-training on real-world data (e.g., Posetics). Our results showcase that skeleton representations learned from ViA are generic enough to improve upon state-of-the-art action classification accuracy, not only on 3D laboratory datasets such as NTU-RGB+D 60 and NTU-RGB+D 120, but also on real-world datasets where only 2D data are accurately estimated, e.g., Toyota Smarthome, UAV-Human and Penn Action.
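To make the motion-retargeting pretext task concrete, below is a minimal, hypothetical PyTorch sketch of a disentangling autoencoder in the spirit of ViA: one encoder extracts action-specific `Motion' features, a second extracts performer/view-specific `Character' features, and a decoder recombines them so one subject's skeleton can be driven by another subject's motion. All module names, dimensions, and the loss terms here are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of a motion-retargeting autoencoder (assumed design, not the ViA code).
import torch
import torch.nn as nn

class SkeletonEncoder(nn.Module):
    """Encodes a skeleton sequence (B, T, J*C) into a single latent vector."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, latent_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool1d(1),   # temporal pooling -> one code per sequence
        )

    def forward(self, x):              # x: (B, T, J*C)
        x = x.transpose(1, 2)           # (B, J*C, T) for Conv1d
        return self.net(x).squeeze(-1)  # (B, latent_dim)

class SkeletonDecoder(nn.Module):
    """Decodes a (motion, character) code pair back into a skeleton sequence."""
    def __init__(self, latent_dim, out_dim, seq_len):
        super().__init__()
        self.seq_len = seq_len
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim * seq_len),
        )

    def forward(self, z_motion, z_char):
        z = torch.cat([z_motion, z_char], dim=-1)
        return self.net(z).view(z.size(0), self.seq_len, -1)  # (B, T, J*C)

# Two encoders disentangle action-specific 'Motion' from performer/view-specific 'Character'.
T, J, C, D = 64, 17, 2, 128             # frames, joints, 2D coords, latent size (assumed)
motion_enc = SkeletonEncoder(J * C, D)
char_enc = SkeletonEncoder(J * C, D)
decoder = SkeletonDecoder(D, J * C, T)

# Pretext task: combine subject A's motion code with subject B's character code,
# i.e., retarget A's motion onto B's skeleton. Random tensors stand in for data.
x_a, x_b = torch.randn(8, T, J * C), torch.randn(8, T, J * C)
retargeted = decoder(motion_enc(x_a), char_enc(x_b))  # B's skeleton performing A's motion

# Self-reconstruction term only; a cross-reconstruction or cycle-consistency term
# would also be needed in practice but is omitted here for brevity.
recon = decoder(motion_enc(x_a), char_enc(x_a))
loss = nn.functional.mse_loss(recon, x_a)
loss.backward()
```

In this sketch the temporal pooling in the encoder is what makes the `Motion' code a sequence-level summary; because the decoder must rebuild a plausible sequence from that code plus a different performer's `Character' code, the motion code is pushed to be independent of skeleton geometry and viewpoint, which is the property the abstract attributes to ViA's features.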
