Paper Title
Graph and Temporal Convolutional Networks for 3D Multi-person Pose Estimation in Monocular Videos
Paper Authors
Paper Abstract
Despite the recent progress, 3D multi-person pose estimation from monocular videos is still challenging due to the commonly encountered problem of missing information caused by occlusion, partially out-of-frame target persons, and inaccurate person detection. To tackle this problem, we propose a novel framework integrating graph convolutional networks (GCNs) and temporal convolutional networks (TCNs) to robustly estimate camera-centric multi-person 3D poses that do not require camera parameters. In particular, we introduce a human-joint GCN, which, unlike the existing GCN, is based on a directed graph that employs the 2D pose estimator's confidence scores to improve the pose estimation results. We also introduce a human-bone GCN, which models the bone connections and provides more information beyond human joints. The two GCNs work together to estimate the spatial frame-wise 3D poses and can make use of both visible joint and bone information in the target frame to estimate the occluded or missing human-part information. To further refine the 3D pose estimation, we use our temporal convolutional networks (TCNs) to enforce the temporal and human-dynamics constraints. We use a joint-TCN to estimate person-centric 3D poses across frames, and propose a velocity-TCN to estimate the speed of 3D joints to ensure the consistency of the 3D pose estimation in consecutive frames. Finally, to estimate the 3D human poses for multiple persons, we propose a root-TCN that estimates camera-centric 3D poses without requiring camera parameters. Quantitative and qualitative evaluations demonstrate the effectiveness of the proposed method.
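To make the human-joint GCN idea concrete, below is a minimal sketch (assuming PyTorch; the class name ConfidenceWeightedJointGCN and all tensor shapes are illustrative assumptions, not the authors' released implementation) of a graph convolution over a directed human-joint graph in which the 2D pose estimator's per-joint confidence scores weight the incoming messages, so that occluded or out-of-frame joints are reconstructed mostly from their confident neighbours.

```python
# Minimal sketch of a confidence-weighted directed-graph convolution over J human joints.
# Hypothetical illustration of the idea described in the abstract, not the paper's code.
import torch
import torch.nn as nn


class ConfidenceWeightedJointGCN(nn.Module):
    """One directed-graph convolution layer over J human joints."""

    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        # adj: (J, J) binary matrix with adj[i, j] = 1 if there is a directed edge j -> i
        self.register_buffer("adj", adj.float())
        self.lin_self = nn.Linear(in_dim, out_dim)   # transform of the joint itself
        self.lin_neigh = nn.Linear(in_dim, out_dim)  # transform of aggregated neighbours

    def forward(self, x, conf):
        # x:    (B, J, in_dim)  per-joint features (e.g. 2D coordinates)
        # conf: (B, J)          2D-detector confidence per joint, in [0, 1]
        # Scale each sender joint's message by its confidence before aggregation.
        weighted = x * conf.unsqueeze(-1)                                  # (B, J, in_dim)
        # Normalise by the total incoming confidence so the feature scale stays stable.
        denom = (self.adj.unsqueeze(0) * conf.unsqueeze(1)).sum(-1, keepdim=True)
        agg = torch.einsum("ij,bjd->bid", self.adj, weighted) / denom.clamp(min=1e-6)
        return torch.relu(self.lin_self(x) + self.lin_neigh(agg))


if __name__ == "__main__":
    # Toy example: 4 joints on a chain, with directed edges in both directions.
    edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
    adj = torch.zeros(4, 4)
    for src, dst in edges:
        adj[dst, src] = 1.0
    layer = ConfidenceWeightedJointGCN(in_dim=2, out_dim=16, adj=adj)
    x = torch.randn(8, 4, 2)       # batch of 8 poses, 2D joint coordinates
    conf = torch.rand(8, 4)        # per-joint detection confidences
    print(layer(x, conf).shape)    # torch.Size([8, 4, 16])
```

In this sketch a joint with near-zero confidence contributes almost nothing to its neighbours, while its own estimate is driven by the confident joints feeding into it; the human-bone GCN and the joint-, velocity-, and root-TCNs described in the abstract would add bone-level and temporal constraints on top of such frame-wise features.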