Paper Title

Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping

Paper Authors

Adam W. Harley, Shrinidhi K. Lakshmikanth, Paul Schydlo, Katerina Fragkiadaki

Paper Abstract

We hypothesize that an agent that can look around in static scenes can learn rich visual representations applicable to 3D object tracking in complex dynamic scenes. We are motivated in this pursuit by the fact that the physical world itself is mostly static, and multiview correspondence labels are relatively cheap to collect in static scenes, e.g., by triangulation. We propose to leverage multiview data of static points in arbitrary scenes (static or dynamic), to learn a neural 3D mapping module which produces features that are correspondable across time. The neural 3D mapper consumes RGB-D data as input, and produces a 3D voxel grid of deep features as output. We train the voxel features to be correspondable across viewpoints, using a contrastive loss, and correspondability across time emerges automatically. At test time, given an RGB-D video with approximate camera poses, and given the 3D box of an object to track, we track the target object by generating a map for each timestep and locating the object's features within each map. In contrast to models that represent video streams in 2D or 2.5D, our model's 3D scene representation is disentangled from projection artifacts, is stable under camera motion, and is robust to partial occlusions. We test the proposed architectures in challenging simulated and real data, and show that our unsupervised 3D object trackers outperform prior unsupervised 2D and 2.5D trackers, and approach the accuracy of supervised trackers. This work demonstrates that 3D object trackers can emerge without tracking labels, through multiview self-supervision on static data.
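To make the two stages described in the abstract concrete, here is a minimal sketch assuming PyTorch. The names (`contrastive_voxel_loss`, `locate_object`, `obj_template`) are illustrative rather than the authors' code, the mapper network that produces the voxel feature grids is assumed to exist, and the argmax correlation is a deliberate simplification of the paper's test-time search within each map:

```python
import torch
import torch.nn.functional as F

def contrastive_voxel_loss(feats_a, feats_b, idx_a, idx_b, temperature=0.07):
    """InfoNCE-style contrastive loss over voxel features of the same
    static 3D points observed from two viewpoints: each point's feature
    in view A should match its own feature in view B (positive) more
    than the features of all other sampled points (negatives).

    feats_a, feats_b: (C, D, H, W) voxel feature grids from two views.
    idx_a, idx_b:     (N, 3) integer voxel coordinates of N static
                      points, aligned so row i corresponds in both views.
    """
    # Gather the feature vector at each corresponding voxel: (N, C).
    za = feats_a[:, idx_a[:, 0], idx_a[:, 1], idx_a[:, 2]].t()
    zb = feats_b[:, idx_b[:, 0], idx_b[:, 1], idx_b[:, 2]].t()
    za, zb = F.normalize(za, dim=1), F.normalize(zb, dim=1)
    logits = za @ zb.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(za.shape[0], device=za.device)
    return F.cross_entropy(logits, targets)   # positives on the diagonal

def locate_object(map_t, obj_template):
    """Coarsely localize the tracked object in one timestep's map by
    cross-correlating its mean feature against every voxel and taking
    the argmax (a stand-in for the paper's within-map feature search).

    map_t:        (C, D, H, W) voxel features for the current frame.
    obj_template: (C,) mean feature of the object's voxels at t=0,
                  pooled from inside the given 3D box.
    """
    C, D, H, W = map_t.shape
    grid = F.normalize(map_t.reshape(C, -1), dim=0)   # (C, D*H*W)
    scores = F.normalize(obj_template, dim=0) @ grid  # (D*H*W,)
    flat = scores.argmax().item()
    return flat // (H * W), (flat // W) % H, flat % W  # (d, h, w) voxel
```

Because the positive pairs come only from static points matched across viewpoints, this training signal needs no tracking labels; correspondability across time is a byproduct of the learned features rather than something supervised directly.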
