Paper Title
Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data
Paper Authors
Paper Abstract
Segmenting or detecting objects in sparse Lidar point clouds are two important tasks in autonomous driving that allow a vehicle to act safely in its 3D environment. The best performing methods in 3D semantic segmentation or object detection rely on a large amount of annotated data. Yet annotating 3D Lidar data for these tasks is tedious and costly. In this context, we propose a self-supervised pre-training method for 3D perception models that is tailored to autonomous driving data. Specifically, we leverage the availability of synchronized and calibrated image and Lidar sensors in autonomous driving setups to distill self-supervised pre-trained image representations into 3D models. Hence, our method does not require any point cloud or image annotations. The key ingredient of our method is the use of superpixels to pool 3D point features and 2D pixel features in visually similar regions. We then train a 3D network on the self-supervised task of matching these pooled point features with the corresponding pooled image pixel features. The advantages of contrasting regions obtained by superpixels are that: (1) grouping together pixels and points of visually coherent regions leads to a more meaningful contrastive task that produces features well adapted to 3D semantic segmentation and 3D object detection; (2) all the different regions have the same weight in the contrastive loss regardless of the number of 3D points sampled in these regions; (3) it mitigates the noise produced by the incorrect matching of points and pixels due to occlusions between the different sensors. Extensive experiments on autonomous driving datasets demonstrate the ability of our image-to-Lidar distillation strategy to produce 3D representations that transfer well to semantic segmentation and object detection tasks.
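To make the superpixel-pooled contrastive objective described in the abstract concrete, below is a minimal PyTorch sketch. It assumes precomputed inputs: per-point features from the 3D network, per-pixel features from a frozen self-supervised image backbone, and a superpixel id for every point and pixel. The function names, tensor layout, and the use of a plain InfoNCE-style loss are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn.functional as F

def pool_by_superpixel(features, sp_ids, num_superpixels):
    # Average all feature vectors that fall into the same superpixel
    # (a scatter-mean over superpixel ids).
    dim = features.shape[1]
    sums = torch.zeros(num_superpixels, dim, device=features.device)
    sums.index_add_(0, sp_ids, features)
    counts = torch.zeros(num_superpixels, device=features.device)
    counts.index_add_(0, sp_ids, torch.ones_like(sp_ids, dtype=torch.float))
    # clamp avoids division by zero; in practice only superpixels visible
    # to both sensors would be kept in the loss.
    return sums / counts.clamp(min=1).unsqueeze(1)

def superpixel_contrastive_loss(point_feats, point_sp, pixel_feats, pixel_sp,
                                num_superpixels, temperature=0.07):
    # Pool 3D point features and 2D pixel features per superpixel, then
    # match each pooled point feature to its pooled pixel counterpart:
    # the diagonal of the similarity matrix holds the positive pairs.
    q = F.normalize(pool_by_superpixel(point_feats, point_sp, num_superpixels), dim=1)
    k = F.normalize(pool_by_superpixel(pixel_feats, pixel_sp, num_superpixels), dim=1)
    logits = q @ k.t() / temperature                      # (S, S) similarities
    targets = torch.arange(num_superpixels, device=logits.device)
    return F.cross_entropy(logits, targets)

Note how pooling directly realizes properties (2) and (3) from the abstract: each superpixel contributes exactly one row to the loss regardless of how many 3D points it contains, and averaging over a region dampens the effect of individual point-to-pixel mismatches caused by sensor occlusions.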