Paper Title


MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

Authors

Renrui Zhang, Han Qiu, Tai Wang, Ziyu Guo, Yiwen Tang, Xuanzhuo Xu, Ziteng Cui, Yu Qiao, Peng Gao, Hongsheng Li

Abstract


Monocular 3D object detection has long been a challenging task in autonomous driving. Most existing methods follow conventional 2D detectors to first localize object centers, and then predict 3D attributes by neighboring features. However, only using local visual features is insufficient to understand the scene-level 3D spatial structures and ignores the long-range inter-object depth relations. In this paper, we introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR. We modify the vanilla transformer to be depth-aware and guide the whole detection process by contextual depth cues. Specifically, concurrent to the visual encoder that captures object appearances, we introduce to predict a foreground depth map, and specialize a depth encoder to extract non-local depth embeddings. Then, we formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions. In this way, each object query estimates its 3D attributes adaptively from the depth-guided regions on the image and is no longer constrained to local visual features. On KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations. Besides, our depth-guided modules can also be plug-and-play to enhance multi-view 3D object detectors on nuScenes dataset, demonstrating our superior generalization capacity. Code is available at https://github.com/ZrrSkywalker/MonoDETR.
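To make the decoder design described in the abstract more concrete, the following is a minimal PyTorch sketch of one depth-guided decoder layer: learnable object queries first cross-attend to the non-local depth embeddings produced by the depth encoder, then to the visual embeddings, followed by self-attention and a feed-forward network. The class name DepthGuidedDecoderLayer, the exact ordering of the attention blocks, and all dimensions (d_model=256, 50 object queries) are illustrative assumptions, not the released MonoDETR implementation.

```python
import torch
import torch.nn as nn

class DepthGuidedDecoderLayer(nn.Module):
    """Illustrative decoder layer: object queries interact with depth embeddings
    before visual embeddings, so 3D attributes are estimated from depth-guided
    regions rather than only local visual features (layer order is an assumption)."""

    def __init__(self, d_model=256, n_heads=8, d_ffn=1024):
        super().__init__()
        self.depth_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.ReLU(inplace=True), nn.Linear(d_ffn, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, queries, depth_embed, visual_embed):
        # queries:      (B, num_queries, C)  learnable 3D object candidates
        # depth_embed:  (B, H*W, C)          tokens from the depth encoder
        # visual_embed: (B, H*W, C)          tokens from the visual encoder
        q = self.norms[0](queries + self.depth_cross_attn(queries, depth_embed, depth_embed)[0])
        q = self.norms[1](q + self.visual_cross_attn(q, visual_embed, visual_embed)[0])
        q = self.norms[2](q + self.self_attn(q, q, q)[0])
        q = self.norms[3](q + self.ffn(q))
        return q

# Toy usage with random tensors: 50 object queries, 600 feature-map tokens.
B, HW, C = 2, 600, 256
layer = DepthGuidedDecoderLayer(d_model=C)
out = layer(torch.randn(B, 50, C), torch.randn(B, HW, C), torch.randn(B, HW, C))
print(out.shape)  # torch.Size([2, 50, 256])
```

Stacking several such layers and attaching prediction heads for 3D attributes (depth, size, orientation, 3D center) would follow the usual DETR-style decoder pattern; refer to the linked repository for the authors' actual design.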
