Paper Title
Attention-Based Depth Distillation with 3D-Aware Positional Encoding for Monocular 3D Object Detection
Paper Authors
Paper Abstract
Monocular 3D object detection is a low-cost but challenging task, as it requires generating accurate 3D localization solely from a single image input. Recently developed depth-assisted methods show promising results by using explicit depth maps as intermediate features, which are either precomputed by monocular depth estimation networks or jointly estimated with 3D object detection. However, inevitable errors in the estimated depth priors can misalign semantic information and 3D localization, resulting in feature smearing and suboptimal predictions. To mitigate this issue, we propose ADD, an Attention-based Depth knowledge Distillation framework with 3D-aware positional encoding. Unlike previous knowledge distillation frameworks that adopt stereo- or LiDAR-based teachers, we build our teacher with the same architecture as the student but with extra ground-truth depth as input. Owing to this teacher design, our framework is seamless, free of domain gaps, easy to implement, and compatible with object-wise ground-truth depth. Specifically, we leverage both intermediate features and responses for knowledge distillation. To capture long-range 3D dependencies, we propose \emph{3D-aware self-attention} and \emph{target-aware cross-attention} modules for student adaptation. Extensive experiments verify the effectiveness of our framework on the challenging KITTI 3D object detection benchmark. We implement our framework on three representative monocular detectors and achieve state-of-the-art performance with no additional inference computational cost relative to the baseline models. Our code is available at https://github.com/rockywind/ADD.
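To make the distillation scheme concrete, below is a minimal PyTorch sketch (not the authors' released code) of attention-based feature distillation with a depth-derived positional encoding. The names `pos_from_depth`, `Adapter3D`, and `distill_loss` are hypothetical, and the sinusoidal depth encoding is only a stand-in for the paper's 3D-aware positional encoding.

```python
# Minimal sketch of attention-based feature distillation with a
# depth-derived positional encoding. All names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pos_from_depth(depth, dim):
    """Sinusoidal encoding of per-pixel depth (a stand-in for the paper's
    3D-aware positional encoding). depth: (B, 1, H, W) -> (B, dim, H, W).
    Assumes `dim` is even."""
    freqs = torch.exp(torch.linspace(0, 4, dim // 2, device=depth.device))
    angles = depth * freqs.view(1, -1, 1, 1)           # (B, dim/2, H, W)
    return torch.cat([angles.sin(), angles.cos()], dim=1)

class Adapter3D(nn.Module):
    """Self-attention adapter applied to student features before matching
    them against teacher features (feature-level distillation)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat, depth):
        b, c, h, w = feat.shape
        pos = pos_from_depth(depth, c)                 # inject 3D cues
        x = (feat + pos).flatten(2).transpose(1, 2)    # (B, HW, C)
        x = self.norm(x + self.attn(x, x, x)[0])       # long-range mixing
        return x.transpose(1, 2).reshape(b, c, h, w)

def distill_loss(student_feat, teacher_feat):
    """L2 feature-imitation loss; the teacher is frozen during
    student training, hence the detach."""
    return F.mse_loss(student_feat, teacher_feat.detach())
```

In a setup like this, the teacher shares the student's detector architecture but additionally consumes ground-truth depth; it is trained first and then frozen, while the student minimizes its detection loss plus `distill_loss` on the adapted features, so inference uses the student alone at no extra cost.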