Paper Title
HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection
Paper Authors
Paper Abstract
Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors - such as camera with lidar or radar - by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we propose HRFuser, a modular architecture for multi-modal 2D object detection. It fuses multiple sensors in a multi-resolution fashion and scales to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block as the means to perform fusion of multiple modalities at multiple resolutions. We demonstrate via extensive experiments on nuScenes and the adverse conditions DENSE datasets that our model effectively leverages complementary features from additional modalities, substantially improving upon camera-only performance and consistently outperforming state-of-the-art 3D and 2D fusion methods evaluated on 2D object detection metrics. The source code is publicly available.
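The fusion mechanism named in the abstract — cross-attention that lets camera features attend to features from an additional modality — can be sketched conceptually. The snippet below is a minimal single-head, single-resolution illustration, not HRFuser's actual multi-window implementation; all function and variable names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(cam, aux, Wq, Wk, Wv):
    """Fuse auxiliary-modality tokens (e.g. lidar features) into camera
    tokens via cross-attention: queries come from the camera stream,
    keys/values from the auxiliary stream."""
    q = cam @ Wq                      # (N_cam, d) queries from camera
    k = aux @ Wk                      # (N_aux, d) keys from aux modality
    v = aux @ Wv                      # (N_aux, d) values from aux modality
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return cam + attn @ v             # residual add keeps camera layout

rng = np.random.default_rng(0)
d = 8
cam = rng.standard_normal((16, d))    # flattened camera feature tokens
aux = rng.standard_normal((32, d))    # flattened lidar/radar feature tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused = cross_attention_fuse(cam, aux, Wq, Wk, Wv)
print(fused.shape)                    # output keeps the camera token count
```

Because the output retains the camera stream's token layout, such a block can be dropped into each resolution branch of a high-resolution backbone and repeated per additional modality, which is how an architecture of this kind can scale to an arbitrary number of input sensors.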