Paper Title
Time3D: End-to-End Joint Monocular 3D Object Detection and Tracking for Autonomous Driving
Paper Authors
Paper Abstract
While monocular 3D object detection and 2D multi-object tracking can be leveraged separately and applied to sequence images in a frame-by-frame fashion, a stand-alone tracker cuts off the transmission of uncertainty from the 3D detector to tracking and cannot pass tracking-error differentials back to the 3D detector. In this work, we propose jointly training 3D detection and 3D tracking from only monocular videos in an end-to-end manner. The key component is a novel spatial-temporal information flow module that aggregates geometric and appearance features to predict robust similarity scores across all objects in current and past frames. Specifically, we leverage the attention mechanism of the transformer, in which self-attention aggregates spatial information within a specific frame, and cross-attention exploits the relations and affinities of all objects in the temporal domain of sequence frames. The affinities are then supervised to estimate the trajectory and guide the flow of information between corresponding 3D objects. In addition, we propose a temporal-consistency loss that explicitly incorporates 3D target motion modeling into the learning, making the 3D trajectory smooth in the world coordinate system. Time3D achieves 21.4% AMOTA and 13.6% AMOTP on the nuScenes 3D tracking benchmark, surpassing all published competitors, and runs at 38 FPS, while achieving 31.2% mAP and 39.4% NDS on the nuScenes 3D detection benchmark.
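To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (a) a transformer block in which self-attention aggregates object features within the current frame and cross-attention relates them to past-frame objects before a pairwise affinity matrix is produced, and (b) a constant-velocity smoothness penalty as one plausible form of the temporal-consistency loss. The names SpatialTemporalFlow and temporal_consistency_loss, the feature dimension, and the scaled dot-product affinity are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SpatialTemporalFlow(nn.Module):
    """Hypothetical sketch of the spatial-temporal information flow idea:
    self-attention mixes object features within the current frame (spatial),
    cross-attention relates them to past-frame objects (temporal), and a
    pairwise affinity matrix is produced for track association."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, cur: torch.Tensor, past: torch.Tensor):
        # cur:  (B, N, dim) fused geometric + appearance features, current frame
        # past: (B, M, dim) features of objects from past frames
        spatial, _ = self.self_attn(cur, cur, cur)      # spatial aggregation
        cur = self.norm1(cur + spatial)
        temporal, _ = self.cross_attn(cur, past, past)  # temporal relation
        fused = self.norm2(cur + temporal)
        # Scaled dot-product affinity between current and past objects;
        # in the paper such scores are supervised with track identities.
        affinity = torch.einsum("bnd,bmd->bnm", fused, past)
        affinity = affinity / fused.shape[-1] ** 0.5
        return fused, affinity.softmax(dim=-1)


def temporal_consistency_loss(track: torch.Tensor) -> torch.Tensor:
    """Assumed form of the smoothness term: penalize deviation from constant
    velocity for one matched track of world-coordinate centers, shape (T, 3)."""
    velocity = track[1:] - track[:-1]
    acceleration = velocity[1:] - velocity[:-1]
    return acceleration.norm(dim=-1).mean()
```

Supervising the affinity matrix with track identities lets association gradients flow back into the detector's features, which is the end-to-end coupling the abstract argues a stand-alone tracker forgoes.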