Paper Title
FE-Fusion-VPR: Attention-based Multi-Scale Network Architecture for Visual Place Recognition by Fusing Frames and Events
Paper Authors
Paper Abstract
Traditional visual place recognition (VPR), which usually relies on standard cameras, is prone to failure under glare or high-speed motion. By contrast, event cameras offer low latency, high temporal resolution, and high dynamic range, which can address these issues. Nevertheless, event cameras tend to fail in weakly textured or motionless scenes, while standard cameras can still provide appearance information in such cases. Thus, exploiting the complementarity of standard cameras and event cameras can effectively improve the performance of VPR algorithms. In this paper, we propose FE-Fusion-VPR, an attention-based multi-scale network architecture for VPR that fuses frames and events. First, the intensity frame and event volume are fed into a two-stream feature extraction network for shallow feature fusion. Next, features at three scales are obtained through a multi-scale fusion network and aggregated into three sub-descriptors using a VLAD layer. Finally, the weight of each sub-descriptor is learned through a descriptor re-weighting network to obtain the final refined descriptor. Experimental results show that on the Brisbane-Event-VPR and DDD20 datasets, the Recall@1 of our FE-Fusion-VPR is 29.26% and 33.59% higher than that of Event-VPR and Ensemble-EventVPR, respectively, and 7.00% and 14.15% higher than that of MultiRes-NetVLAD and NetVLAD, respectively. To our knowledge, this is the first end-to-end network that directly fuses frames and events for VPR and surpasses existing event-based and frame-based SOTA methods.
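The abstract outlines a three-stage pipeline: two-stream frame/event feature extraction with shallow fusion, multi-scale fusion with per-scale VLAD aggregation into sub-descriptors, and learned re-weighting of the sub-descriptors. The PyTorch sketch below illustrates that data flow only; it is not the authors' implementation. The backbone layers, channel widths, number of event voxel bins, cluster count, and the softmax-based re-weighting head are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NetVLAD(nn.Module):
    """Simplified NetVLAD aggregation layer (soft cluster assignment)."""

    def __init__(self, num_clusters: int, dim: int):
        super().__init__()
        self.num_clusters = num_clusters
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)  # soft-assignment scores
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, _, _ = x.shape
        soft = F.softmax(self.assign(x), dim=1)              # (B, K, H, W)
        x_flat = x.view(b, c, -1)                             # (B, C, N)
        soft = soft.view(b, self.num_clusters, -1)            # (B, K, N)
        # Soft-assigned residuals to each centroid.
        vlad = torch.einsum('bkn,bcn->bkc', soft, x_flat) \
             - soft.sum(dim=-1, keepdim=True) * self.centroids.unsqueeze(0)
        vlad = F.normalize(vlad, dim=-1).flatten(1)           # intra-normalize, then flatten
        return F.normalize(vlad, dim=-1)                      # (B, K*C)


class FEFusionVPRSketch(nn.Module):
    """Illustrative frame/event fusion pipeline: two-stream extraction,
    three-scale fusion, per-scale VLAD sub-descriptors, learned re-weighting."""

    def __init__(self, event_bins=5, base=32, clusters=16):
        super().__init__()
        # Two-stream shallow feature extraction (frame: 3-channel image, events: voxel bins).
        self.frame_stream = nn.Sequential(nn.Conv2d(3, base, 3, 2, 1), nn.ReLU())
        self.event_stream = nn.Sequential(nn.Conv2d(event_bins, base, 3, 2, 1), nn.ReLU())
        # Multi-scale fusion: three progressively downsampled fused feature maps.
        self.scale1 = nn.Sequential(nn.Conv2d(2 * base, base, 3, 1, 1), nn.ReLU())
        self.scale2 = nn.Sequential(nn.Conv2d(base, base, 3, 2, 1), nn.ReLU())
        self.scale3 = nn.Sequential(nn.Conv2d(base, base, 3, 2, 1), nn.ReLU())
        # One VLAD head per scale -> three sub-descriptors.
        self.vlads = nn.ModuleList([NetVLAD(clusters, base) for _ in range(3)])
        # Descriptor re-weighting: one learned weight per sub-descriptor (assumed form).
        self.reweight = nn.Linear(3 * clusters * base, 3)

    def forward(self, frame, events):
        fused = torch.cat([self.frame_stream(frame), self.event_stream(events)], dim=1)
        f1 = self.scale1(fused)
        f2 = self.scale2(f1)
        f3 = self.scale3(f2)
        subs = [vlad(f) for vlad, f in zip(self.vlads, (f1, f2, f3))]   # three sub-descriptors
        cat = torch.cat(subs, dim=1)
        w = F.softmax(self.reweight(cat), dim=1)                        # per-sub-descriptor weights
        refined = torch.cat([w[:, i:i + 1] * s for i, s in enumerate(subs)], dim=1)
        return F.normalize(refined, dim=1)                              # final place descriptor


if __name__ == "__main__":
    model = FEFusionVPRSketch()
    frame = torch.randn(2, 3, 128, 128)    # intensity frames
    events = torch.randn(2, 5, 128, 128)   # event volumes (5 time bins, hypothetical)
    print(model(frame, events).shape)       # torch.Size([2, 1536])
```

In this sketch the refined descriptor is simply the concatenation of the weighted sub-descriptors; the actual FE-Fusion-VPR re-weighting and attention mechanisms are described in the paper itself, not here.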