TransVPR: Transformer-based place recognition with multi-level attention aggregation

Title

TransVPR: Transformer-based place recognition with multi-level attention aggregation

Authors

Ruotong Wang, Yanqing Shen, Weiliang Zuo, Sanping Zhou, Nanning Zheng

Abstract

Visual place recognition is a challenging task for applications such as autonomous driving navigation and mobile robot localization. Distracting elements present in complex scenes often lead to deviations in the perception of visual place. To address this problem, it is crucial to integrate information from only task-relevant regions into image representations. In this paper, we introduce a novel holistic place recognition model, TransVPR, based on vision Transformers. It benefits from a desirable property of the self-attention operation in Transformers, which can naturally aggregate task-relevant features. Attention maps from multiple levels of the Transformer, which focus on different regions of interest, are further combined to generate a global image representation. In addition, the output tokens from Transformer layers, filtered by the fused attention mask, are treated as key-patch descriptors, which are used to perform spatial matching to re-rank the candidates retrieved by the global image features. The whole model allows end-to-end training with a single objective and image-level supervision. TransVPR achieves state-of-the-art performance on several real-world benchmarks while maintaining low computational time and storage requirements.
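
The pipeline described in the abstract (attention-weighted pooling per level, a fused mask that selects key patches, then re-ranking of retrieved candidates) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the max-based mask fusion, the threshold of 0.5, and the array shapes are all assumptions made for illustration. The abstract only says "spatial matching"; the sketch swaps in a simpler mutual nearest-neighbour count instead of any geometric verification step.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_descriptor(tokens, attn):
    """Pool patch tokens into one global vector (hypothetical helper).

    tokens: (L, N, D) patch tokens taken from L Transformer levels.
    attn:   (L, N) non-negative attention score per patch and level.
    """
    weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-12)
    pooled = np.einsum("ln,lnd->ld", weights, tokens)  # weighted sum per level
    return l2_normalize(pooled.reshape(-1))            # concatenate levels

def key_patches(tokens, attn, threshold=0.5):
    """Select key-patch descriptors with a fused attention mask.

    Assumption: levels are fused by an element-wise max of per-level
    attention maps, each scaled so its largest value is 1.
    """
    scaled = attn / (attn.max(axis=1, keepdims=True) + 1e-12)
    keep = scaled.max(axis=0) >= threshold
    desc = np.concatenate(list(tokens), axis=-1)       # (N, L*D)
    return l2_normalize(desc[keep])

def mutual_matches(a, b):
    """Count mutual nearest neighbours between two descriptor sets."""
    if len(a) == 0 or len(b) == 0:
        return 0
    sim = a @ b.T
    nn_ab = sim.argmax(axis=1)   # best match in b for each row of a
    nn_ba = sim.argmax(axis=0)   # best match in a for each row of b
    return int(np.sum(nn_ba[nn_ab] == np.arange(len(a))))

def rerank(query_patches, candidates):
    """Order (candidate_id, patch_descriptors) pairs by match count."""
    scored = [(mutual_matches(query_patches, p), cid) for cid, p in candidates]
    return [cid for _, cid in sorted(scored, key=lambda s: -s[0])]
```

In use, the global descriptors would first retrieve a shortlist of candidates by nearest-neighbour search, and `rerank` would then reorder only that shortlist, which keeps the more expensive patch-level comparison off the full database.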
