Paper Title

MixFormer: End-to-End Tracking with Iterative Mixed Attention

Paper Authors

Yutao Cui, Cheng Jiang, Limin Wang, Gangshan Wu

Paper Abstract

Tracking often uses a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the processes of feature extraction and target information integration, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations, and we propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows us to extract target-specific discriminative features and perform extensive communication between the target and the search area. Based on MAM, we build our MixFormer tracking framework simply by stacking multiple MAMs with progressive patch embedding and placing a localization head on top. In addition, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and we propose an effective score prediction module to select high-quality templates. Our MixFormer sets a new state-of-the-art performance on five tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, and UAV123. In particular, our MixFormer-L achieves an NP score of 79.9% on LaSOT, 88.9% on TrackingNet, and an EAO of 0.555 on VOT2020. We also perform in-depth ablation studies to demonstrate the effectiveness of simultaneous feature extraction and information integration. Code and trained models are publicly available at https://github.com/MCG-NJU/MixFormer.
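The abstract's two key ideas can be illustrated with a minimal sketch: mixed attention lets search-region queries attend over the concatenation of template and search tokens (mixing feature extraction with target integration), while the asymmetric scheme restricts template queries to template keys only, so the template branch is independent of the search frame and can be cached during online tracking. This is a simplified single-head NumPy illustration with identity Q/K/V projections, not the paper's implementation; all function names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixed_attention(template, search, asymmetric=True):
    """Single-head mixed attention sketch (identity Q/K/V for brevity).

    template: (Nt, d) template tokens; search: (Ns, d) search tokens.
    Search queries always attend to the concatenation of both token
    sets, so target information is integrated while features are
    extracted. With asymmetric=True, template queries attend only to
    template keys, making the template branch independent of the
    current search frame (it could be precomputed once online).
    """
    d = template.shape[-1]
    tokens = np.concatenate([template, search], axis=0)  # (Nt+Ns, d)

    # Search branch: queries from search, keys/values from all tokens.
    s_out = softmax(search @ tokens.T / np.sqrt(d)) @ tokens

    if asymmetric:
        # Template branch sees only template tokens (cheaper, cacheable).
        t_out = softmax(template @ template.T / np.sqrt(d)) @ template
    else:
        # Full mixed attention: template also attends to search tokens.
        t_out = softmax(template @ tokens.T / np.sqrt(d)) @ tokens
    return t_out, s_out
```

With the asymmetric scheme, changing the search frame leaves the template output unchanged, which is what makes handling multiple online templates affordable.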
