Paper Title

HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling

Paper Authors

Xiaosong Zhang, Yunjie Tian, Wei Huang, Qixiang Ye, Qi Dai, Lingxi Xie, Qi Tian

Paper Abstract

Recently, masked image modeling (MIM) has offered a new methodology of self-supervised pre-training of vision transformers. A key idea of efficient implementation is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), albeit hierarchical vision transformers (e.g., Swin Transformer) have potentially better properties in formulating vision inputs. In this paper, we offer a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) that enjoys both high efficiency and good performance in MIM. The key is to remove the unnecessary "local inter-unit operations", deriving structurally simple hierarchical vision transformers in which mask-units can be serialized like plain vision transformers. For this purpose, we start with Swin Transformer and (i) set the masking unit size to be the token size in the main stage of Swin Transformer, (ii) switch off inter-unit self-attentions before the main stage, and (iii) eliminate all operations after the main stage. Empirical studies demonstrate the advantageous performance of HiViT in terms of fully-supervised, self-supervised, and transfer learning. In particular, in running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B, and the performance gain generalizes to downstream tasks of detection and segmentation. Code will be made publicly available.
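
To make the abstract's design concrete, below is a minimal sketch (not the authors' released code) of why removing local inter-unit operations before the main stage lets a hierarchical encoder drop masked units at the input, so the surviving tokens are serialized and attended globally exactly like a plain ViT/MAE encoder. All module names, dimensions, and the MLP-only early stage are illustrative assumptions.

```python
# Minimal sketch, assuming: early stages touch only tokens inside one masking
# unit (no cross-unit attention), units are merged into main-stage tokens, and
# the main stage runs plain global self-attention over visible tokens only.
import torch
import torch.nn as nn


class IntraUnitStage(nn.Module):
    """Early stage: each masking unit is processed independently (MLP only),
    so visible units can be gathered before this stage without changing results."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (B, visible_units, tokens_per_unit, dim)
        return x + self.mlp(self.norm(x))


class MainStage(nn.Module):
    """Main stage: global self-attention over serialized visible tokens,
    as in a plain ViT / MAE encoder."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, visible_tokens, dim)
        h = self.norm(x)
        return x + self.attn(h, h, h, need_weights=False)[0]


def encode_visible(units, keep_idx, early, merge, main):
    """units: (B, num_units, tokens_per_unit, dim), one entry per masking unit.
    keep_idx: (B, num_keep) indices of visible (unmasked) units."""
    B, _, T, D = units.shape
    idx = keep_idx[..., None, None].expand(-1, -1, T, D)
    x = torch.gather(units, 1, idx)   # discard masked units up front (MAE-style)
    x = early(x)                      # intra-unit only, so masking is transparent
    x = merge(x.flatten(2))           # merge each unit into one main-stage token
    return main(x)                    # global attention over visible tokens


if __name__ == "__main__":
    B, num_units, T, D, D_main = 2, 196, 16, 96, 384   # illustrative sizes
    units = torch.randn(B, num_units, T, D)
    keep_idx = torch.stack([torch.randperm(num_units)[:49] for _ in range(B)])  # keep 25%
    out = encode_visible(units, keep_idx,
                         IntraUnitStage(D), nn.Linear(T * D, D_main), MainStage(D_main))
    print(out.shape)  # torch.Size([2, 49, 384])
```

The point of the sketch is structural: because nothing before the main stage mixes information across masking units, gathering the visible units first gives the same computation as masking afterwards, which is what makes the MAE-style efficient implementation possible for this hierarchical design.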
