Paper Title
Green Hierarchical Vision Transformer for Masked Image Modeling
Paper Authors
Paper Abstract
We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of three key designs. First, for window attention, we propose a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of self-attention w.r.t. the number of patches, group attention encourages a uniform partition in which the visible patches within each local window of arbitrary size can be grouped into groups of equal size; masked self-attention is then performed within each group. Second, we further improve the grouping strategy via a Dynamic Programming algorithm that minimizes the overall computation cost of the attention on the grouped patches. Third, as for the convolution layers, we convert them to Sparse Convolutions, which work seamlessly with sparse data, i.e., the visible patches in MIM. As a result, MIM can now work on most, if not all, hierarchical ViTs in a green and efficient way. For example, we can train hierarchical ViTs, e.g., Swin Transformer and Twins Transformer, about 2.7$\times$ faster and reduce GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and superior results on the downstream COCO object detection benchmark. Code and pre-trained models have been made publicly available at https://github.com/LayneH/GreenMIM.
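To make the Group Window Attention idea concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' released implementation: the function name, arguments, and the single-head attention are simplifications assumed for illustration. Visible patches from several local windows are packed into equal-sized groups, and attention inside each group is masked so that a patch only attends to patches from its own window.

```python
import torch
import torch.nn.functional as F

def group_window_attention(visible_tokens, window_ids, group_size):
    """Simplified group window attention over visible patches only (illustrative sketch).

    visible_tokens: (N, C) features of the visible patches (masked patches already dropped).
    window_ids:     (N,)  index of the local window each visible patch comes from.
    group_size:     tokens per group (in the paper, chosen by the dynamic-programming step).
    """
    N, C = visible_tokens.shape
    pad = (-N) % group_size                              # pad so N divides evenly into groups
    tokens = F.pad(visible_tokens, (0, 0, 0, pad))
    win_ids = F.pad(window_ids, (0, pad), value=-1)      # padded slots get a dummy window id

    num_groups = tokens.shape[0] // group_size
    tokens = tokens.view(num_groups, group_size, C)
    win_ids = win_ids.view(num_groups, group_size)

    # Masked self-attention inside each group: a patch may only attend to patches
    # that belong to the same local window; padded slots are masked out likewise.
    same_window = win_ids.unsqueeze(2) == win_ids.unsqueeze(1)       # (G, S, S)
    attn_bias = torch.zeros(same_window.shape, dtype=tokens.dtype)
    attn_bias = attn_bias.masked_fill(~same_window, float("-inf"))

    # Single-head scaled dot-product attention for brevity (multi-head is analogous).
    q = k = v = tokens
    scores = q @ k.transpose(-2, -1) / C ** 0.5 + attn_bias
    out = scores.softmax(dim=-1) @ v

    return out.reshape(-1, C)[:N]                        # drop padding, restore (N, C)


# Toy usage: 10 visible patches spread over 3 local windows, grouped 4 at a time.
feats = torch.randn(10, 32)
wids = torch.tensor([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
out = group_window_attention(feats, wids, group_size=4)
print(out.shape)  # torch.Size([10, 32])
```

In the actual method, the group size would be selected by the Dynamic Programming step described in the abstract to minimize the overall attention cost over the grouped patches; here it is simply passed in as a parameter.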