Paper Title


Architecture-Agnostic Masked Image Modeling -- From ViT back to CNN

Paper Authors

Siyuan Li, Di Wu, Fang Wu, Zelin Zang, Stan Z. Li

Abstract


Masked image modeling (MIM), an emerging self-supervised pre-training method, has shown impressive success across numerous downstream vision tasks with Vision Transformers. Its underlying idea is simple: a portion of the input image is masked out and then reconstructed via a pre-text task. However, the working principle behind MIM is not well explained, and previous studies insist that MIM primarily works for the Transformer family but is incompatible with CNNs. In this work, we observe that MIM essentially teaches the model to learn better middle-order interactions among patches for more generalized feature extraction. We then propose an Architecture-Agnostic Masked Image Modeling framework (A$^2$MIM), which is compatible with both Transformers and CNNs in a unified way. Extensive experiments on popular benchmarks show that A$^2$MIM learns better representations without explicit design and endows the backbone model with a stronger capability to transfer to various downstream tasks.
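The pre-text task described above can be illustrated with a minimal NumPy sketch: split an image into non-overlapping patches, mask a random subset, and compute a reconstruction loss only on the masked patches. This is a toy illustration of the generic MIM recipe, not the paper's A$^2$MIM implementation; the patch size, mask ratio, and mean-filling "model" stand-in are all illustrative assumptions.

```python
import numpy as np

def patchify(img, p):
    # Split a (H, W) image into non-overlapping p x p patches -> (N, p*p).
    H, W = img.shape
    patches = img.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3)
    return patches.reshape(-1, p * p)

def random_mask(num_patches, ratio, rng):
    # Boolean mask selecting a random subset of patches to hide.
    n_mask = int(num_patches * ratio)
    idx = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[idx[:n_mask]] = True
    return mask

def mim_loss(patches, recon, mask):
    # MIM computes the reconstruction loss only on the masked patches.
    return float(np.mean((patches[mask] - recon[mask]) ** 2))

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))
patches = patchify(img, 8)                 # 16 patches of 64 pixels each
mask = random_mask(len(patches), 0.6, rng)
masked_input = patches.copy()
masked_input[mask] = 0.0                   # zeros stand in for a learnable mask token
# A real backbone (ViT or CNN) would predict `recon` from `masked_input`;
# mean-filling from the visible patches is a trivial stand-in here.
recon = np.full_like(patches, patches[~mask].mean())
loss = mim_loss(patches, recon, mask)
```

In an actual training loop, `recon` would come from the backbone and the loss would be backpropagated; the key structural points are that the target is the original pixel content and only masked positions contribute to the loss.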
