Paper Title
Learning to Merge Tokens in Vision Transformers
Paper Authors
Paper Abstract
Transformers are widely applied to solve natural language understanding and computer vision tasks. While scaling up these architectures leads to improved performance, it often comes at the expense of much higher computational costs. In order for large-scale models to remain practical in real-world systems, there is a need for reducing their computational overhead. In this work, we present the PatchMerger, a simple module that reduces the number of patches or tokens the network has to process by merging them between two consecutive intermediate layers. We show that the PatchMerger achieves a significant speedup across various model sizes while matching the original performance both upstream and downstream after fine-tuning.
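The abstract does not spell out how tokens are merged. Below is a minimal sketch of one plausible mechanism, assuming a learned scoring matrix that maps N input tokens to M output tokens, where each output token is a softmax-weighted average of the inputs. The function name `patch_merger`, the shapes, and the scoring scheme are illustrative assumptions, not the paper's verbatim implementation.

```python
import numpy as np

def patch_merger(x, w):
    """Merge N input tokens into M output tokens (hypothetical sketch).

    x: (N, D) token embeddings from an intermediate layer.
    w: (M, D) learned scoring matrix, one row per output token.
    Returns an (M, D) array; each output token is a convex
    combination of the N input tokens.
    """
    scores = x @ w.T                                  # (N, M): affinity of each input token to each output slot
    scores -= scores.max(axis=0, keepdims=True)       # numerical stability before softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)     # softmax over the N input tokens, per output slot
    return weights.T @ x                              # (M, D) merged tokens

# Example: merge 196 patch tokens down to 8 between two layers.
rng = np.random.default_rng(0)
x = rng.normal(size=(196, 768))
w = rng.normal(size=(8, 768))
merged = patch_merger(x, w)
print(merged.shape)  # (8, 768)
```

Because every layer after the merge operates on M tokens instead of N, the cost of those layers shrinks accordingly, which is the source of the speedup the abstract reports.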