Paper Title

X-ViT: High Performance Linear Vision Transformer without Softmax

Paper Authors

Jeonggeun Song, Heung-Chang Lee

Paper Abstract

Vision transformers have become one of the most important models for computer vision tasks. Although they outperform prior works, they require heavy computational resources that scale quadratically with the number of tokens, $N$. This is a major drawback of the traditional self-attention (SA) algorithm. Here, we propose X-ViT, a ViT with a novel SA mechanism that has linear complexity. The main approach of this work is to eliminate the softmax nonlinearity from the original SA. We factorize the matrix multiplication of the SA mechanism without complicated linear approximation. By modifying only a few lines of code in the original SA, the proposed models outperform most transformer-based models on image classification and dense prediction tasks across most capacity regimes.
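The abstract's central idea, removing the softmax so that the attention product can be reassociated, can be illustrated with a short PyTorch sketch. With the softmax, $\mathrm{softmax}(QK^\top)V$ must materialize an $N \times N$ attention matrix, costing $O(N^2 d)$; without it, $(QK^\top)V = Q(K^\top V)$ costs $O(N d^2)$, which is linear in $N$. Note this is a generic linear-attention sketch under assumed shapes and scaling, not the exact X-ViT formulation, which the abstract does not fully specify.

```python
import torch

def linear_attention(q, k, v):
    """Softmax-free attention via reassociation of the matrix product.

    Standard SA computes softmax(q @ k^T / sqrt(d)) @ v, which requires an
    (N x N) attention matrix: O(N^2 * d). Dropping the softmax allows
    (q @ k^T) @ v to be computed as q @ (k^T @ v), where k^T @ v is only
    (d x d): O(N * d^2), linear in the token count N.

    Generic linear-attention sketch; not necessarily the X-ViT mechanism.
    """
    d = q.shape[-1]
    kv = k.transpose(-2, -1) @ v   # (..., d, d) instead of (..., N, N)
    return (q @ kv) / (d ** 0.5)   # illustrative scaling, as in standard SA

# Shapes: (batch, heads, N tokens, head_dim), e.g. 196 patch tokens.
q = torch.randn(1, 8, 196, 64)
k = torch.randn(1, 8, 196, 64)
v = torch.randn(1, 8, 196, 64)
out = linear_attention(q, k, v)    # -> (1, 8, 196, 64)
```

Because only the product ordering changes, this kind of mechanism can indeed be dropped into an existing attention module by modifying a few lines, consistent with the abstract's claim.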
