Paper Title
Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering
Paper Authors
Paper Abstract
Recent advances in Transformer architectures [1] have brought remarkable improvements to visual question answering (VQA). Nevertheless, Transformer-based VQA models are usually deep and wide to guarantee good performance, so they can only run on powerful GPU servers and cannot run on capacity-restricted platforms such as mobile phones. Therefore, it is desirable to learn an elastic VQA model that supports adaptive pruning at runtime to meet the efficiency constraints of different platforms. To this end, we present the bilaterally slimmable Transformer (BST), a general framework that can be seamlessly integrated into arbitrary Transformer-based VQA models to train a single model once and obtain various slimmed submodels of different widths and depths. To verify the effectiveness and generality of this method, we integrate the proposed BST framework with three typical Transformer-based VQA approaches, namely MCAN [2], UNITER [3], and CLIP-ViL [4], and conduct extensive experiments on two commonly-used benchmark datasets. In particular, one slimmed MCAN-BST submodel achieves comparable accuracy on VQA-v2, while being 0.38x smaller in model size and having 0.27x fewer FLOPs than the reference MCAN model. The smallest MCAN-BST submodel only has 9M parameters and 0.16G FLOPs during inference, making it possible to deploy it on a mobile device with less than 60 ms latency.
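The abstract describes the key capability only at a high level: a single trained model from which submodels of different widths and depths can be sliced at runtime. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation. All names (SlimmableLinear, SlimmableEncoder, width_ratio, depth) and the slicing scheme are our own assumptions; for simplicity it slims only the feed-forward inner width and the layer count, omitting attention, layer normalization, and the joint training procedure BST would use.

# Minimal sketch (assumed, not the paper's code) of runtime width/depth
# slimming: one set of full-size weights, from which narrower and
# shallower submodels are sliced at inference time.
import torch
import torch.nn as nn


class SlimmableLinear(nn.Linear):
    # Linear layer whose active input/output widths can be reduced at
    # runtime by slicing the leading rows/columns of the full weights.
    def forward(self, x, in_width=None, out_width=None):
        in_w = in_width or self.in_features
        out_w = out_width or self.out_features
        weight = self.weight[:out_w, :in_w]
        bias = self.bias[:out_w] if self.bias is not None else None
        return nn.functional.linear(x, weight, bias)


class SlimmableEncoder(nn.Module):
    # Stack of feed-forward blocks; depth is slimmed by running only the
    # first `depth` blocks, width by shrinking the FFN inner dimension.
    def __init__(self, hidden=512, ffn=2048, num_layers=6):
        super().__init__()
        self.fc1 = nn.ModuleList([SlimmableLinear(hidden, ffn) for _ in range(num_layers)])
        self.fc2 = nn.ModuleList([SlimmableLinear(ffn, hidden) for _ in range(num_layers)])
        self.ffn = ffn

    def forward(self, x, width_ratio=1.0, depth=None):
        ffn_w = int(self.ffn * width_ratio)
        depth = depth or len(self.fc1)
        for fc1, fc2 in zip(self.fc1[:depth], self.fc2[:depth]):
            h = torch.relu(fc1(x, out_width=ffn_w))
            x = x + fc2(h, in_width=ffn_w)  # residual keeps the hidden size fixed
        return x


x = torch.randn(2, 10, 512)
model = SlimmableEncoder()
full = model(x)                              # full model: 6 layers, full FFN width
slim = model(x, width_ratio=0.25, depth=3)   # slimmed submodel chosen at runtime

Because every submodel is a leading slice of the same weight tensors, a single checkpoint can serve all target platforms; a deployment would simply pick the width ratio and depth that meet its latency or memory budget.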