Paper Title
Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering
Paper Authors
Paper Abstract
Despite the excellent performance of vision-language pre-trained models (VLPs) on the conventional VQA task, they still suffer from two problems: First, VLPs tend to rely on language biases in datasets and fail to generalize to out-of-distribution (OOD) data. Second, they are inefficient in terms of memory footprint and computation. Although promising progress has been made on both problems, most existing works tackle them independently. To facilitate the application of VLPs to VQA tasks, it is imperative to jointly study VLP compression and OOD robustness, which, however, has not yet been explored. This paper investigates whether a VLP can be compressed and debiased simultaneously by searching for sparse and robust subnetworks. To this end, we systematically study the design of a training and compression pipeline to search for the subnetworks, as well as the assignment of sparsity to different modality-specific modules. Our experiments involve 3 VLPs, 2 compression methods, 4 training methods, 2 datasets, and a range of sparsity levels and random seeds. Our results show that there indeed exist sparse and robust subnetworks, which are competitive with the debiased full VLP and clearly outperform the debiasing SoTAs with fewer parameters on the OOD datasets VQA-CP v2 and VQA-VS. The code can be found at https://github.com/PhoebusSi/Compress-Robust-VQA.
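The abstract describes obtaining sparse subnetworks at a target sparsity level. As a minimal illustration of the general idea (not the paper's actual search procedure, which combines specific training and compression pipelines), the following sketch shows magnitude-based pruning: keeping only the largest-magnitude weights of a layer via a binary mask. All names here are illustrative.

```python
import numpy as np

def magnitude_prune_mask(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a binary mask that keeps the largest-magnitude weights.

    sparsity: fraction of weights to zero out (e.g. 0.5 removes half).
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to prune
    if k == 0:
        return np.ones_like(weights)
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))          # stand-in for one layer's weights
mask = magnitude_prune_mask(w, 0.5)  # 50% sparsity
sub_w = w * mask                     # the "subnetwork": pruned weights zeroed
```

In the paper's setting, such masks would be assigned per modality-specific module (e.g. different sparsity for the vision and language branches), and the surviving weights trained or fine-tuned with a debiasing objective.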