Paper Title

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

Paper Authors

Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Jianwei Yang, Xiyang Dai, Bin Xiao, Haoxuan You, Shih-Fu Chang, Lu Yuan

Paper Abstract

Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space, yielding tremendous potential for vision-language (VL) tasks. While early concurrent works have begun to study this potential on a subset of tasks, important questions remain: 1) What is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in low-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches without impacting inference or pretraining complexity? In this work, we seek to answer these questions through two key contributions. First, we introduce an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data availability constraints and conditions of domain shift. Second, we propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures using a dynamically weighted objective applied to adaptively selected tokens per instance. Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models that are pretrained with image-text data only. On SNLI-VE, CLIP-TD produces significant gains in low-shot conditions (up to 6.6%) as well as fully-supervised conditions (up to 3%). On VQA, CLIP-TD provides improvements in low-shot conditions (up to 9%) and fully-supervised conditions (up to 1.3%). Finally, CLIP-TD outperforms concurrent works utilizing CLIP for finetuning, as well as baseline naive distillation approaches. Code will be made available.
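The abstract only describes the method at a high level: distill CLIP's token features into an existing VL model with a dynamically weighted objective over adaptively selected tokens. Below is a minimal, hypothetical PyTorch sketch of that idea; the function name `clip_td_loss`, the use of teacher feature norms as a token-saliency proxy, the `teacher_conf` instance weighting, and the `top_frac` selection ratio are all illustrative assumptions, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def clip_td_loss(student_tokens, teacher_tokens, teacher_conf, top_frac=0.25):
    """Hypothetical targeted-distillation loss in the spirit of CLIP-TD.

    student_tokens: (B, N, D) token features from the existing VL architecture
    teacher_tokens: (B, N, D) aligned token features from a frozen CLIP encoder
    teacher_conf:   (B,) per-instance teacher confidence used to weight the loss
    top_frac:       fraction of tokens selected per instance (assumed heuristic)
    """
    B, N, D = student_tokens.shape
    k = max(1, int(top_frac * N))

    # Adaptive token selection: score each token by the teacher's response
    # (here the L2 norm of the teacher feature, a stand-in saliency score),
    # then keep the top-k tokens per instance.
    token_scores = teacher_tokens.norm(dim=-1)             # (B, N)
    topk_idx = token_scores.topk(k, dim=1).indices         # (B, k)

    idx = topk_idx.unsqueeze(-1).expand(-1, -1, D)
    s_sel = student_tokens.gather(1, idx)                  # (B, k, D)
    t_sel = teacher_tokens.gather(1, idx)                  # (B, k, D)

    # Feature distillation on the selected tokens (cosine distance).
    per_token = 1.0 - F.cosine_similarity(s_sel, t_sel, dim=-1)  # (B, k)

    # Dynamic per-instance weighting by teacher confidence.
    weights = teacher_conf.softmax(dim=0)                  # (B,)
    return (per_token.mean(dim=1) * weights).sum()
```

Because the teacher is frozen and the loss touches only the student's existing token features, a term like this can be added to standard finetuning without changing inference-time architecture or pretraining cost, which matches the constraint posed in question 3 above.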
