Paper Title
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation
Paper Authors
Paper Abstract
The recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g., CLIP) on a tremendous amount of image-text pair data has shown its superiority on various multimodal alignment tasks. Despite this success, the resulting models are not capable of multimodal generative tasks due to the weak text encoder. To tackle this problem, we propose to augment the dual-stream VLP model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD), enabling multimodal generation. VLKD is highly data- and computation-efficient compared to pre-training from scratch. Experimental results show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning. For example, it achieves 44.5% zero-shot accuracy on the VQAv2 dataset, surpassing the previous state-of-the-art zero-shot model while using $7\times$ fewer parameters. Furthermore, the original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
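At its core, knowledge distillation of this kind pulls the student model's (PLM's) representations toward the teacher's (CLIP's) image-text embedding space. The sketch below is a minimal, hypothetical illustration of one such alignment objective (mean `1 - cosine similarity` between paired teacher and student embeddings, on random stand-in vectors); it is not the authors' implementation, whose exact distillation objectives are described in the paper itself.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """L2-normalize embeddings along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def alignment_loss(teacher_emb, student_emb):
    """Mean (1 - cosine similarity) over paired embeddings.

    Minimizing this pulls the student (PLM) embeddings toward the
    teacher (CLIP) embedding space -- a simplified stand-in for a
    distillation alignment objective.
    """
    t = l2_normalize(teacher_emb)
    s = l2_normalize(student_emb)
    cos = np.sum(t * s, axis=-1)  # per-pair cosine similarity
    return float(np.mean(1.0 - cos))

# Toy example: a batch of 4 embedding pairs, 8 dimensions each.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
student = teacher + 0.1 * rng.normal(size=(4, 8))  # nearly aligned student

print(alignment_loss(teacher, teacher))  # ~0: identical embeddings
print(alignment_loss(teacher, student))  # small positive value
```

In practice the teacher's weights are frozen and only the student is updated, so gradients of this loss flow solely into the PLM; this is what keeps distillation cheap relative to pre-training from scratch.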