Paper Title

iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition

Authors

Yixuan Wei, Yue Cao, Zheng Zhang, Zhuliang Yao, Zhenda Xie, Han Hu, Baining Guo

Abstract

Image classification, which classifies images by pre-defined categories, has been the dominant approach to visual representation learning over the last decade. Visual learning through image-text alignment, however, has emerged to show promising performance, especially for zero-shot recognition. We believe that these two learning tasks are complementary, and suggest combining them for better visual learning. We propose a deep fusion method with three adaptations that effectively bridge the two learning tasks, rather than shallow fusion through naive multi-task learning. First, we replace the previous common practice in image classification, a linear classifier, with a cosine classifier, which shows comparable performance. Second, we convert the image classification problem from learning parametric category classifier weights to learning a text encoder as a meta network that generates the category classifier weights. The learnt text encoder is shared between image classification and image-text alignment. Third, we enrich each class name with a description to avoid confusion between classes and to bring the classification method closer to image-text alignment. We show that this deep fusion approach performs better than the individual learning or shallow fusion approaches on a variety of visual recognition tasks and setups, from zero-shot/few-shot image classification, such as the Kornblith 12-dataset benchmark, to downstream tasks of action recognition, semantic segmentation, and object detection in fine-tuning and open-vocabulary settings. The code will be available at https://github.com/weiyx16/iCAR.
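The first two adaptations above can be illustrated compactly. The sketch below is not the authors' implementation; it is a minimal numpy illustration, where `toy_text_encoder` is a hypothetical stand-in for the real shared text encoder and the temperature value is an assumption. The key idea it shows: logits are scaled cosine similarities, and the class weight matrix is produced from class texts rather than learned as free parameters.

```python
import numpy as np

def cosine_classifier(image_feats, class_weights, tau=0.07):
    """Cosine classifier: logits are temperature-scaled cosine
    similarities between L2-normalized image features and class weights."""
    img = image_feats / np.linalg.norm(image_feats, axis=-1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=-1, keepdims=True)
    return (img @ w.T) / tau

def toy_text_encoder(class_texts, dim=8, seed=0):
    """Hypothetical stand-in for the shared text encoder: in iCAR the
    classifier weights are generated from each class name plus a short
    description, not stored as independent parameters."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(class_texts), dim))

# Class names enriched with descriptions (third adaptation).
texts = ["cat, a small domesticated feline",
         "dog, a domesticated canine companion"]
weights = toy_text_encoder(texts)

# An image feature close to the "cat" weight should get the "cat" logit.
feats = weights[0:1] + 0.1
logits = cosine_classifier(feats, weights)
pred = int(logits.argmax(axis=-1)[0])
```

Because both sides are normalized, the same text encoder can serve image classification (fixed class texts) and image-text alignment (free-form captions) with one scoring rule.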
