Simple Open-Vocabulary Object Detection with Vision Transformers

Paper Title

Simple Open-Vocabulary Object Detection with Vision Transformers

Authors

Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby

Abstract

Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
