Paper Title
OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network
Paper Authors
Paper Abstract
The advancement of object detection (OD) in open-vocabulary and open-world scenarios is a critical challenge in computer vision. This work introduces OmDet, a novel language-aware object detection architecture, and an innovative training mechanism that harnesses continual learning and multi-dataset vision-language pre-training. Leveraging natural language as a universal knowledge representation, OmDet accumulates a "visual vocabulary" from diverse datasets, unifying the task as a language-conditioned detection framework. Our multimodal detection network (MDN) overcomes the challenges of multi-dataset joint training and generalizes to numerous training datasets without manual label taxonomy merging. We demonstrate superior performance of OmDet over strong baselines in object detection in the wild, open-vocabulary detection, and phrase grounding, achieving state-of-the-art results. Ablation studies reveal the impact of scaling the pre-training visual vocabulary, indicating a promising direction for further expansion to larger datasets. The effectiveness of our deep fusion approach is underscored by its ability to learn jointly from multiple datasets, enhancing performance through knowledge sharing.