Paper Title
Zero-Shot Temporal Action Detection via Vision-Language Prompting
Paper Authors
Paper Abstract
Existing temporal action detection (TAD) methods rely on large training data including segment-level annotations, and are limited to recognizing only previously seen classes during inference. Collecting and annotating a large training set for each class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) resolves this obstacle by enabling a pre-trained model to recognize any unseen action classes. Meanwhile, ZS-TAD is much more challenging and has been significantly less investigated. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive method is to integrate an off-the-shelf proposal detector with CLIP-style classification. However, due to the sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). Such a novel design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that our STALE significantly outperforms state-of-the-art alternatives. In addition, our model yields superior results on supervised TAD over recent strong competitors. The PyTorch implementation of STALE is available at https://github.com/sauradip/STALE.
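As a rough illustration (not the authors' released code), the PyTorch sketch below shows the kind of parallel design the abstract argues for: a class-agnostic localization branch and a CLIP-style snippet classifier that share the same features, so no proposal step feeds the classifier and localization errors cannot propagate into classification. All names here (`ParallelZSTADHead`, `vis_proj`, `loc_head`) are hypothetical, and random tensors stand in for CLIP video snippet and text-prompt embeddings.

```python
# Minimal sketch, assuming CLIP-style feature spaces; not the STALE implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelZSTADHead(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=512):
        super().__init__()
        # Class-agnostic localization branch: per-snippet foreground score,
        # computed in parallel with (not before) classification.
        self.loc_head = nn.Conv1d(feat_dim, 1, kernel_size=3, padding=1)
        # Projection aligning video snippet features with the text-embedding space.
        self.vis_proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, snippet_feats, text_embeds):
        # snippet_feats: (B, T, feat_dim) video snippet features
        # text_embeds:   (C, embed_dim) frozen text embeddings of class prompts,
        #                e.g. "a video of {class}" encoded by a ViL model like CLIP
        v = F.normalize(self.vis_proj(snippet_feats), dim=-1)   # (B, T, D)
        t = F.normalize(text_embeds, dim=-1)                     # (C, D)
        cls_logits = 100.0 * v @ t.t()                           # (B, T, C) snippet-class scores
        fg_mask = torch.sigmoid(
            self.loc_head(snippet_feats.transpose(1, 2))
        ).squeeze(1)                                             # (B, T) foreground mask
        return cls_logits, fg_mask

if __name__ == "__main__":
    # Toy usage with random tensors standing in for pre-extracted features.
    head = ParallelZSTADHead()
    feats = torch.randn(2, 100, 512)    # 2 videos, 100 snippets each
    prompts = torch.randn(20, 512)      # 20 action-class prompt embeddings
    logits, mask = head(feats, prompts)
    print(logits.shape, mask.shape)     # torch.Size([2, 100, 20]) torch.Size([2, 100])
```

Because both branches read the same snippet features, detections can be formed by intersecting the foreground mask with the per-snippet class scores, rather than classifying pre-computed proposals; this is only meant to make the "parallel vs. sequential" contrast in the abstract concrete.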