Paper Title

Minority-Oriented Vicinity Expansion with Attentive Aggregation for Video Long-Tailed Recognition

Paper Authors

WonJun Moon, Hyun Seok Seong, Jae-Pil Heo

Paper Abstract

A dramatic increase in real-world video volume with extremely diverse and emerging topics naturally forms a long-tailed video distribution in terms of their categories, and it spotlights the need for Video Long-Tailed Recognition (VLTR). In this work, we summarize the challenges in VLTR and explore how to overcome them. The challenges are: (1) it is impractical to re-train the whole model for high-quality features, (2) acquiring frame-wise labels requires extensive cost, and (3) long-tailed data triggers biased training. Yet, most existing works for VLTR unavoidably utilize image-level features extracted from pretrained models which are task-irrelevant, and learn by video-level labels. Therefore, to deal with such (1) task-irrelevant features and (2) video-level labels, we introduce two complementary learnable feature aggregators. Learnable layers in each aggregator are to produce task-relevant representations, and each aggregator is to assemble the snippet-wise knowledge into a video representative. Then, we propose Minority-Oriented Vicinity Expansion (MOVE) that explicitly leverages the class frequency into approximating the vicinity distributions to alleviate (3) biased training. By combining these solutions, our approach achieves state-of-the-art results on large-scale VideoLT and synthetically induced Imbalanced-MiniKinetics200. With VideoLT features from ResNet-50, it attains 18% and 58% relative improvements on head and tail classes over the previous state-of-the-art method, respectively.
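
To make the two ideas in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch: an attention-weighted aggregator that pools frozen snippet-level features into a single video representation, and a toy vicinity-expansion step that perturbs the representation with noise whose scale grows as a class becomes rarer. The names (`AttentiveAggregator`, `minority_oriented_vicinity`, `hidden_dim`, `base_scale`) and the Gaussian-noise formulation are illustrative assumptions, not the paper's actual MOVE implementation or aggregator design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveAggregator(nn.Module):
    """Pools frozen snippet-level features into one video representation.

    A learnable projection produces task-relevant features; per-snippet
    attention scores weight each snippet before summation. Simplified sketch,
    not the paper's exact aggregator.
    """

    def __init__(self, feat_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)  # task-relevant projection
        self.score = nn.Linear(hidden_dim, 1)        # per-snippet attention score

    def forward(self, snippets: torch.Tensor) -> torch.Tensor:
        # snippets: (batch, num_snippets, feat_dim) image-level features
        h = F.relu(self.proj(snippets))
        attn = torch.softmax(self.score(h), dim=1)   # (batch, num_snippets, 1)
        return (attn * h).sum(dim=1)                 # (batch, hidden_dim)


def minority_oriented_vicinity(video_feat, labels, class_counts, base_scale=0.1):
    """Adds Gaussian noise whose scale is larger for rarer classes, so tail
    classes receive wider vicinity distributions. A toy approximation of the
    idea; the paper's MOVE formulation differs.
    """
    freq = class_counts[labels].float()              # per-sample class frequency
    scale = base_scale / torch.sqrt(freq)            # rarer class -> larger vicinity
    noise = torch.randn_like(video_feat) * scale.unsqueeze(1)
    return video_feat + noise


if __name__ == "__main__":
    torch.manual_seed(0)
    feats = torch.randn(4, 60, 2048)                 # 4 videos, 60 snippets, ResNet-50 dim
    labels = torch.tensor([0, 1, 2, 2])
    class_counts = torch.tensor([5000, 500, 20])     # head / medium / tail class sizes
    agg = AttentiveAggregator(feat_dim=2048)
    video_repr = agg(feats)                          # (4, 512)
    expanded = minority_oriented_vicinity(video_repr, labels, class_counts)
    print(video_repr.shape, expanded.shape)
```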
