Paper Title

Minority-Oriented Vicinity Expansion with Attentive Aggregation for Video Long-Tailed Recognition

Paper Authors

WonJun Moon, Hyun Seok Seong, Jae-Pil Heo

Paper Abstract

A dramatic increase in real-world video volume with extremely diverse and emerging topics naturally forms a long-tailed video distribution in terms of their categories, and it spotlights the need for Video Long-Tailed Recognition (VLTR). In this work, we summarize the challenges in VLTR and explore how to overcome them. The challenges are: (1) it is impractical to re-train the whole model for high-quality features, (2) acquiring frame-wise labels requires extensive cost, and (3) long-tailed data triggers biased training. Yet, most existing works for VLTR unavoidably utilize image-level features extracted from pretrained models which are task-irrelevant, and learn by video-level labels. Therefore, to deal with such (1) task-irrelevant features and (2) video-level labels, we introduce two complementary learnable feature aggregators. Learnable layers in each aggregator are to produce task-relevant representations, and each aggregator is to assemble the snippet-wise knowledge into a video representative. Then, we propose Minority-Oriented Vicinity Expansion (MOVE) that explicitly leverages the class frequency into approximating the vicinity distributions to alleviate (3) biased training. By combining these solutions, our approach achieves state-of-the-art results on large-scale VideoLT and synthetically induced Imbalanced-MiniKinetics200. With VideoLT features from ResNet-50, it attains 18% and 58% relative improvements on head and tail classes over the previous state-of-the-art method, respectively.
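
To make the two ideas in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch: an attention-weighted aggregator that pools frozen snippet-level features into a single video representation, and a toy vicinity-expansion step that perturbs the representation with noise whose scale grows as a class becomes rarer. The names (`AttentiveAggregator`, `minority_oriented_vicinity`, `hidden_dim`, `base_scale`) and the Gaussian-noise formulation are illustrative assumptions, not the paper's actual MOVE implementation or aggregator design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveAggregator(nn.Module):
    """Pools frozen snippet-level features into one video representation.

    A learnable projection produces task-relevant features; per-snippet
    attention scores weight each snippet before summation. Simplified sketch,
    not the paper's exact aggregator.
    """

    def __init__(self, feat_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)  # task-relevant projection
        self.score = nn.Linear(hidden_dim, 1)        # per-snippet attention score

    def forward(self, snippets: torch.Tensor) -> torch.Tensor:
        # snippets: (batch, num_snippets, feat_dim) image-level features
        h = F.relu(self.proj(snippets))
        attn = torch.softmax(self.score(h), dim=1)   # (batch, num_snippets, 1)
        return (attn * h).sum(dim=1)                 # (batch, hidden_dim)


def minority_oriented_vicinity(video_feat, labels, class_counts, base_scale=0.1):
    """Adds Gaussian noise whose scale is larger for rarer classes, so tail
    classes receive wider vicinity distributions. A toy approximation of the
    idea; the paper's MOVE formulation differs.
    """
    freq = class_counts[labels].float()              # per-sample class frequency
    scale = base_scale / torch.sqrt(freq)            # rarer class -> larger vicinity
    noise = torch.randn_like(video_feat) * scale.unsqueeze(1)
    return video_feat + noise


if __name__ == "__main__":
    torch.manual_seed(0)
    feats = torch.randn(4, 60, 2048)                 # 4 videos, 60 snippets, ResNet-50 dim
    labels = torch.tensor([0, 1, 2, 2])
    class_counts = torch.tensor([5000, 500, 20])     # head / medium / tail class sizes
    agg = AttentiveAggregator(feat_dim=2048)
    video_repr = agg(feats)                          # (4, 512)
    expanded = minority_oriented_vicinity(video_repr, labels, class_counts)
    print(video_repr.shape, expanded.shape)
```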
