Paper Title
FERV39k: A Large-Scale Multi-Scene Dataset for Facial Expression Recognition in Videos
Paper Authors
Paper Abstract
Current benchmarks for facial expression recognition (FER) mainly focus on static images, while datasets for FER in videos remain limited. It is therefore unclear whether the performance of existing methods remains satisfactory in real-world, application-oriented scenes. For example, a high-intensity "Happy" expression in a Talk-Show is more discriminative than the same expression with low intensity in an Official-Event. To fill this gap, we build a large-scale multi-scene dataset, coined FERV39k. We analyze the important ingredients of constructing such a novel dataset from three aspects: (1) multi-scene hierarchy and expression classes, (2) generation of candidate video clips, and (3) a trusted manual labeling process. Following these guidelines, we select 4 scenarios subdivided into 22 scenes, annotate 86k samples automatically obtained from 4k videos via a well-designed workflow, and finally build 38,935 video clips labeled with the 7 classic expressions. We also provide experimental benchmarks on four kinds of baseline frameworks, further analyze their performance across different scenes, and point out several challenges for future research. In addition, we systematically investigate the key components of dynamic facial expression recognition (DFER) through ablation studies. The baseline frameworks and our project will be made available.
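The abstract does not name the four baseline frameworks, so as a reading aid, here is a minimal PyTorch sketch of one typical DFER baseline of the kind such benchmarks use: a per-frame 2D CNN encoder followed by an LSTM over the clip, classifying into the 7 classic expression classes. The class list, architecture, and input shapes are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a common video-FER (DFER) baseline: per-frame 2D CNN
# features summarized by an LSTM. This is an illustrative assumption, not
# the authors' exact benchmark architecture.
import torch
import torch.nn as nn
from torchvision.models import resnet18

# The 7 classic expression classes referenced in the abstract
# (the six basic emotions plus Neutral).
EXPRESSIONS = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]

class CnnLstmBaseline(nn.Module):
    def __init__(self, num_classes: int = len(EXPRESSIONS), hidden: int = 512):
        super().__init__()
        backbone = resnet18(weights=None)   # per-frame spatial encoder
        backbone.fc = nn.Identity()         # keep the 512-d pooled features
        self.backbone = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, channels, height, width)
        b, t, c, h, w = clips.shape
        feats = self.backbone(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (h_n, _) = self.lstm(feats)      # last hidden state summarizes the clip
        return self.head(h_n[-1])           # logits over the 7 expressions

if __name__ == "__main__":
    model = CnnLstmBaseline()
    dummy = torch.randn(2, 16, 3, 112, 112)  # 2 clips of 16 frames each
    print(model(dummy).shape)                # torch.Size([2, 7])
```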