Paper Title
Emotion Separation and Recognition from a Facial Expression by Generating the Poker Face with Vision Transformers
Paper Authors
Paper Abstract
Representation learning and feature disentanglement have garnered significant research interest in the field of facial expression recognition (FER). The inherent ambiguity of emotion labels poses challenges for conventional supervised representation learning methods. Moreover, directly learning the mapping from a facial expression image to an emotion label lacks explicit supervision signals for capturing fine-grained facial features. In this paper, we propose a novel FER model, named Poker Face Vision Transformer (PF-ViT), to address these challenges. PF-ViT aims to separate and recognize the disturbance-agnostic emotion in a static facial image by generating its corresponding poker face, without the need for paired images. Inspired by the Facial Action Coding System, we regard an expressive face as the combined result of a set of facial muscle movements on one's poker face (i.e., an emotionless face). PF-ViT utilizes vanilla Vision Transformers, and its components are first pre-trained as Masked Autoencoders on a large facial expression dataset without emotion labels, yielding excellent representations. Subsequently, we train PF-ViT within a GAN framework. During training, the auxiliary task of poker face generation promotes disentanglement between the emotional and emotion-irrelevant components, guiding the FER model to holistically capture discriminative facial details. Quantitative and qualitative results demonstrate the effectiveness of our method, which surpasses state-of-the-art methods on four popular FER datasets.
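To make the disentanglement idea in the abstract concrete, here is a minimal, hypothetical PyTorch sketch: an encoder representation is split into an emotional component (sent to the recognition head) and an emotion-irrelevant component (sent to a generator that synthesizes the poker face). Every name, shape, and module choice below (PFViTSketch, the linear stand-ins for the ViT encoder and generator, the 64x64 grayscale input) is an illustrative assumption, not the authors' actual architecture; the GAN discriminator and MAE pre-training stage are omitted.

```python
# A minimal sketch of the PF-ViT disentanglement idea, under simplifying
# assumptions: linear layers stand in for the pre-trained ViT encoder and
# the poker-face generator, and the GAN discriminator is omitted.
import torch
import torch.nn as nn

class PFViTSketch(nn.Module):
    def __init__(self, dim=256, num_emotions=7):
        super().__init__()
        # Stand-in for the ViT encoder (pre-trained as a Masked Autoencoder
        # in the paper); maps a flattened 64x64 image to a 2*dim vector.
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, dim * 2))
        # Generator: rebuilds a "poker face" from the emotion-irrelevant part.
        self.generator = nn.Sequential(
            nn.Linear(dim, 64 * 64), nn.Unflatten(1, (1, 64, 64))
        )
        # Recognition head over the emotional component.
        self.classifier = nn.Linear(dim, num_emotions)

    def forward(self, x):
        h = self.encoder(x)
        # Split the representation into emotional / emotion-irrelevant halves.
        emotional, irrelevant = h.chunk(2, dim=1)
        poker_face = self.generator(irrelevant)  # auxiliary generation task
        logits = self.classifier(emotional)      # FER prediction
        return logits, poker_face

# Usage: during training, the generated poker face would be scored by a GAN
# discriminator (not shown), while the classifier supplies the FER signal.
model = PFViTSketch()
images = torch.randn(8, 1, 64, 64)
logits, poker = model(images)
print(logits.shape, poker.shape)  # torch.Size([8, 7]) torch.Size([8, 1, 64, 64])
```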