Paper Title

MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis

Paper Authors

Liwen Xu, Zhengtao Wang, Bin Wu, Simon Lui

Paper Abstract

Visual Emotion Analysis (VEA) is attracting increasing attention. One of the biggest challenges of VEA is to bridge the affective gap between the visual cues in a picture and the emotion the picture expresses. As the granularity of emotions increases, the affective gap increases as well. Existing deep approaches try to bridge the gap by directly learning discrimination among emotions globally in one shot, without considering the hierarchical relationship among emotions at different affective levels or the affective level of the emotions to be classified. In this paper, we present the Multi-level Dependent Attention Network (MDAN) with two branches to leverage the emotion hierarchy and the correlation between different affective levels and semantic levels. The bottom-up branch directly learns emotions at the highest affective level and strictly follows the emotion hierarchy while predicting emotions at lower affective levels. In contrast, the top-down branch attempts to disentangle the affective gap through a one-to-one mapping between semantic levels and affective levels, namely Affective Semantic Mapping. At each semantic level, a local classifier learns discrimination among emotions at the corresponding affective level. We then integrate global learning and local learning into a unified deep framework and optimize the whole network simultaneously. Moreover, to properly extract and leverage channel dependencies and spatial attention while disentangling the affective gap, we carefully design two attention modules: the Multi-head Cross Channel Attention module and the Level-dependent Class Activation Map module. The proposed deep framework obtains new state-of-the-art performance on six VEA benchmarks, outperforming existing state-of-the-art methods by a large margin, e.g., +3.85% in 25-class classification accuracy on the WEBEmo dataset.
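The abstract names the architectural pieces but gives no implementation details. The PyTorch sketch below is only a rough illustration of two of the described ideas: per-level local classifiers tied one-to-one to semantic levels (the top-down "Affective Semantic Mapping"), and a bottom-up global classifier at the finest affective level whose coarser predictions are obtained by aggregating fine-class probabilities along the emotion hierarchy. Everything here is an assumption for illustration, not the authors' code: the Mikels 8-emotion / 2-polarity hierarchy, the pooled feature dimensions, and names such as MultiLevelEmotionHead are made up, and the two attention modules (Multi-head Cross Channel Attention, Level-dependent Class Activation Map) are omitted.

```python
# Minimal sketch of a two-branch, multi-level emotion head (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed two-level emotion hierarchy: Mikels' 8 categories grouped into 2 polarities.
# (The paper also reports results on finer hierarchies, e.g. 25-class WEBEmo.)
FINE_EMOTIONS = ["amusement", "awe", "contentment", "excitement",   # positive
                 "anger", "disgust", "fear", "sadness"]             # negative
FINE_TO_COARSE = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])            # fine index -> coarse index


class MultiLevelEmotionHead(nn.Module):
    """Illustrative two-branch head on top of pooled backbone features."""

    def __init__(self, feat_dims=(1024, 2048), level_classes=(2, 8)):
        super().__init__()
        # Top-down branch: one local classifier per semantic level, mapped
        # one-to-one to an affective level (shallow -> coarse, deep -> fine).
        self.local_heads = nn.ModuleList(
            nn.Linear(d, c) for d, c in zip(feat_dims, level_classes)
        )
        # Bottom-up branch: a global classifier at the finest affective level.
        self.global_head = nn.Linear(feat_dims[-1], level_classes[-1])

    def forward(self, feats):
        # feats: list of pooled features, one per semantic level, each (B, feat_dims[i]).
        local_logits = [head(f) for head, f in zip(self.local_heads, feats)]

        fine_logits = self.global_head(feats[-1])
        fine_probs = F.softmax(fine_logits, dim=1)
        # Strictly follow the hierarchy: a coarse emotion's probability is the
        # sum of the probabilities of its child fine-level emotions.
        coarse_probs = fine_probs.new_zeros(fine_probs.size(0), 2)
        coarse_probs.index_add_(1, FINE_TO_COARSE.to(fine_probs.device), fine_probs)
        return local_logits, fine_probs, coarse_probs


if __name__ == "__main__":
    head = MultiLevelEmotionHead()
    # Stand-ins for pooled backbone features at two semantic levels.
    feats = [torch.randn(4, 1024), torch.randn(4, 2048)]
    local_logits, fine_probs, coarse_probs = head(feats)
    print([t.shape for t in local_logits], fine_probs.shape, coarse_probs.shape)
```

In this reading, the local heads and the global head would each receive a classification loss, so the global and local learning described in the abstract can be optimized jointly; how the branches are actually fused and supervised in MDAN is detailed in the paper itself.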
