Paper Title
Attention-based Region of Interest (ROI) Detection for Speech Emotion Recognition
Paper Authors
Paper Abstract
Automatic emotion recognition for real-life applications is a challenging task. Human emotion expressions are subtle, and can be conveyed by a combination of several emotions. In most existing emotion recognition studies, each audio utterance/video clip is labelled/classified in its entirety. However, utterance/clip-level labelling and classification can be too coarse to capture the subtle intra-utterance/clip temporal dynamics. For example, an utterance/video clip usually contains only a few emotion-salient regions and many emotionless regions. In this study, we propose to use an attention mechanism in deep recurrent neural networks to detect the Regions-of-Interest (ROI) that are more emotionally salient in human emotional speech/video, and to further estimate the temporal emotion dynamics by aggregating those emotionally salient regions-of-interest. We compare the ROIs from audio and video and analyse them. We compare the performance of the proposed attention networks with state-of-the-art LSTM models on the multi-class classification task of recognizing six basic human emotions, and the proposed attention models exhibit significantly better performance. Furthermore, the attention weight distribution can be used to interpret how an utterance can be expressed as a mixture of possible emotions.
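The core mechanism the abstract describes, scoring each time frame for emotional salience and aggregating frames by their attention weights into an utterance-level representation, can be sketched as follows. This is a minimal NumPy illustration of generic attention pooling, not the paper's exact architecture: the frame features would in practice come from a recurrent (e.g. LSTM) encoder, and the scoring vector `w` is a hypothetical learned parameter.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(frame_feats, w):
    """Aggregate frame-level features into one utterance vector.

    frame_feats: (T, D) array of per-frame features (e.g. LSTM outputs).
    w: (D,) hypothetical learned scoring vector.
    Returns the weighted sum (D,) and the attention weights (T,).
    High-weight frames play the role of emotion-salient ROIs.
    """
    scores = frame_feats @ w           # one salience score per frame, shape (T,)
    alpha = softmax(scores)            # attention weights, non-negative, sum to 1
    utterance_vec = alpha @ frame_feats  # weighted aggregation, shape (D,)
    return utterance_vec, alpha

# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
T, D = 8, 4                            # 8 frames, 4-dim features (illustrative sizes)
feats = rng.normal(size=(T, D))
w = rng.normal(size=D)
vec, alpha = attention_pool(feats, w)
```

Because `alpha` is a proper distribution over frames, inspecting it directly supports the interpretability claim in the abstract: frames with large weights are the regions the model treats as emotionally salient.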