Paper Title
Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion
Paper Authors
Paper Abstract
Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. Prior studies on emotional voice conversion were mostly carried out under the assumption that emotion is speaker-dependent. We consider that there is a common code between speakers for emotional expression in spoken language; therefore, a speaker-independent mapping between emotional states is possible. In this paper, we propose a speaker-independent emotional voice conversion framework that can convert anyone's emotion without the need for parallel data. We propose a VAW-GAN-based encoder-decoder structure to learn the spectrum and prosody mapping. We perform prosody conversion by using the continuous wavelet transform (CWT) to model temporal dependencies. We also investigate the use of F0 as an additional input to the decoder to improve emotion conversion performance. Experiments show that the proposed speaker-independent framework achieves competitive results for both seen and unseen speakers.
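To make the prosody-modeling step concrete, below is a minimal NumPy-only sketch of decomposing an F0 contour with a continuous wavelet transform, using a Ricker (Mexican-hat) mother wavelet at dyadic scales. The function names, the synthetic contour, and the choice of scales are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def mexican_hat(t, scale):
    # Ricker (Mexican-hat) mother wavelet, dilated by `scale`.
    x = t / scale
    return (1.0 - x ** 2) * np.exp(-x ** 2 / 2.0)

def cwt_f0(f0, scales):
    """Illustrative CWT of an F0 contour: one coefficient track per scale,
    capturing prosodic variation at different temporal resolutions."""
    n = len(f0)
    t = np.arange(n) - n // 2                # centered time axis for the wavelet
    coeffs = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        wavelet = mexican_hat(t, s)
        # Correlate the contour with the dilated wavelet; 1/sqrt(s) normalizes energy.
        coeffs[i] = np.convolve(f0, wavelet, mode="same") / np.sqrt(s)
    return coeffs

# Toy F0 contour with slow (phrase-level) and fast (syllable-level) modulation.
frames = np.arange(200)
f0 = 120 + 20 * np.sin(2 * np.pi * frames / 200) + 5 * np.sin(2 * np.pi * frames / 20)
scales = [2, 4, 8, 16, 32]                   # dyadic scales, common for CWT-F0 modeling
coeffs = cwt_f0(np.log(f0), scales)
print(coeffs.shape)                          # (5, 200)
```

Each row of `coeffs` isolates F0 movement at one temporal resolution; a conversion model can then map these multi-scale tracks between emotional states instead of the raw contour.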