VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech

Paper title

VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech

Authors

Kun Zhou, Berrak Sisman, Haizhou Li

Abstract

Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. In this paper, we study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN). We propose a speaker-dependent EVC framework based on VAW-GAN that includes two VAW-GAN pipelines, one for spectrum conversion and another for prosody conversion. We train a spectral encoder that disentangles emotion and prosody (F0) information from spectral features; we also train a prosodic encoder that disentangles emotion modulation of prosody (affective prosody) from linguistic prosody. At run-time, the decoder of the spectral VAW-GAN is conditioned on the output of the prosodic VAW-GAN. The vocoder takes the converted spectral and prosodic features to generate the target emotional speech. Experiments validate the effectiveness of our proposed method in both objective and subjective evaluations.
