Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators

Paper title

Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators

Authors

Ryuichi Yamamoto, Eunwoo Song, Min-Jae Hwang, Jae-Min Kim

Abstract

This paper proposes voicing-aware conditional discriminators for Parallel WaveGAN-based waveform synthesis systems. In this framework, we adopt a projection-based conditioning method that can significantly improve the discriminator's performance. Furthermore, the conventional discriminator is separated into two waveform discriminators for modeling voiced and unvoiced speech. As each discriminator learns the distinctive characteristics of the harmonic and noise components, respectively, the adversarial training process becomes more efficient, allowing the generator to produce more realistic speech waveforms. Subjective test results demonstrate the superiority of the proposed method over the conventional Parallel WaveGAN and WaveNet systems. In particular, our speaker-independently trained model within a FastSpeech 2 based text-to-speech framework achieves the mean opinion scores of 4.20, 4.18, 4.21, and 4.31 for four Japanese speakers, respectively.
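The split into voiced and unvoiced discriminators can be illustrated with a small sketch. It assumes a least-squares adversarial objective and per-sample voiced/unvoiced flags that mask each discriminator's output; neither detail is stated in the abstract, and the function names are hypothetical. The projection-based conditioning is not shown.

```python
from typing import Sequence


def masked_lsgan_loss(scores: Sequence[float], mask: Sequence[int], target: float) -> float:
    """Mean squared distance to `target` over the samples where mask == 1."""
    selected = [(s - target) ** 2 for s, m in zip(scores, mask) if m == 1]
    # An empty region (e.g. a fully unvoiced clip) contributes no loss.
    return sum(selected) / len(selected) if selected else 0.0


def voicing_aware_generator_loss(
    voiced_scores: Sequence[float],
    unvoiced_scores: Sequence[float],
    voiced_flags: Sequence[int],
) -> float:
    """Sum of the adversarial losses from the two discriminators.

    voiced_scores / unvoiced_scores: per-sample outputs of the two
    (hypothetical) discriminators for a generated waveform.
    voiced_flags: 1 where a sample is voiced, 0 where it is unvoiced.
    """
    unvoiced_flags = [1 - v for v in voiced_flags]
    # Assumption: the generator wants both discriminators to output 1 ("real").
    loss_v = masked_lsgan_loss(voiced_scores, voiced_flags, target=1.0)
    loss_uv = masked_lsgan_loss(unvoiced_scores, unvoiced_flags, target=1.0)
    return loss_v + loss_uv


if __name__ == "__main__":
    flags = [1, 1, 0, 0]  # first half voiced, second half unvoiced
    print(voicing_aware_generator_loss([1.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.5, 0.0], flags))
```

In this sketch each discriminator is scored only on the region it is meant to model, so the voiced discriminator never receives gradient from noise-like segments and vice versa.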
