VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network

Paper Title

VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network

Authors

Jinhyeok Yang, Junmo Lee, Youngik Kim, Hoonyoung Cho, Injung Kim

Abstract

We present a novel high-fidelity real-time neural vocoder called VocGAN. A recently developed GAN-based vocoder, MelGAN, produces speech waveforms in real-time. However, it often produces a waveform that is insufficient in quality or inconsistent with acoustic characteristics of the input mel spectrogram. VocGAN is nearly as fast as MelGAN, but it significantly improves the quality and consistency of the output waveform. VocGAN applies a multi-scale waveform generator and a hierarchically-nested discriminator to learn multiple levels of acoustic properties in a balanced way. It also applies the joint conditional and unconditional objective, which has shown successful results in high-resolution image synthesis. In experiments, VocGAN synthesizes speech waveforms 416.7x faster on a GTX 1080Ti GPU and 3.24x faster on a CPU than real-time. Compared with MelGAN, it also exhibits significantly improved quality in multiple evaluation metrics including mean opinion score (MOS) with minimal additional overhead. Additionally, compared with Parallel WaveGAN, another recently developed high-fidelity vocoder, VocGAN is 6.98x faster on a CPU and exhibits higher MOS.

Paper Title