Paper title
Robust MelGAN: A robust universal neural vocoder for high-fidelity TTS
Authors
Abstract
In the current two-stage neural text-to-speech (TTS) paradigm, it is desirable to have a universal neural vocoder that, once trained, is robust to the imperfect mel-spectrograms predicted by the acoustic model. To this end, we propose the Robust MelGAN vocoder, which solves the metallic sound problem of the original multi-band MelGAN and improves its generalization ability. Specifically, we introduce a fine-grained network dropout strategy to the generator. With a specifically designed over-smooth handler that separates the speech signal into periodic and aperiodic components, we apply network dropout only to the aperiodic components, which alleviates the metallic sound while maintaining good speaker similarity. To further improve generalization ability, we introduce several data augmentation methods to augment the fake data in the discriminator, including harmonic shift, harmonic noise, and phase noise. Experiments show that Robust MelGAN can be used as a universal vocoder, significantly improving sound quality in TTS systems built on various types of data.
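The abstract names phase noise as one of the augmentations applied to fake data but does not say how it is computed. As a rough illustration only, the sketch below shows one generic way to perturb phase while leaving the magnitude spectrum untouched. The function name `add_phase_noise`, the parameter `noise_std`, the whole-signal FFT, and the toy waveform are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def add_phase_noise(signal, noise_std=0.3, seed=0):
    """Hypothetical helper: add Gaussian noise to the phase of each
    frequency bin while keeping every bin's magnitude unchanged."""
    rng = np.random.default_rng(seed)
    n = len(signal)
    spectrum = np.fft.rfft(signal)

    # One random phase offset (in radians) per frequency bin.
    noise = rng.normal(0.0, noise_std, size=spectrum.shape)
    noise[0] = 0.0  # the DC bin of a real signal must stay real
    if n % 2 == 0:
        noise[-1] = 0.0  # so must the Nyquist bin for even lengths

    # Rotating each bin by exp(j * noise) changes phase, not magnitude.
    noisy_spectrum = spectrum * np.exp(1j * noise)
    return np.fft.irfft(noisy_spectrum, n=n)


# Toy stand-in for a generated ("fake") waveform: two harmonics at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
fake = 0.5 * np.sin(2 * np.pi * 220 * t) + 0.25 * np.sin(2 * np.pi * 440 * t)

augmented = add_phase_noise(fake)
```

In this sketch the augmented waveform has the same length and magnitude spectrum as the input but a different time-domain shape. A frame-wise (STFT-based) variant would be closer to how vocoders usually process audio, but that choice is likewise not specified in the abstract.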