Paper Title


DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

Paper Authors

Yanqing Liu, Ruiqing Xue, Lei He, Xu Tan, Sheng Zhao

Paper Abstract


Current text-to-speech (TTS) systems usually leverage a cascaded acoustic model and vocoder pipeline with mel-spectrograms as the intermediate representations, which suffers from two limitations: 1) the acoustic model and vocoder are separately trained instead of jointly optimized, which incurs cascaded errors; 2) the intermediate speech representations (e.g., mel-spectrograms) are pre-designed and lose phase information, which is sub-optimal. To solve these problems, in this paper we develop DelightfulTTS 2, a new end-to-end speech synthesis system with automatically learned speech representations and a jointly optimized acoustic model and vocoder. Specifically, 1) we propose a new codec network based on vector-quantized auto-encoders with adversarial training (VQ-GAN) to extract intermediate frame-level speech representations (instead of traditional representations like mel-spectrograms) and reconstruct the speech waveform; 2) we jointly optimize the acoustic model (based on DelightfulTTS) and the vocoder (the decoder of the VQ-GAN), with an auxiliary loss on the acoustic model to predict the intermediate speech representations. Experiments show that DelightfulTTS 2 achieves a CMOS gain of +0.14 over DelightfulTTS, and further method analyses verify the effectiveness of the developed system.
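The core idea of the VQ-GAN codec is that the encoder's continuous frame-level outputs are snapped to the nearest entries of a learned codebook, yielding discrete representations that the acoustic model can be trained to predict. The minimal sketch below illustrates only that nearest-neighbor quantization step with NumPy; the function name, the toy codebook, and the frame values are illustrative assumptions, not the paper's actual codec.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Nearest-neighbor vector quantization (illustrative sketch).

    Maps each frame-level encoding z[t] to its closest codebook entry
    under squared L2 distance, returning the quantized vectors and the
    discrete codebook indices. In a real VQ-GAN the codebook is learned
    and gradients flow through a straight-through estimator; both are
    omitted here.
    """
    # Pairwise squared distances: (frames, codebook_size)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # one discrete token per frame
    z_q = codebook[indices]          # quantized frame-level representation
    return z_q, indices

# Toy example: 4 frames of 2-dim encodings, a codebook of 3 entries.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, -1.1], [0.0, 0.2]])
z_q, idx = vector_quantize(z, codebook)
```

In the full system, the acoustic model's auxiliary loss would regress toward `z_q` (or classify `idx`), while the VQ-GAN decoder reconstructs the waveform from the quantized frames.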
