Paper Title


DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

Paper Authors

Yanqing Liu, Ruiqing Xue, Lei He, Xu Tan, Sheng Zhao

Paper Abstract


Current text-to-speech (TTS) systems usually leverage a cascaded acoustic model and vocoder pipeline with mel-spectrograms as the intermediate representations, which suffers from two limitations: 1) the acoustic model and vocoder are separately trained instead of jointly optimized, which incurs cascaded errors; 2) the intermediate speech representations (e.g., mel-spectrograms) are pre-designed and lose phase information, which is sub-optimal. To solve these problems, in this paper we develop DelightfulTTS 2, a new end-to-end speech synthesis system with automatically learned speech representations and a jointly optimized acoustic model and vocoder. Specifically, 1) we propose a new codec network based on vector-quantized auto-encoders with adversarial training (VQ-GAN) to extract intermediate frame-level speech representations (instead of traditional representations like mel-spectrograms) and reconstruct the speech waveform; 2) we jointly optimize the acoustic model (based on DelightfulTTS) and the vocoder (the decoder of the VQ-GAN), with an auxiliary loss on the acoustic model to predict the intermediate speech representations. Experiments show that DelightfulTTS 2 achieves a CMOS gain of +0.14 over DelightfulTTS, and further method analyses verify the effectiveness of the developed system.
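The core idea of the VQ-GAN codec is that the encoder's continuous frame-level outputs are snapped to the nearest entries of a learned codebook, yielding discrete representations that the acoustic model can be trained to predict. The minimal sketch below illustrates only that nearest-neighbor quantization step with NumPy; the function name, the toy codebook, and the frame values are illustrative assumptions, not the paper's actual codec.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Nearest-neighbor vector quantization (illustrative sketch).

    Maps each frame-level encoding z[t] to its closest codebook entry
    under squared L2 distance, returning the quantized vectors and the
    discrete codebook indices. In a real VQ-GAN the codebook is learned
    and gradients flow through a straight-through estimator; both are
    omitted here.
    """
    # Pairwise squared distances: (frames, codebook_size)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # one discrete token per frame
    z_q = codebook[indices]          # quantized frame-level representation
    return z_q, indices

# Toy example: 4 frames of 2-dim encodings, a codebook of 3 entries.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, -1.1], [0.0, 0.2]])
z_q, idx = vector_quantize(z, codebook)
```

In the full system, the acoustic model's auxiliary loss would regress toward `z_q` (or classify `idx`), while the VQ-GAN decoder reconstructs the waveform from the quantized frames.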
