Paper Title
StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis
Paper Authors
Paper Abstract
Text-to-Speech (TTS) has recently seen great progress in synthesizing high-quality speech owing to the rapid development of parallel TTS systems, but producing speech with naturalistic prosodic variations, speaking styles, and emotional tones remains challenging. Moreover, since duration and speech are generated separately, parallel TTS models still have problems finding the best monotonic alignments, which are crucial for naturalistic speech synthesis. Here, we propose StyleTTS, a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance. With a novel Transferable Monotonic Aligner (TMA) and a duration-invariant data augmentation scheme, our method significantly outperforms state-of-the-art models on both single-speaker and multi-speaker datasets in subjective tests of speech naturalness and speaker similarity. Through self-supervised learning of speaking styles, our model can synthesize speech with the same prosody and emotional tone as any given reference speech, without the need to explicitly label these categories.
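
The abstract describes three moving parts: a style vector extracted from a reference utterance, a hard monotonic alignment that expands each text symbol by a predicted duration, and a decoder modulated by that style. The following is a minimal PyTorch sketch of such a style-conditioned inference path; every module name, dimension, and the AdaIN-style conditioning are illustrative assumptions chosen for exposition, not the paper's actual architecture or code.

# Minimal sketch of style-conditioned parallel TTS inference, loosely
# following the pipeline described in the abstract. All module names,
# dimensions, and the AdaIN-style conditioning are assumptions.
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Pools a reference mel-spectrogram into a fixed-size style vector."""
    def __init__(self, n_mels=80, style_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, style_dim, kernel_size=5, padding=2),
        )

    def forward(self, ref_mel):             # ref_mel: (B, n_mels, T_ref)
        h = self.conv(ref_mel)              # (B, style_dim, T_ref)
        return h.mean(dim=2)                # temporal average pool -> (B, style_dim)

class AdaIN1d(nn.Module):
    """Adaptive instance norm: the style vector sets per-channel scale/shift."""
    def __init__(self, channels, style_dim):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(style_dim, channels * 2)

    def forward(self, x, s):                # x: (B, C, T), s: (B, style_dim)
        gamma, beta = self.affine(s).chunk(2, dim=1)
        return (1 + gamma.unsqueeze(2)) * self.norm(x) + beta.unsqueeze(2)

class TinyStyleTTS(nn.Module):
    def __init__(self, n_symbols=100, hidden=256, style_dim=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, hidden)
        self.style_enc = StyleEncoder(n_mels, style_dim)
        self.dur_proj = nn.Linear(hidden + style_dim, 1)  # style-aware durations
        self.adain = AdaIN1d(hidden, style_dim)
        self.to_mel = nn.Conv1d(hidden, n_mels, kernel_size=1)

    def forward(self, text_ids, ref_mel):   # assumes batch size 1 for brevity
        s = self.style_enc(ref_mel)                        # (1, style_dim)
        h = self.embed(text_ids)                           # (1, N, hidden)
        # Predict a positive integer duration per symbol, conditioned on style.
        dur_in = torch.cat([h, s.unsqueeze(1).expand(-1, h.size(1), -1)], dim=2)
        dur = self.dur_proj(dur_in).squeeze(2).exp().round().clamp(min=1).long()
        # Expand each symbol by its duration: a hard monotonic alignment.
        frames = torch.repeat_interleave(h[0], dur[0], dim=0).unsqueeze(0)
        frames = frames.transpose(1, 2)                    # (1, hidden, T)
        frames = self.adain(frames, s)                     # inject speaking style
        return self.to_mel(frames)                         # (1, n_mels, T)

# Usage: synthesize a mel-spectrogram whose prosody follows the reference clip.
model = TinyStyleTTS()
text = torch.randint(0, 100, (1, 12))       # token IDs for the input text
ref = torch.randn(1, 80, 200)               # mel of a reference utterance
mel = model(text, ref)
print(mel.shape)                            # torch.Size([1, 80, T])

The design point this sketch illustrates is why a reference clip can steer prosody without labels: the pooled style vector is the only pathway carrying speaker and prosody information into the duration predictor and decoder, so under self-supervised training the model must read speaking style and emotional tone from it implicitly.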