Paper Title

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Paper Authors

Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao

Paper Abstract

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information in the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting. Audio samples are available at https://GenerSpeech.github.io/
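To make the second component more concrete, below is a minimal PyTorch-style sketch of Mix-Style Layer Normalization as described in the abstract: a conditional layer norm whose scale and bias are predicted from a style vector, with style vectors mixed across a shuffled batch during training so the content representation cannot latch onto a consistent style cue. This is not the authors' implementation; the class name, the dimensions, and the Beta-distribution mixing weight are illustrative assumptions.

```python
# Hedged sketch of Mix-Style Layer Normalization (not the official GenerSpeech code).
import torch
import torch.nn as nn


class MixStyleLayerNorm(nn.Module):
    """Conditional layer norm with batch-level style mixing.

    `hidden_dim`, `style_dim`, and `alpha` are illustrative assumptions,
    not values taken from the paper.
    """

    def __init__(self, hidden_dim: int, style_dim: int, alpha: float = 0.1):
        super().__init__()
        # Plain normalization; scale/bias come from the style vector instead.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.scale = nn.Linear(style_dim, hidden_dim)  # predicts gamma(w)
        self.bias = nn.Linear(style_dim, hidden_dim)   # predicts beta(w)
        self.beta_dist = torch.distributions.Beta(alpha, alpha)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        if self.training:
            # Mix each style vector with one from a shuffled batch, so the
            # content encoder sees a perturbed, unreliable style signal.
            perm = torch.randperm(style.size(0), device=style.device)
            lam = self.beta_dist.sample((style.size(0), 1)).to(style.device)
            style = lam * style + (1.0 - lam) * style[perm]
        gamma = self.scale(style).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.bias(style).unsqueeze(1)
        return gamma * self.norm(x) + beta
```

In eval mode the mixing branch is skipped, so at inference the scale and bias are driven purely by the reference style embedding; the random mixing is a training-time regularizer intended to strip style information from the linguistic content path.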
