Paper Title
Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding
Paper Authors
Paper Abstract
On account of growing demands for personalization, the need for a so-called few-shot TTS system that clones a speaker's voice from only a few data samples is emerging. To address this issue, we propose Attentron, a few-shot TTS model that clones the voices of speakers unseen during training. It introduces two special encoders, each serving a different purpose. A fine-grained encoder extracts variable-length style information via an attention mechanism, and a coarse-grained encoder greatly stabilizes speech synthesis, circumventing unintelligible gibberish even when synthesizing speech for unseen speakers. In addition, the model can scale to an arbitrary number of reference audios to improve the quality of the synthesized speech. According to our experiments, including a human evaluation, the proposed model significantly outperforms state-of-the-art models in terms of speaker similarity and quality when generating speech for unseen speakers.
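To make the two-encoder idea concrete, below is a minimal sketch of the mechanism the abstract describes: a coarse-grained encoder that pools reference features into a single fixed-length speaker embedding, and a fine-grained encoder that attends from decoder-side queries over the concatenated frames of any number of reference audios, yielding a variable-length embedding. This is an illustrative assumption of the design, not the authors' exact architecture; all module names, dimensions, and the pooling/attention details here are hypothetical.

```python
# Illustrative sketch only; not the Attentron authors' exact architecture.
import torch
import torch.nn as nn

class CoarseGrainedEncoder(nn.Module):
    """Fixed-length speaker embedding: average over all reference frames,
    then project. Intended to provide a stable global conditioning signal."""
    def __init__(self, feat_dim: int, emb_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)

    def forward(self, refs: torch.Tensor) -> torch.Tensor:
        # refs: (batch, n_refs, frames, feat_dim) -> (batch, emb_dim)
        pooled = refs.mean(dim=(1, 2))
        return self.proj(pooled)

class FineGrainedEncoder(nn.Module):
    """Variable-length embedding: attention from decoder queries to every
    reference frame. Because reference frames are simply concatenated,
    the same module works with an arbitrary number of reference audios."""
    def __init__(self, feat_dim: int, query_dim: int, emb_dim: int):
        super().__init__()
        self.key = nn.Linear(feat_dim, emb_dim)
        self.value = nn.Linear(feat_dim, emb_dim)
        self.query = nn.Linear(query_dim, emb_dim)

    def forward(self, queries: torch.Tensor, refs: torch.Tensor) -> torch.Tensor:
        # queries: (batch, t_dec, query_dim); refs: (batch, n_refs, frames, feat_dim)
        b, n, f, d = refs.shape
        refs = refs.reshape(b, n * f, d)  # concatenate frames of all references
        q = self.query(queries)           # (b, t_dec, emb_dim)
        k = self.key(refs)                # (b, n*f, emb_dim)
        v = self.value(refs)
        scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return attn @ v                   # one style vector per decoder step
```

Under these assumptions, adding more reference audios simply lengthens the attention's key/value sequence, which is one plausible reading of how the model "can scale to an arbitrary number of reference audios."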