Paper Title
BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization
Paper Authors
Abstract
We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to efficiently optimize these hyper-parameter values for a target speaker, we are able to perform adaptation with an average 30% improvement in speaker similarity over standard techniques. Results indicate, across multiple corpora, that BOFFIN TTS can learn to synthesize new speakers using less than ten minutes of audio, achieving the same naturalness as produced for the speakers used to train the base model.
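To make the idea of tuning fine-tuning hyper-parameters with Bayesian optimization concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say which hyper-parameters, surrogate model, or acquisition function BOFFIN TTS uses, so everything here is an assumption for illustration: a single hyper-parameter (the log learning rate), a Gaussian-process surrogate with a squared-exponential kernel, the expected-improvement acquisition, and a hypothetical `toy_adaptation_loss` function standing in for the expensive "fine-tune the base model, then measure validation loss" step.

```python
import math
import numpy as np


def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    solved = np.linalg.solve(k_tt, k_tq)            # K^{-1} k_*
    mean = solved.T @ y_train                       # k_*^T K^{-1} y
    var = 1.0 - np.sum(k_tq * solved, axis=0)       # prior variance is 1
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best):
    """Expected improvement for a minimization problem."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def toy_adaptation_loss(log_lr):
    """Hypothetical stand-in for fine-tuning plus validation.

    A real objective would fine-tune the TTS model with this setting
    and return a held-out loss; here it is a simple quadratic.
    """
    return (log_lr + 3.0) ** 2


def bayes_opt(objective, low, high, n_init=3, n_iter=10, seed=0):
    """Minimize `objective` on [low, high] with GP-based Bayesian optimization."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(low, high, 201)              # candidate settings
    xs = [float(x) for x in rng.uniform(low, high, size=n_init)]
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        x_arr, y_arr = np.array(xs), np.array(ys)
        # Standardize observations to suit the zero-mean, unit-variance prior.
        y_std = (y_arr - y_arr.mean()) / (y_arr.std() + 1e-9)
        mean, std = gp_posterior(x_arr, y_std, grid)
        ei = expected_improvement(mean, std, y_std.min())
        x_next = float(grid[np.argmax(ei)])         # most promising candidate
        xs.append(x_next)
        ys.append(objective(x_next))
    best = int(np.argmin(ys))
    return xs[best], ys[best]


if __name__ == "__main__":
    best_log_lr, best_loss = bayes_opt(toy_adaptation_loss, -6.0, 0.0)
    print(f"best log learning rate: {best_log_lr:.2f}, loss: {best_loss:.4f}")
```

In a real few-shot adaptation setting each call to the objective would be one full fine-tuning run on the target speaker's data. That cost is why a sample-efficient search such as Bayesian optimization is attractive compared with grid or random search.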