Paper Title

Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario

Paper Authors

Zexin Cai, Yaogen Yang, Ming Li

Paper Abstract

Modeling voices for multiple speakers and multiple languages in one text-to-speech system has been a long-standing challenge. This paper presents an extension of Tacotron2 to achieve bilingual multispeaker speech synthesis when the data for each language are limited. We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers. The two languages share the same phonemic representation for input, while the language attribute and the speaker identity are independently controlled by language tokens and speaker embeddings, respectively. In addition, we investigate the model's performance on cross-lingual synthesis, with and without a bilingual dataset during training. With the bilingual dataset, the model not only generates high-fidelity speech for all speakers in the language they speak, but also generates accented, yet fluent and intelligible speech for monolingual speakers in the non-native language. For example, the Mandarin speaker can speak English fluently. Furthermore, the model trained with the bilingual dataset is robust for code-switching text-to-speech, as shown in our results and provided samples (https://caizexin.github.io/mlms-syn-samples/index.html).
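
The abstract describes a conditioning scheme in which a shared phoneme inventory feeds one Tacotron2-style text encoder, while the language and the speaker identity are injected as separate embeddings. Below is a minimal, illustrative PyTorch sketch of one plausible wiring of that idea; the module name ConditionedEncoder, all dimensions, and the choice to concatenate the language token and speaker embedding onto every encoder frame are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch only: one way a shared phoneme encoder could be combined
# with independent language and speaker conditioning, as described in the abstract.
import torch
import torch.nn as nn


class ConditionedEncoder(nn.Module):
    def __init__(self, num_phonemes=100, num_languages=2, num_speakers=10,
                 enc_dim=512, lang_dim=8, spk_dim=64):
        super().__init__()
        # Shared phoneme embedding: English and Mandarin use one inventory.
        self.phoneme_emb = nn.Embedding(num_phonemes, enc_dim)
        # Stand-in for the Tacotron2 convolutional + BiLSTM text encoder.
        self.encoder = nn.LSTM(enc_dim, enc_dim // 2, batch_first=True,
                               bidirectional=True)
        # Language token and speaker identity are controlled independently.
        self.lang_emb = nn.Embedding(num_languages, lang_dim)
        self.spk_emb = nn.Embedding(num_speakers, spk_dim)

    def forward(self, phoneme_ids, lang_id, spk_id):
        # phoneme_ids: (batch, time); lang_id, spk_id: (batch,)
        x = self.phoneme_emb(phoneme_ids)
        enc_out, _ = self.encoder(x)          # (batch, time, enc_dim)
        t = enc_out.size(1)
        lang = self.lang_emb(lang_id).unsqueeze(1).expand(-1, t, -1)
        spk = self.spk_emb(spk_id).unsqueeze(1).expand(-1, t, -1)
        # The attention-based decoder would attend over encoder frames
        # augmented with both conditions.
        return torch.cat([enc_out, lang, spk], dim=-1)


if __name__ == "__main__":
    model = ConditionedEncoder()
    phonemes = torch.randint(0, 100, (2, 20))  # dummy phoneme IDs
    cond = model(phonemes, torch.tensor([0, 1]), torch.tensor([3, 7]))
    print(cond.shape)                          # (2, 20, 512 + 8 + 64)
```

Keeping the language token separate from the speaker embedding is what allows a monolingual speaker's voice to be paired with the non-native language at synthesis time, e.g. a Mandarin speaker producing fluent English or code-switched sentences.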
