Paper Title

ByT5 model for massively multilingual grapheme-to-phoneme conversion

Paper Authors

Jian Zhu, Cong Zhang, David Jurgens

Paper Abstract

In this study, we tackle massively multilingual grapheme-to-phoneme (G2P) conversion by implementing G2P models based on ByT5. We curated a G2P dataset from various sources covering around 100 languages and trained large-scale multilingual G2P models on it. We found that ByT5, which operates on byte-level inputs, significantly outperformed the token-based mT5 model on multilingual G2P. Pairwise comparisons with monolingual models in these languages suggest that multilingual ByT5 models generally lower the phone error rate by jointly learning from a variety of languages. The pretrained model can further benefit low-resource G2P through zero-shot prediction on unseen languages or by providing pretrained weights for finetuning, which helps the model converge to a lower phone error rate than training from randomly initialized weights. To facilitate future research on multilingual G2P, we make our code and pretrained multilingual G2P models available at: https://github.com/lingjzhu/CharsiuG2P.
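
For reference, the sketch below shows how such a pretrained ByT5 G2P model might be queried through the HuggingFace transformers seq2seq interface. It is a minimal sketch under stated assumptions: the checkpoint identifier and the "<lang-code>: word" input convention are modeled on the CharsiuG2P repository's conventions, not confirmed values, and should be checked against the repository before use.

```python
# Minimal inference sketch (assumes the HuggingFace `transformers` library).
from transformers import AutoTokenizer, T5ForConditionalGeneration

# ByT5 checkpoints reuse the byte-level tokenizer of google/byt5-small.
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained(
    "charsiu/g2p_multilingual_byT5_small_100"  # hypothetical checkpoint id; see the repo
)

# Each word is prefixed with a language tag so one model can serve ~100 languages
# (tag format is an assumption based on the repository's examples).
words = ["<eng-us>: hello", "<fra>: bonjour"]
inputs = tokenizer(words, padding=True, return_tensors="pt")

# Generate phone sequences as byte-level output strings and decode them.
outputs = model.generate(**inputs, num_beams=1, max_length=50)
phones = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(phones)
```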
