Paper Title
Multiple Segmentations of Thai Sentences for Neural Machine Translation
Paper Authors
Abstract
Thai is a low-resource language, so data is often not available in sufficient quantities to train a Neural Machine Translation (NMT) model that performs at a high level of quality. In addition, the Thai script does not use white space to delimit the boundaries between words, which adds complexity when building sequence-to-sequence models. In this work, we explore how to augment a set of English--Thai parallel data by replicating sentence pairs with different word segmentation methods on the Thai side, for use as training data for NMT models. Using different numbers of merge operations in Byte Pair Encoding, different segmentations of Thai sentences can be obtained. The experiments show that combining these datasets improves the performance of NMT models relative to training on a dataset that has been split with a supervised segmentation tool.
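To illustrate the core idea in the abstract, the sketch below is a toy Byte Pair Encoding learner applied to unsegmented Thai text: applying different numbers of learned merge operations to the same sentence yields different segmentations. This is a minimal, self-contained illustration with an invented toy corpus, not the paper's actual pipeline (which would use a full BPE toolkit such as subword-nmt or SentencePiece on real parallel data).

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return tuple(out)

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merge operations from a list of raw strings.

    Since Thai script has no spaces between words, each sentence is treated
    as one unbroken character sequence rather than a list of space-separated words.
    """
    seqs = [tuple(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges

def segment(sentence, merges):
    """Segment one sentence by applying a learned merge list in order."""
    seq = tuple(sentence)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return list(seq)

# Hypothetical toy "corpus" of unsegmented Thai sentences.
corpus = ["ฉันรักภาษาไทย", "ภาษาไทยสวยงาม", "ฉันเรียนภาษาไทย"]
merges = learn_bpe(corpus, 10)

# Different merge counts give different segmentations of the same sentence;
# the paper replicates each sentence pair under such variants to augment
# the training data.
for k in (0, 5, 10):
    print(k, segment("ฉันรักภาษาไทย", merges[:k]))
```

Each replicated variant keeps the English side fixed and changes only the Thai tokenization, so the NMT model sees the same translation under several subword granularities.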