Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text

Paper title

Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text

Authors

Marwa Gaser, Manuel Mager, Injy Hamed, Nizar Habash, Slim Abdennadher, Ngoc Thang Vu

Abstract

Data sparsity is one of the main challenges posed by code-switching (CS), and it is further exacerbated in the case of morphologically rich languages. For machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effectiveness of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide a detailed analysis, examining a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform best in segmentation tasks but underperform in MT. Nevertheless, we find that the choice of segmentation setup for MT is highly dependent on the data size. For extreme low-resource scenarios, a combination of frequency-based and morphology-based segmentation is shown to perform best. For more resourced settings, such a combination does not bring significant improvements over frequency-based segmentation alone.
