Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation

Paper title

Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation

Authors

Xabier Soto, Dimitar Shterionov, Alberto Poncelas, Andy Way

Abstract

Machine translation (MT) has benefited from using synthetic training data originating from translating monolingual corpora, a technique known as backtranslation. Combining backtranslated data from different sources has led to better results than when using such data in isolation. In this work we analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems. We use a real-world low-resource use-case (Basque-to-Spanish in the clinical domain) as well as a high-resource language pair (German-to-English) to test different scenarios with backtranslation and employ data selection to optimise the synthetic corpora. We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems. We further tune the data selection method by taking into account the quality of the MT systems used for backtranslation and lexical diversity of the resulting corpora. Our experiments show that incorporating backtranslated data from different sources can be beneficial, and that availing of data selection can yield improved performance.
