Paper Title
Fully Synthetic Data Improves Neural Machine Translation with Knowledge Distillation
Paper Authors
Paper Abstract
This paper explores augmenting monolingual data for knowledge distillation in neural machine translation. Source language monolingual text can be incorporated as a forward translation. Interestingly, we find the best way to incorporate target language monolingual text is to translate it to the source language and round-trip translate it back to the target language, resulting in a fully synthetic corpus. We find that combining monolingual data from both source and target languages yields better performance than a corpus twice as large only in one language. Moreover, experiments reveal that the improvement depends upon the provenance of the test set. If the test set was originally in the source language (with the target side written by translators), then forward translating source monolingual data matters. If the test set was originally in the target language (with the source written by translators), then incorporating target monolingual data matters.
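The abstract describes a concrete data-construction pipeline: source monolingual text is forward translated, while target monolingual text is round-tripped through the source language so that both sides of the pair are synthetic. The sketch below is only a rough illustration of that pipeline, not code from the paper; the teacher_src2tgt and teacher_tgt2src functions are hypothetical stand-ins for whatever teacher translation systems are used.

```python
# Minimal sketch (assumptions noted): build a fully synthetic distillation
# corpus from monolingual data on both sides, as described in the abstract.
# `teacher_src2tgt` and `teacher_tgt2src` are hypothetical placeholders for
# teacher NMT systems; they are not APIs defined by the paper.

from typing import Callable, List, Tuple

def build_synthetic_corpus(
    src_mono: List[str],
    tgt_mono: List[str],
    teacher_src2tgt: Callable[[List[str]], List[str]],
    teacher_tgt2src: Callable[[List[str]], List[str]],
) -> List[Tuple[str, str]]:
    """Return (source, target) pairs for distillation training."""
    pairs: List[Tuple[str, str]] = []

    # Source monolingual text: forward translate to get synthetic targets.
    pairs.extend(zip(src_mono, teacher_src2tgt(src_mono)))

    # Target monolingual text: translate to the source language, then
    # round-trip back to the target language, so both sides are synthetic.
    synthetic_src = teacher_tgt2src(tgt_mono)
    pairs.extend(zip(synthetic_src, teacher_src2tgt(synthetic_src)))

    return pairs
```

Under this reading, the student model never sees the original target monolingual sentences directly; it only trains on teacher output, which is the sense in which the corpus is "fully synthetic."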