Paper Title
El Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks
Paper Authors
Paper Abstract
Pre-training large-scale language models (LMs) requires huge amounts of text corpora. LMs for English enjoy ever-growing corpora of diverse language resources. However, less-resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case implies using machine translation of English corpora to a target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing leads to improved performance and overall LM robustness. In the empirical evaluation, we perform a comparison of directly translated against curated Spanish SQuAD datasets on both user and system levels. Further experimental results on the XQuAD and MLQA transfer-learning question answering evaluation tasks suggest that multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.
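The abstract reports comparisons in terms of the exact match (EM) score. As a point of reference, the sketch below shows a standard SQuAD-style EM computation (answer normalization followed by string comparison); the specific normalization rules (lowercasing, punctuation removal, English article stripping) follow the common SQuAD evaluation script and are an assumption here, not the paper's own evaluation code.

```python
# Minimal sketch of a SQuAD-style exact match (EM) metric.
# The normalization below mirrors the common English SQuAD evaluation script;
# a Spanish adaptation might instead strip Spanish articles ("el", "la", ...).
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gold) for gold in gold_answers))


if __name__ == "__main__":
    # Toy check: casing and punctuation differences do not break the match.
    print(exact_match("El Cairo.", ["el Cairo"]))  # 1.0
```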