Paper Title
Domain Mismatch Doesn't Always Prevent Cross-Lingual Transfer Learning
Paper Authors
Paper Abstract
Cross-lingual transfer learning without labeled target language data or parallel text has been surprisingly effective in zero-shot cross-lingual classification, question answering, unsupervised machine translation, etc. However, some recent publications have claimed that domain mismatch prevents cross-lingual transfer, and their results show that unsupervised bilingual lexicon induction (UBLI) and unsupervised neural machine translation (UNMT) do not work well when the underlying monolingual corpora come from different domains (e.g., French text from Wikipedia but English text from UN proceedings). In this work, we show that a simple initialization regimen can overcome much of the effect of domain mismatch in cross-lingual transfer. We pre-train word and contextual embeddings on the concatenated domain-mismatched corpora, and use these as initializations for three tasks: MUSE UBLI, UN Parallel UNMT, and the SemEval 2017 cross-lingual word similarity task. In all cases, our results challenge the conclusions of prior work by showing that proper initialization can recover a large portion of the losses incurred by domain mismatch.
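To make the initialization scheme concrete, the snippet below is a minimal sketch, not the authors' released code: it concatenates two domain-mismatched monolingual corpora, pre-trains skip-gram word embeddings on the result with the fastText library, and exports them in the word2vec text format that alignment tools such as MUSE consume. The file names, hyperparameters, and the per-language splitting step left implicit here are all illustrative assumptions.

```python
# Minimal sketch (not the paper's released code) of the initialization
# scheme described in the abstract: pre-train one set of word embeddings
# on the CONCATENATION of the domain-mismatched monolingual corpora.
# File names and hyperparameters are illustrative assumptions.
import fasttext  # pip install fasttext

# Hypothetical inputs matching the abstract's example: French Wikipedia
# text and English UN-proceedings text.
corpora = ["fr.wikipedia.txt", "en.un_proceedings.txt"]

# Concatenate the two domain-mismatched corpora into one training file.
with open("concat.txt", "w", encoding="utf-8") as out:
    for path in corpora:
        with open(path, encoding="utf-8") as f:
            for line in f:
                out.write(line)

# Pre-train skip-gram embeddings on the concatenated text.
model = fasttext.train_unsupervised("concat.txt", model="skipgram", dim=300)

# Export in word2vec text format; a downstream aligner such as MUSE would
# take (per-language slices of) these vectors as its initialization.
words = model.get_words()
with open("vectors.concat.vec", "w", encoding="utf-8") as out:
    out.write(f"{len(words)} {model.get_dimension()}\n")
    for word in words:
        vec = " ".join(f"{v:.4f}" for v in model.get_word_vector(word))
        out.write(f"{word} {vec}\n")
```

Training on the concatenation rather than on each corpus separately is the key design choice: both languages' vocabularies are embedded by one model under one objective, which is what lets the shared vectors serve as a common starting point for UBLI, UNMT, and the word-similarity task.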