Paper Title
Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover's Distance
Paper Authors
Paper Abstract
Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or are translations of each other. Such aligned data can be used for a variety of NLP tasks, from training cross-lingual representations to mining parallel data for machine translation. In this paper, we develop an unsupervised scoring function that leverages cross-lingual sentence embeddings to compute the semantic distance between documents in different languages. These semantic distances are then used to guide a document alignment algorithm to properly pair cross-lingual web documents across a variety of low-, mid-, and high-resource language pairs. Recognizing that our proposed scoring function and other state-of-the-art methods are computationally intractable for long web documents, we utilize a more tractable greedy algorithm that performs comparably. We experimentally demonstrate that our distance metric yields better alignment than current baselines, outperforming them by 7% on high-resource language pairs, 15% on mid-resource language pairs, and 22% on low-resource language pairs.
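The abstract does not spell out the scoring function or the greedy procedure, so the following is only a minimal sketch of the general idea under stated assumptions: each document is represented as a list of cross-lingual sentence embeddings, sentences carry uniform weights, distances are Euclidean, and mass is moved greedily along the closest sentence pairs first. The function names `greedy_movers_distance` and `align_documents` are hypothetical, and the one-to-one pairing step is an illustrative choice rather than the alignment algorithm used in the paper.

```python
import numpy as np


def greedy_movers_distance(src_emb, tgt_emb, src_w=None, tgt_w=None):
    """Greedy approximation of a mover's-style distance between two documents.

    src_emb, tgt_emb: arrays of shape (n, d) and (m, d) holding sentence
    embeddings (assumed to live in a shared cross-lingual space).
    src_w, tgt_w: optional sentence weights; uniform if omitted (an assumption).
    """
    src_emb = np.asarray(src_emb, dtype=float)
    tgt_emb = np.asarray(tgt_emb, dtype=float)
    n, m = len(src_emb), len(tgt_emb)

    # Normalise weights so each document carries a total mass of 1.
    src_w = np.full(n, 1.0 / n) if src_w is None else np.asarray(src_w, dtype=float) / np.sum(src_w)
    tgt_w = np.full(m, 1.0 / m) if tgt_w is None else np.asarray(tgt_w, dtype=float) / np.sum(tgt_w)

    # Pairwise Euclidean distances between all source and target sentences.
    dist = np.linalg.norm(src_emb[:, None, :] - tgt_emb[None, :, :], axis=-1)

    total = 0.0
    # Visit sentence pairs from closest to farthest, moving as much of the
    # remaining mass as both sides allow.
    for flat in np.argsort(dist, axis=None):
        i, j = divmod(int(flat), m)
        flow = min(src_w[i], tgt_w[j])
        if flow <= 0.0:
            continue
        total += flow * dist[i, j]
        src_w[i] -= flow
        tgt_w[j] -= flow
    return float(total)


def align_documents(src_docs, tgt_docs):
    """Pair documents one-to-one, taking the lowest-distance pairs first.

    src_docs, tgt_docs: dicts mapping a document id to its sentence embeddings.
    """
    scored = sorted(
        (greedy_movers_distance(s_emb, t_emb), s_id, t_id)
        for s_id, s_emb in src_docs.items()
        for t_id, t_emb in tgt_docs.items()
    )
    used_src, used_tgt, pairs = set(), set(), []
    for _, s_id, t_id in scored:
        if s_id not in used_src and t_id not in used_tgt:
            pairs.append((s_id, t_id))
            used_src.add(s_id)
            used_tgt.add(t_id)
    return pairs
```

In this sketch the cost of scoring one document pair is dominated by sorting the n × m sentence distances, which avoids solving a full optimal transport problem. In practice the embeddings would come from a multilingual sentence encoder; the toy 2-dimensional vectors one might pass in for testing are stand-ins only.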