Statistical and Neural Methods for Cross-lingual Entity Label Mapping in Knowledge Graphs

Paper Title

Statistical and Neural Methods for Cross-lingual Entity Label Mapping in Knowledge Graphs

Authors

Gabriel Amaral, Mārcis Pinnis, Inguna Skadiņa, Odinaldo Rodrigues, Elena Simperl

Abstract

Knowledge bases such as Wikidata amass vast amounts of named entity information, such as multilingual labels, which can be extremely useful for various multilingual and cross-lingual applications. However, such labels are not guaranteed to match across languages from an information consistency standpoint, greatly compromising their usefulness for fields such as machine translation. In this work, we investigate the application of word and sentence alignment techniques coupled with a matching algorithm to align cross-lingual entity labels extracted from Wikidata in 10 languages. Our results indicate that mapping between Wikidata's main labels stands to be considerably improved (up to 20 points in F1-score) by any of the employed methods. We show how methods relying on sentence embeddings outperform all others, even across different scripts. We believe the application of such techniques to measure the similarity of label pairs, coupled with a knowledge base rich in high-quality entity labels, to be an excellent asset to machine translation.
