Paper Title

Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging

Authors

Ayyoob Imani, Silvia Severini, Masoud Jalili Sabet, François Yvon, Hinrich Schütze

Abstract

Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state-of-the-art for unsupervised POS tagging of low-resource languages.
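
The graph construction described in the abstract can be illustrated with a small sketch. It builds a graph with words as nodes and alignment links as edges, then projects tags onto unlabeled target words. Note that it replaces the paper's Graph Neural Network with a plain majority vote over labeled neighbours, purely for illustration; the function names, the `(language, word_index)` node format, and the target language code `xx` are assumptions, not taken from the paper.

```python
from collections import Counter, defaultdict


def build_alignment_graph(alignments):
    """Build an undirected graph from word-alignment links.

    Each link is a pair of nodes; a node is a (language, word_index) tuple
    (a hypothetical encoding chosen for this sketch).
    """
    graph = defaultdict(set)
    for u, v in alignments:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def propagate_labels(graph, source_labels):
    """Assign each unlabeled node the majority tag of its labeled neighbours.

    This is a simple voting baseline, not the GNN used in the paper.
    Nodes with no labeled neighbour stay unlabeled.
    """
    projected = {}
    for node, neighbours in graph.items():
        if node in source_labels:
            continue  # source-language words already carry a tag
        votes = Counter(
            source_labels[n] for n in neighbours if n in source_labels
        )
        if votes:
            projected[node] = votes.most_common(1)[0][0]
    return projected


# Toy example: one two-word sentence translated into three source
# languages and one low-resource target language ("xx", hypothetical).
alignments = [
    (("eng", 0), ("xx", 0)),
    (("deu", 0), ("xx", 0)),
    (("eng", 1), ("xx", 1)),
    (("deu", 1), ("xx", 1)),
    (("fra", 1), ("xx", 1)),
]
source_labels = {
    ("eng", 0): "DET",
    ("deu", 0): "DET",
    ("eng", 1): "NOUN",
    ("deu", 1): "NOUN",
    ("fra", 1): "VERB",  # a noisy tag that the vote overrules
}

graph = build_alignment_graph(alignments)
projected = propagate_labels(graph, source_labels)
print(projected)
```

Using several source languages lets a single noisy alignment or tag be outvoted, which is the intuition behind combining multiple sources; a learned propagation model would weigh the neighbours rather than count them equally.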
