Paper Title

Morphological Reinflection with Multiple Arguments: An Extended Annotation Schema and a Georgian Case Study

Authors

David Guriel, Omer Goldman, Reut Tsarfaty

Abstract

In recent years, a flurry of morphological datasets has emerged, most notably UniMorph, a multilingual repository of inflection tables. However, the flat structure of the current morphological annotation schema makes the treatment of some languages quirky, if not impossible, specifically in cases of polypersonal agreement, where verbs agree with multiple arguments using true affixes. In this paper, we propose to address this phenomenon by expanding the UniMorph annotation schema to a hierarchical feature structure that naturally accommodates complex argument marking. We apply this extended schema to one such language, Georgian, and provide a human-verified, accurate, and balanced morphological dataset for Georgian verbs. The dataset has 4 times more tables and 6 times more verb forms than the existing UniMorph dataset, covering all possible variants of argument marking and demonstrating the adequacy of our proposed schema. Experiments with a standard reinflection model show that generalization is easy when the data is split at the form level, but extremely hard when it is split along lemma lines. Expanding the other languages in UniMorph to this schema is expected to improve the coverage, consistency, and interpretability of this benchmark.
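The difference between the two evaluation splits mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code. The function names `form_split` and `lemma_split`, the `(lemma, form, features)` row layout, and the toy placeholder rows are all assumptions made for illustration. In a form-level split, forms of the same lemma may appear in both training and test data. In a lemma-level split, every test lemma is unseen during training.

```python
import random
from collections import defaultdict


def form_split(rows, test_ratio=0.25, seed=0):
    """Shuffle individual forms; one lemma may land in both train and test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]


def lemma_split(rows, test_ratio=0.25, seed=0):
    """Assign whole lemmas to one side, so test lemmas never occur in train."""
    by_lemma = defaultdict(list)
    for row in rows:
        by_lemma[row[0]].append(row)  # row[0] is the lemma (assumed layout)
    lemmas = sorted(by_lemma)
    random.Random(seed).shuffle(lemmas)
    cut = int(len(lemmas) * (1 - test_ratio))
    train = [r for lemma in lemmas[:cut] for r in by_lemma[lemma]]
    test = [r for lemma in lemmas[cut:] for r in by_lemma[lemma]]
    return train, test


# Hypothetical placeholder rows, not real Georgian data:
# 4 lemmas with 3 inflected forms each.
rows = [
    (f"lemma{i}", f"form{i}_{j}", f"FEATS{j}")
    for i in range(4)
    for j in range(3)
]

form_train, form_test = form_split(rows)
lemma_train, lemma_test = lemma_split(rows)

# Under the lemma split, no lemma is shared between train and test.
shared = {r[0] for r in lemma_train} & {r[0] for r in lemma_test}
print(len(shared))
```

Under the lemma split, a model cannot rely on having seen other forms of the same verb, which is consistent with the abstract's report that this setting is much harder.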
