Bootstrapping Techniques for Polysynthetic Morphological Analysis

Paper Title

Bootstrapping Techniques for Polysynthetic Morphological Analysis

Authors

William Lane, Steven Bird

Abstract

Polysynthetic languages have exceptionally large and sparse vocabularies, thanks to the number of morpheme slots and combinations in a word. This complexity, together with a general scarcity of written data, poses a challenge to the development of natural language technologies. To address this challenge, we offer linguistically-informed approaches for bootstrapping a neural morphological analyzer, and demonstrate its application to Kunwinjku, a polysynthetic Australian language. We generate data from a finite state transducer to train an encoder-decoder model. We improve the model by "hallucinating" missing linguistic structure into the training data, and by resampling from a Zipf distribution to simulate a more natural distribution of morphemes. The best model accounts for all instances of reduplication in the test set and achieves an accuracy of 94.7% overall, a 10 percentage point improvement over the FST baseline. This process demonstrates the feasibility of bootstrapping a neural morph analyzer from minimal resources.
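The resampling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' procedure, so everything here is an assumption made for illustration: the function names `zipf_weights` and `zipf_resample`, the exponent of 1.0, the rank ordering of the items, and the placeholder morpheme labels. The idea shown is only the general one: assign each item a weight proportional to 1 / rank^s and draw training items with replacement under those weights, so that a few items become frequent and most stay rare.

```python
import random


def zipf_weights(n, exponent=1.0):
    """Return normalized Zipf weights for ranks 1..n (weight ~ 1 / rank**exponent)."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def zipf_resample(items, k, exponent=1.0, seed=0):
    """Draw k items with replacement; items earlier in the list are treated as
    higher-ranked (more frequent). The ordering is an assumption of this sketch."""
    rng = random.Random(seed)
    weights = zipf_weights(len(items), exponent)
    return rng.choices(items, weights=weights, k=k)


# Hypothetical placeholder labels, not real Kunwinjku morphemes.
morphemes = ["stem_a", "stem_b", "stem_c", "stem_d", "stem_e"]
sample = zipf_resample(morphemes, k=1000)
```

In a pipeline like the one the abstract describes, such a sampler would sit between FST-based data generation and model training, replacing a uniform draw over generated forms; how the ranks are chosen is left open here.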
