Title
Automatic Correction of Syntactic Dependency Annotation Differences
Authors
Abstract
Annotation inconsistencies between data sets can cause problems for low-resource NLP, where noisy or inconsistent data cannot be as easily replaced compared with resource-rich languages. In this paper, we propose a method for automatically detecting annotation mismatches between dependency parsing corpora, as well as three related methods for automatically converting the mismatches. All three methods rely on comparing an unseen example in a new corpus with similar examples in an existing corpus. These three methods include a simple lexical replacement using the most frequent tag of the example in the existing corpus, a GloVe embedding-based replacement that considers a wider pool of examples, and a BERT embedding-based replacement that uses contextualized embeddings to provide examples fine-tuned to our specific data. We then evaluate these conversions by retraining two dependency parsers -- Stanza (Qi et al. 2020) and Parsing as Tagging (PaT) (Vacareanu et al. 2020) -- on the converted and unconverted data. We find that applying our conversions yields significantly better performance in many cases. Some differences are observed between the two parsers. Stanza has a more complex architecture with a quadratic algorithm, so it takes longer to train, but it can generalize better with less data. The PaT parser has a simpler architecture with a linear algorithm, speeding up training time but requiring more training data to reach comparable or better performance.
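The simplest of the three conversion methods, lexical replacement by most frequent tag, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the corpus representation (a mapping from token to its observed dependency tags in the existing corpus) and the function name are assumptions for the sake of the example.

```python
from collections import Counter

def most_frequent_tag_replacement(token, existing_corpus_tags):
    """Replace a token's dependency tag with the tag most frequently
    assigned to that same token in the existing corpus.

    existing_corpus_tags: hypothetical dict mapping a token to the list
    of dependency tags it received in the existing corpus.
    Returns None if the token was never seen, in which case the
    annotation in the new corpus would be left unchanged.
    """
    tags = existing_corpus_tags.get(token)
    if not tags:
        return None
    # Pick the single most common tag observed for this token
    return Counter(tags).most_common(1)[0][0]

# Toy existing corpus: "dog" is usually annotated nsubj, sometimes obj
corpus_tags = {"dog": ["nsubj", "obj", "nsubj"], "run": ["root"]}
print(most_frequent_tag_replacement("dog", corpus_tags))   # → nsubj
print(most_frequent_tag_replacement("cat", corpus_tags))   # → None
```

The GloVe- and BERT-based variants described above generalize this idea by retrieving similar (rather than identical) examples via embedding similarity before choosing a replacement tag.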