A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging

Paper Title

A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging

Authors

Shabnam Behzad, Amir Zeldes

Abstract

Part of speech tagging is a fundamental NLP task often regarded as solved for high-resource languages such as English. Current state-of-the-art models have achieved high accuracy, especially on the news domain. However, when these models are applied to other corpora with different genres, and especially user-generated data from the Web, we see substantial drops in performance. In this work, we study how a state-of-the-art tagging model trained on different genres performs on Web content from unfiltered Reddit forum discussions. More specifically, we use data from multiple sources: OntoNotes, a large benchmark corpus with 'well-edited' text, the English Web Treebank with 5 Web genres, and GUM, with 7 further genres other than Reddit. We report the results when training on different splits of the data, tested on Reddit. Our results show that even small amounts of in-domain data can outperform the contribution of data an order of magnitude larger coming from other Web domains. To make progress on out-of-domain tagging, we also evaluate an ensemble approach using multiple single-genre taggers as input features to a meta-classifier. We present state of the art performance on tagging Reddit data, as well as error analysis of the results of these models, and offer a typology of the most common error types among them, broken down by training corpus.
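
The ensemble idea in the abstract, where several single-genre taggers feed their predictions to a meta-classifier, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy corpora, the unigram lookup taggers, the tag labels, and the lookup-table meta-classifier are hypothetical stand-ins, not the models, tagset, or data used by the authors.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sents, default="NOUN"):
    """Toy single-genre tagger: most frequent training tag per token."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for token, tag in sent:
            counts[token.lower()][tag] += 1
    table = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}
    # Unknown tokens fall back to the default tag.
    return lambda token: table.get(token.lower(), default)


def base_features(taggers, token):
    """Feature vector for the meta-classifier: one predicted tag per tagger."""
    return tuple(tagger(token) for tagger in taggers)


def train_meta(taggers, in_domain_sents):
    """Toy meta-classifier: maps each tuple of base predictions to the gold
    tag most often seen with it in a small in-domain training set."""
    votes = defaultdict(Counter)
    for sent in in_domain_sents:
        for token, gold in sent:
            votes[base_features(taggers, token)][gold] += 1
    table = {feats: c.most_common(1)[0][0] for feats, c in votes.items()}

    def predict(token):
        feats = base_features(taggers, token)
        if feats in table:
            return table[feats]
        # Unseen combination: fall back to a majority vote of the taggers.
        return Counter(feats).most_common(1)[0][0]

    return predict


# Hypothetical toy corpora standing in for three training genres.
news = [[("The", "DET"), ("report", "NOUN"), ("fell", "VERB")]]
web = [[("lol", "INTJ"), ("the", "DET"), ("post", "NOUN")], [("lmao", "INTJ")]]
fiction = [[("She", "PRON"), ("read", "VERB"), ("the", "DET"), ("letter", "NOUN")]]
# A very small amount of in-domain (Reddit-like) data for the meta-classifier.
reddit_train = [[("lmao", "INTJ"), ("the", "DET"), ("mods", "NOUN")]]

taggers = [train_unigram_tagger(c) for c in (news, web, fiction)]
meta = train_meta(taggers, reddit_train)

for token in ("lol", "the", "report"):
    print(token, base_features(taggers, token), meta(token))
```

In this toy setup the meta-classifier learns from a single in-domain sentence that the web tagger should be trusted when it disagrees with the other two on interjections, so it can overrule a plain majority vote. A realistic system would replace the lookup tables with trained neural taggers and a learned classifier over richer features.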
