Paper Title

Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark

Paper Authors

Guanhua Zhang, Bing Bai, Jian Liang, Kun Bai, Conghui Zhu, Tiejun Zhao

Paper Abstract

Recent studies show that crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases like annotation artifacts. Models utilizing these superficial clues gain mirage advantages on the in-domain testing set, which makes the evaluation results over-estimated. The lack of trustworthy evaluation settings and benchmarks stalls the progress of NLI research. In this paper, we propose to assess a model's trustworthy generalization performance with cross-datasets evaluation. We present a new unified cross-datasets benchmark with 14 NLI datasets, and re-evaluate 9 widely-used neural network-based NLI models as well as 5 recently proposed debiasing methods for annotation artifacts. Our proposed evaluation scheme and experimental baselines could provide a basis to inspire future reliable NLI research.
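To make the proposed evaluation scheme concrete, below is a minimal illustrative sketch of cross-dataset evaluation: a classifier is trained on one NLI dataset and scored zero-shot on a different, held-out dataset, so that annotation artifacts of the training set cannot inflate the reported accuracy. This is not the paper's benchmark code; the toy data, the bag-of-words features, and the scikit-learn classifier are all assumptions made purely for illustration.

```python
# Illustrative sketch of cross-dataset NLI evaluation (not the paper's code).
# Train on one dataset, evaluate on a different held-out dataset, no fine-tuning.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy stand-ins for two NLI datasets: (premise, hypothesis, label) triples.
train_data = [
    ("A man is playing a guitar.", "A person makes music.", "entailment"),
    ("A man is playing a guitar.", "Nobody is playing music.", "contradiction"),
    ("A dog runs in the park.", "An animal is outside.", "entailment"),
    ("A dog runs in the park.", "The dog is sleeping indoors.", "contradiction"),
]
test_data = [  # a different "dataset" with its own distribution of examples
    ("Two children are reading books.", "Kids are reading.", "entailment"),
    ("Two children are reading books.", "The children are swimming.", "contradiction"),
]

def featurize(pairs, vectorizer, fit=False):
    """Concatenate premise and hypothesis into one bag-of-words vector."""
    texts = [p + " [SEP] " + h for p, h, _ in pairs]
    return vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)

vectorizer = CountVectorizer()
X_train = featurize(train_data, vectorizer, fit=True)
y_train = [label for _, _, label in train_data]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Cross-dataset (zero-shot) evaluation on the held-out dataset.
X_test = featurize(test_data, vectorizer)
y_test = [label for _, _, label in test_data]
print("cross-dataset accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```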
