Paper Title

Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark

Paper Authors

Guanhua Zhang, Bing Bai, Jian Liang, Kun Bai, Conghui Zhu, Tiejun Zhao

Paper Abstract

Recent studies show that crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases like annotation artifacts. Models utilizing these superficial clues gain mirage advantages on the in-domain testing set, which makes the evaluation results over-estimated. The lack of trustworthy evaluation settings and benchmarks stalls the progress of NLI research. In this paper, we propose to assess a model's trustworthy generalization performance with cross-datasets evaluation. We present a new unified cross-datasets benchmark with 14 NLI datasets, and re-evaluate 9 widely-used neural network-based NLI models as well as 5 recently proposed debiasing methods for annotation artifacts. Our proposed evaluation scheme and experimental baselines could provide a basis to inspire future reliable NLI research.
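To make the proposed evaluation scheme concrete, below is a minimal illustrative sketch of cross-dataset evaluation: a classifier is trained on one NLI dataset and scored zero-shot on a different, held-out dataset, so that annotation artifacts of the training set cannot inflate the reported accuracy. This is not the paper's benchmark code; the toy data, the bag-of-words features, and the scikit-learn classifier are all assumptions made purely for illustration.

```python
# Illustrative sketch of cross-dataset NLI evaluation (not the paper's code).
# Train on one dataset, evaluate on a different held-out dataset, no fine-tuning.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy stand-ins for two NLI datasets: (premise, hypothesis, label) triples.
train_data = [
    ("A man is playing a guitar.", "A person makes music.", "entailment"),
    ("A man is playing a guitar.", "Nobody is playing music.", "contradiction"),
    ("A dog runs in the park.", "An animal is outside.", "entailment"),
    ("A dog runs in the park.", "The dog is sleeping indoors.", "contradiction"),
]
test_data = [  # a different "dataset" with its own distribution of examples
    ("Two children are reading books.", "Kids are reading.", "entailment"),
    ("Two children are reading books.", "The children are swimming.", "contradiction"),
]

def featurize(pairs, vectorizer, fit=False):
    """Concatenate premise and hypothesis into one bag-of-words vector."""
    texts = [p + " [SEP] " + h for p, h, _ in pairs]
    return vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)

vectorizer = CountVectorizer()
X_train = featurize(train_data, vectorizer, fit=True)
y_train = [label for _, _, label in train_data]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Cross-dataset (zero-shot) evaluation on the held-out dataset.
X_test = featurize(test_data, vectorizer)
y_test = [label for _, _, label in test_data]
print("cross-dataset accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```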
