DQI: A Guide to Benchmark Evaluation

Paper Title

DQI: A Guide to Benchmark Evaluation

Authors

Swaroop Mishra, Anjana Arunkumar, Bhavdeep Sachdeva, Chris Bryan, Chitta Baral

Abstract

A "state of the art" model A surpasses humans in a benchmark B, but fails on similar benchmarks C, D, and E. What does B have that the other benchmarks do not? Recent research provides the answer: spurious bias. However, developing A to solve benchmarks B through E does not guarantee that it will solve future benchmarks. To progress towards a model that "truly learns" an underlying task, we need to quantify the differences between successive benchmarks, as opposed to existing binary and black-box approaches. We propose a novel approach to solve this underexplored task of quantifying benchmark quality by debuting a data quality metric: DQI.
