Paper Title

DQI: A Guide to Benchmark Evaluation

Authors

Swaroop Mishra, Anjana Arunkumar, Bhavdeep Sachdeva, Chris Bryan, Chitta Baral

Abstract

A "state of the art" model A surpasses humans in a benchmark B, but fails on similar benchmarks C, D, and E. What does B have that the other benchmarks do not? Recent research provides the answer: spurious bias. However, developing A to solve benchmarks B through E does not guarantee that it will solve future benchmarks. To progress towards a model that "truly learns" an underlying task, we need to quantify the differences between successive benchmarks, as opposed to existing binary and black-box approaches. We propose a novel approach to solve this underexplored task of quantifying benchmark quality by debuting a data quality metric: DQI.
