Paper Title
A Survey of Parameters Associated with the Quality of Benchmarks in NLP
Paper Authors
Paper Abstract
Several benchmarks have been built with heavy investment in resources to track our progress in NLP. Thousands of papers published in response to those benchmarks have competed to top leaderboards, with models often surpassing human performance. However, recent studies have shown that models triumph over several popular benchmarks merely by overfitting on spurious biases, without truly learning the desired task. Despite this finding, efforts to tackle bias in benchmarks still rely on workarounds that discard low-quality data, thereby failing to fully utilize the resources invested in benchmark creation, and that cover only limited sets of biases. A potential solution to these issues -- a metric quantifying quality -- remains underexplored. Inspired by successful quality indices in several domains such as power, food, and water, we take the first step towards such a metric by identifying certain language properties that can represent the various possible interactions leading to biases in a benchmark. We look for bias-related parameters that can potentially help pave our way towards the metric. We survey existing works and identify parameters capturing various properties of bias, their origins, types, and impact on performance, generalization, and robustness. Our analysis spans datasets and a hierarchy of tasks ranging from NLI to Summarization, ensuring that our parameters are generic and not overfitted to a specific task or dataset. We also develop certain parameters in this process.