Mapping the global dynamics of benchmark creation and saturation

Paper title

Mapping global dynamics of benchmark creation and saturation in artificial intelligence

Authors

Ott, Simon; Barbosa-Silva, Adriano; Blagec, Kathrin; Brauner, Jan; Samwald, Matthias

Abstract

Benchmarks are crucial to measuring and steering progress in artificial intelligence (AI). However, recent studies raised concerns over the state of AI benchmarking, reporting issues such as benchmark overfitting, benchmark saturation and increasing centralization of benchmark dataset creation. To facilitate monitoring of the health of the AI benchmarking ecosystem, we introduce methodologies for creating condensed maps of the global dynamics of benchmark creation and saturation. We curated data for 3765 benchmarks covering the entire domains of computer vision and natural language processing, and show that a large fraction of benchmarks quickly trended towards near-saturation, that many benchmarks fail to find widespread utilization, and that benchmark performance gains for different AI tasks were prone to unforeseen bursts. We analyze attributes associated with benchmark popularity, and conclude that future benchmarks should emphasize versatility, breadth and real-world utility.
