A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification

Paper Title

A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification

Authors

Paul F. Jaeger, Carsten T. Lüth, Lukas Klein, Till J. Bungert

Abstract

Reliable application of machine learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by assigning confidence scores. This confidence may be obtained by quantifying the model's predictive uncertainty, by learning explicit scoring functions, or by assessing whether the input is in line with the training distribution. Curiously, while these approaches all claim to address the same eventual goal of detecting failures of a classifier upon real-life application, they currently constitute largely separated research fields with individual evaluation protocols, which either exclude a substantial part of relevant methods or ignore large parts of relevant failure sources. In this work, we systematically reveal current pitfalls caused by these inconsistencies and derive requirements for a holistic and realistic evaluation of failure detection. To demonstrate the relevance of this unified perspective, we present a large-scale empirical study that, for the first time, enables benchmarking confidence scoring functions with respect to all relevant methods and failure sources. The revelation of a simple softmax response baseline as the overall best-performing method underlines the drastic shortcomings of current evaluation in the abundance of publicized research on confidence scoring. Code and trained models are at https://github.com/IML-DKFZ/fd-shifts.
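
To make the softmax response baseline named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common reading of "softmax response" as the maximum softmax probability of the predicted class. The function names and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_response(logits):
    """Return (predicted class, confidence).

    Confidence is the maximum softmax probability, which is the
    assumed definition of the softmax response baseline here.
    """
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return pred, probs[pred]


def flag_failures(batch_logits, threshold=0.5):
    """Flag predictions whose confidence falls below a threshold.

    The threshold value is a hypothetical choice for illustration;
    in practice it would be tuned on held-out data.
    """
    results = []
    for logits in batch_logits:
        pred, conf = softmax_response(logits)
        results.append((pred, conf, conf < threshold))
    return results
```

For example, `flag_failures([[2.0, 0.0, 0.0], [0.1, 0.0, 0.0]])` would treat the first prediction as confident and flag the second as a likely failure, since its class scores are nearly tied.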
