Paper Title
Promises and Pitfalls of Threshold-based Auto-labeling
Paper Authors
Paper Abstract
Creating large-scale, high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Threshold-based auto-labeling (TBAL), where validation data obtained from humans is used to find a confidence threshold above which unlabeled data is machine-labeled, reduces reliance on manual annotation. TBAL is emerging as a widely used solution in practice. Given the long shelf-life and diverse usage of the resulting datasets, it is crucial to understand when the data obtained by such auto-labeling systems can be relied on. This is the first work to analyze TBAL systems and derive sample complexity bounds on the amount of human-labeled validation data required to guarantee the quality of machine-labeled data. Our results provide two crucial insights. First, reasonable chunks of unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of TBAL systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.
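The core TBAL mechanism described above — using human-labeled validation data to pick a confidence threshold, then machine-labeling only the unlabeled points that clear it — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the greedy threshold search over validation confidences, and the scikit-learn-style `predict_proba` output format are all assumptions for the example.

```python
import numpy as np

def find_threshold(val_probs, val_labels, max_error=0.05):
    """Pick the smallest confidence threshold such that the validation
    error, measured only on points at or above it, stays within max_error.
    (Illustrative greedy search; not the paper's exact procedure.)"""
    conf = val_probs.max(axis=1)          # model confidence per point
    preds = val_probs.argmax(axis=1)      # predicted class per point
    correct = preds == val_labels
    for t in np.unique(conf):             # candidate thresholds, ascending
        mask = conf >= t
        if mask.sum() == 0:
            continue
        err = 1.0 - correct[mask].mean()  # error above this threshold
        if err <= max_error:
            return t                      # first feasible = smallest threshold
    return None                           # no threshold meets the target error

def auto_label(unlab_probs, threshold):
    """Machine-label only the points whose confidence clears the threshold;
    the rest would go back to human annotators."""
    conf = unlab_probs.max(axis=1)
    mask = conf >= threshold
    return mask, unlab_probs.argmax(axis=1)[mask]
```

A lower threshold auto-labels more data but risks more label noise; the validation set is what lets the system certify the trade-off, which is also why validation data usage can become the hidden cost the abstract warns about.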