Paper Title
Global Multiclass Classification and Dataset Construction via Heterogeneous Local Experts
Paper Authors
Paper Abstract
In the domains of dataset construction and crowdsourcing, a notable challenge is to aggregate labels from a heterogeneous set of labelers, each of whom is potentially an expert in some subset of tasks (and less reliable in others). To reduce the costs of hiring human labelers or training automated labeling systems, it is of interest to minimize the number of labelers while ensuring the reliability of the resulting dataset. We model this as the problem of performing $K$-class classification using the predictions of smaller classifiers, each trained on a subset of $[K]$, and derive bounds on the number of classifiers needed to accurately infer the true class of an unlabeled sample under both adversarial and stochastic assumptions. By exploiting a connection to the classical set cover problem, we produce a near-optimal scheme for designing such configurations of classifiers, which recovers the well-known one-vs.-one classification approach as a special case. Experiments with the MNIST and CIFAR-10 datasets demonstrate the favorable accuracy (compared to a centralized classifier) of our aggregation scheme applied to classifiers trained on subsets of the data. These results suggest a new way to automatically label data or adapt an existing set of local classifiers to larger-scale multiclass problems.
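To make the special case concrete, the following minimal sketch (not the paper's actual scheme) shows how the one-vs.-one configuration the abstract mentions aggregates local experts: each pairwise classifier is trained on a 2-element subset of $[K]$, and a sample's class is inferred by majority vote over all pairwise predictions. The `ovo_aggregate` helper and the toy predictions are illustrative assumptions, not code from the paper.

```python
from itertools import combinations
from collections import Counter

def ovo_aggregate(pairwise_preds):
    """Aggregate one-vs.-one local predictions by majority vote.

    pairwise_preds: dict mapping each class pair (i, j) to the class
    that the pairwise classifier predicted for a single sample.
    The true class should win all K-1 contests it participates in.
    """
    votes = Counter(pairwise_preds.values())
    winner, _ = votes.most_common(1)[0]
    return winner

# Toy example: K = 4 classes, one classifier per 2-element subset of [K].
# Class 2 wins every pair it appears in; other pairs vote arbitrarily.
K = 4
true_class = 2
preds = {(i, j): (true_class if true_class in (i, j) else i)
         for i, j in combinations(range(K), 2)}
print(ovo_aggregate(preds))  # -> 2
```

With $K$ classes this configuration uses $\binom{K}{2}$ local classifiers; the set-cover-based scheme described in the abstract is aimed at finding configurations that need fewer.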