Paper title
Patterns Count-Based Labels for Datasets
Paper authors
Paper abstract
Counts of attribute-value combinations are central to the profiling of a dataset, particularly in determining fitness for use and in eliminating bias and unfairness. While counts of individual attribute values may be stored in some dataset profiles, there are too many combinations of attributes for it to be practical to store a count for each one. In this paper, we develop the notion of storing a "label" of limited size that can be used to obtain good estimates of these counts. A label, in this paper, contains information about the counts of selected patterns (attribute-value combinations) in the data. We define an estimation function that uses this label to estimate the count of every pattern. We present the problem of finding the optimal label given a bound on its size and propose a heuristic algorithm for generating optimal labels. We experimentally show the accuracy of the count estimates derived from the resulting labels and the efficiency of our algorithm.
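To make the idea of a label and an estimation function concrete, here is a minimal sketch. It is not the paper's actual definition: the abstract does not specify the estimation function, so this version assumes the label stores exact counts for all patterns over a small chosen attribute subset, plus single-attribute value counts, and treats the remaining attributes as independent. The names `build_label`, `estimate_count`, and `label_attrs` are hypothetical.

```python
from collections import Counter

def build_label(rows, attrs, label_attrs):
    """Build a small 'label' (hypothetical structure, for illustration only).

    rows: list of dicts mapping attribute name -> value.
    attrs: all attribute names in the dataset.
    label_attrs: the attribute subset whose joint pattern counts are stored.
    """
    return {
        "n": len(rows),
        "label_attrs": tuple(label_attrs),
        # Counts of individual attribute values, as a dataset profile might store.
        "value_counts": {a: Counter(r[a] for r in rows) for a in attrs},
        # Exact counts of the selected patterns over label_attrs.
        "pattern_counts": Counter(tuple(r[a] for a in label_attrs) for r in rows),
    }

def estimate_count(label, pattern):
    """Estimate how many rows match `pattern` (dict: attribute -> value).

    Assumption: attributes outside label_attrs are independent of everything
    else, so each one scales the stored count by its value's overall frequency.
    """
    n = label["n"]
    if n == 0:
        return 0.0
    s = label["label_attrs"]
    # Exact part: sum the stored counts that agree with the pattern on label_attrs.
    base = sum(
        count
        for key, count in label["pattern_counts"].items()
        if all(pattern.get(a, v) == v for a, v in zip(s, key))
    )
    estimate = float(base)
    # Estimated part: multiply by the marginal frequency of each remaining attribute.
    for a, v in pattern.items():
        if a not in s:
            estimate *= label["value_counts"][a][v] / n
    return estimate

rows = [
    {"gender": "F", "race": "A", "age": "young"},
    {"gender": "F", "race": "A", "age": "old"},
    {"gender": "F", "race": "B", "age": "young"},
    {"gender": "M", "race": "A", "age": "young"},
    {"gender": "M", "race": "B", "age": "old"},
    {"gender": "M", "race": "B", "age": "old"},
]
label = build_label(rows, ["gender", "race", "age"], ["gender", "race"])
print(estimate_count(label, {"gender": "F", "race": "A", "age": "young"}))
```

In this sketch the label's size is governed by the number of stored pattern counts, so choosing `label_attrs` trades size against accuracy. Patterns that involve only the label attributes are counted exactly, while any other attribute contributes an independence-based approximation that can deviate from the true count.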