Discovering outstanding subgroup lists for numeric targets using MDL

Paper title

Discovering outstanding subgroup lists for numeric targets using MDL

Paper authors

Proença, Hugo M., Grünwald, Peter, Bäck, Thomas, van Leeuwen, Matthijs

Abstract

The task of subgroup discovery (SD) is to find interpretable descriptions of subsets of a dataset that stand out with respect to a target attribute. To address the problem of mining large numbers of redundant subgroups, subgroup set discovery (SSD) has been proposed. State-of-the-art SSD methods have their limitations, though, as they typically rely heavily on heuristics and/or user-chosen hyperparameters. We propose a dispersion-aware problem formulation for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists. We argue that the best subgroup list is the one that best summarizes the data given the overall distribution of the target. We restrict our focus to a single numeric target variable and show that our formalization coincides with an existing quality measure when finding a single subgroup, but that, in addition, it allows subgroup quality to be traded off against the complexity of the subgroup. We next propose SSD++, a heuristic algorithm for which we empirically demonstrate that it returns outstanding subgroup lists: non-redundant sets of compact subgroups that stand out by having strongly deviating means and small spread.
