Paper Title


Decision trees as partitioning machines to characterize their generalization properties

Paper Authors

Leboeuf, Jean-Samuel, LeBlanc, Frédéric, Marchand, Mario

Abstract


Decision trees are popular machine learning models that are simple to build and easy to interpret. Even though algorithms to learn decision trees date back almost 50 years, key properties affecting their generalization error are still weakly bounded. Hence, we revisit binary decision trees on real-valued features from the perspective of partitions of the data. We introduce the notion of partitioning function, and we relate it to the growth function and to the VC dimension. Using this new concept, we are able to find the exact VC dimension of decision stumps, which is given by the largest integer $d$ such that $2\ell \ge \binom{d}{\left\lfloor\frac{d}{2}\right\rfloor}$, where $\ell$ is the number of real-valued features. We provide a recursive expression to bound the partitioning functions, resulting in an upper bound on the growth function of any decision tree structure. This allows us to show that the VC dimension of a binary tree structure with $N$ internal nodes is of order $N \log(N\ell)$. Finally, we elaborate a pruning algorithm based on these results that performs better than the CART algorithm on a number of datasets, with the advantage that no cross-validation is required.
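The abstract's formula for the exact VC dimension of decision stumps is directly computable: it is the largest integer $d$ satisfying $2\ell \ge \binom{d}{\lfloor d/2 \rfloor}$. A minimal sketch of that computation (the function name `stump_vc_dimension` is illustrative, not from the paper):

```python
from math import comb


def stump_vc_dimension(num_features: int) -> int:
    """Largest integer d such that 2*ell >= C(d, floor(d/2)),
    the exact VC dimension of decision stumps on `num_features`
    real-valued features, per the formula stated in the abstract."""
    d = 1
    # The central binomial coefficient grows monotonically in d,
    # so we can simply advance d while the inequality still holds.
    while 2 * num_features >= comb(d + 1, (d + 1) // 2):
        d += 1
    return d
```

For a single real-valued feature ($\ell = 1$) this yields $d = 2$, matching the classical fact that a one-dimensional threshold classifier with both label orientations shatters at most two points.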
