论文标题

一种用于基于置信的必须链接的聚类算法,无法链接约束

An algorithm for clustering with confidence-based must-link and cannot-link constraints

论文作者

Baumann, Philipp, Hochbaum, Dorit S.

论文摘要

我们在这里研究半监督的$ k $ coltustering问题,其中可以提供有关对象对在相同还是在不同簇中的信息。此信息可以确定性地或有限的信心提供。我们介绍了PCCC(成对信心构成群集)算法,该算法将对象分配给群集,同时考虑对象成对上提供的信息。我们的算法使用整数编程来分配对象,这些对象允许将关系作为坚硬的约束,这些约束保证可以得到满足或可以违反受惩罚的软约束。这种灵活性将我们的算法与最新算法区分开,其中所有成对约束都被认为是硬的,或者全部被认为是软的。我们为整数程序开发了增强的多开始方法和一种模型大小的减少技术,该技术有助于算法的有效性和效率。与现有算法不同,我们的算法量表与大规模实例有多达60,000个对象,100个簇和数百万无法链接约束(这是合并最具挑战性的约束)。我们将PCCC算法与一项广泛的计算研究中的最新方法进行了比较。即使PCCC算法比其适用性的最先进方法更笼统,但它在运行时和解决方案质量的各种指标方面都超过了所有硬或所有软约束的实例的最先进方法。 PCCC算法的代码在GitHub上公开可用。

We study here the semi-supervised $k$-clustering problem where information is available on whether pairs of objects are in the same or in different clusters. This information is either available with certainty or with a limited level of confidence. We introduce the PCCC (Pairwise-Confidence-Constraints-Clustering) algorithm, which iteratively assigns objects to clusters while accounting for the information provided on the pairs of objects. Our algorithm uses integer programming for the assignment of objects which allows to include relationships as hard constraints that are guaranteed to be satisfied or as soft constraints that can be violated subject to a penalty. This flexibility distinguishes our algorithm from the state-of-the-art in which all pairwise constraints are either considered hard, or all are considered soft. We developed an enhanced multi-start approach and a model-size reduction technique for the integer program that contributes to the effectiveness and the efficiency of the algorithm. Unlike existing algorithms, our algorithm scales to large-scale instances with up to 60,000 objects, 100 clusters, and millions of cannot-link constraints (which are the most challenging constraints to incorporate). We compare the PCCC algorithm with state-of-the-art approaches in an extensive computational study. Even though the PCCC algorithm is more general than the state-of-the-art approaches in its applicability, it outperforms the state-of-the-art approaches on instances with all hard or all soft constraints both in terms of runtime and various metrics of solution quality. The code of the PCCC algorithm is publicly available on GitHub.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源