论文标题
可扩展的分层聚类与树木移植
Scalable Hierarchical Clustering with Tree Grafting
论文作者
论文摘要
我们介绍了Grinch,这是一种新算法,用于大规模,非怪兽分层聚类,并具有一般的链接函数,可在两个点集之间计算任意相似性。 Grinch的关键组成部分是其旋转和移植子例程,随着新点的到来,它们有效地重新配置了层次结构,从而支持发现具有复杂结构的簇。 Grinch的动机是由新的可分离性与链接函数进行聚类的概念:我们证明,当模型与地面群集一致时,Grinch可以保证产生包含地面真相的群集树,独立于数据到达顺序。我们在基准和作者Coreference数据集(具有标准和学习的链接函数)上的经验结果表明,格林奇比其他可扩展方法更准确,并且比层次结构聚集群集更快。
We introduce Grinch, a new algorithm for large-scale, non-greedy hierarchical clustering with general linkage functions that compute arbitrary similarity between two point sets. The key components of Grinch are its rotate and graft subroutines that efficiently reconfigure the hierarchy as new points arrive, supporting discovery of clusters with complex structure. Grinch is motivated by a new notion of separability for clustering with linkage functions: we prove that when the model is consistent with a ground-truth clustering, Grinch is guaranteed to produce a cluster tree containing the ground-truth, independent of data arrival order. Our empirical results on benchmark and author coreference datasets (with standard and learned linkage functions) show that Grinch is more accurate than other scalable methods, and orders of magnitude faster than hierarchical agglomerative clustering.