Paper title
Scalable Initialization Methods for Large-Scale Clustering
Paper authors
Paper abstract
In this work, two new initialization methods for K-means clustering are proposed. Both proposals apply a divide-and-conquer approach to the K-means|| type of initialization strategy. The second proposal also uses multiple lower-dimensional subspaces, produced by the random projection method, for the initialization. The proposed methods are scalable and can be run in parallel, which makes them suitable for initializing large-scale problems. In the experiments, the proposed methods are compared with the K-means++ and K-means|| methods on an extensive set of reference and synthetic large-scale datasets. For the synthetic datasets, a novel algorithm for generating high-dimensional clustering data is given. The experiments show that the proposed methods compare favorably with the state of the art. We also observe that the currently most popular initialization, K-means++, behaves like random initialization in very high-dimensional cases.
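For orientation, the K-means|| seeding that both proposals build on can be sketched as follows. This is a minimal sequential sketch of the general K-means|| scheme (oversampling rounds followed by a weighted reduction to K centers). It is not the divide-and-conquer or random-projection variants proposed in this work. The function names, the default oversampling factor of 2K, the fixed five rounds, and the use of weighted D²-sampling for the reduction step are assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Return (squared distance, index) of the center closest to point."""
    return min((sq_dist(point, c), i) for i, c in enumerate(centers))


def weighted_dsquared_seeding(points, weights, k, rng):
    """Reduce weighted candidates to k centers with weighted D^2-sampling."""
    chosen = [rng.choices(points, weights=weights, k=1)[0]]
    while len(chosen) < k:
        scores = [w * nearest(p, chosen)[0] for p, w in zip(points, weights)]
        if sum(scores) == 0:  # fewer than k distinct candidates are available
            break
        chosen.append(rng.choices(points, weights=scores, k=1)[0])
    return chosen


def kmeans_parallel_init(data, k, oversampling=None, rounds=5, seed=0):
    """Sketch of K-means|| seeding; defaults are illustrative assumptions."""
    rng = random.Random(seed)
    ell = 2 * k if oversampling is None else oversampling
    centers = [rng.choice(data)]  # start from one uniformly chosen point
    for _ in range(rounds):
        d2 = [nearest(p, centers)[0] for p in data]
        cost = sum(d2)
        if cost == 0:
            break
        # Each point is sampled independently with probability ell * d^2 / cost,
        # which is why this step can be distributed over partitions of the data.
        centers += [p for p, d in zip(data, d2) if rng.random() < ell * d / cost]
    # Weight each candidate by the number of data points closest to it.
    weights = [0] * len(centers)
    for p in data:
        weights[nearest(p, centers)[1]] += 1
    return weighted_dsquared_seeding(centers, weights, k, rng)
```

The sketch uses only the standard library to stay self-contained. A practical implementation would vectorize the distance computations and distribute the sampling rounds across workers.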