Paper title
Sketch-and-Lift: Scalable Subsampled Semidefinite Program for $K$-means Clustering
Paper authors
Paper abstract
Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite their high accuracy, semidefinite programs are often too slow in practice and scale poorly to large (or even moderate) datasets. In this paper, we introduce a linear-time algorithm for approximating an SDP-relaxed $K$-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. We show that the SL approach enjoys an exact recovery threshold similar to that of the $K$-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive, with enhanced theoretical properties, when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the proposed method is statistically more accurate than state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and that its accuracy is comparable to that of the original $K$-means SDP at a substantially reduced runtime.
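The two-stage structure described in the abstract (cluster a subsample, then assign every point to its nearest centroid) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the subsample solver here is a plain Lloyd iteration standing in for the $K$-means SDP, and the names `sketch_and_lift`, `lloyd_kmeans`, `nearest_centroid`, and `sq_dists`, as well as the uniform subsampling of size `m`, are assumptions made for illustration. Any routine that returns labels for the subsample, such as an SDP solve followed by rounding, could be passed as `solve_subsample` instead.

```python
import numpy as np


def sq_dists(X, centers):
    """Squared Euclidean distances, shape (n_points, n_centers)."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)


def nearest_centroid(X, centers):
    """Assign each row of X to the index of its closest center."""
    return sq_dists(X, centers).argmin(axis=1)


def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Stand-in subsample solver (Lloyd's algorithm, NOT the paper's SDP)."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialisation: start from one random point, then
    # repeatedly add the point farthest from the centers chosen so far.
    centers = X[rng.integers(len(X))][None, :].astype(float)
    for _ in range(1, k):
        far = sq_dists(X, centers).min(axis=1).argmax()
        centers = np.vstack([centers, X[far]])
    for _ in range(n_iter):
        labels = nearest_centroid(X, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return nearest_centroid(X, centers)


def sketch_and_lift(X, k, m, solve_subsample=lloyd_kmeans, seed=0):
    """Cluster a size-m subsample, then lift the labels to all points."""
    rng = np.random.default_rng(seed)
    # Sketch: draw m points uniformly without replacement (an assumption).
    idx = rng.choice(len(X), size=m, replace=False)
    sub = X[idx]
    sub_labels = solve_subsample(sub, k)
    if len(np.unique(sub_labels)) < k:
        raise ValueError("subsample solver returned an empty cluster")
    # Centroids are estimated from the subsample only.
    centers = np.stack([sub[sub_labels == j].mean(axis=0) for j in range(k)])
    # Lift: nearest-centroid rounding over the full dataset.
    return nearest_centroid(X, centers)


# Usage: two well-separated Gaussian clusters, 2000 points, subsample of 100.
rng = np.random.default_rng(1)
truth = np.repeat([0, 1], 1000)
X = rng.normal(size=(2000, 2)) + 10.0 * truth[:, None]
labels = sketch_and_lift(X, k=2, m=100)
```

Only the subsample of size `m` is handed to the expensive solver, while the lifting step costs one pass over the data per centroid, which is where the linear dependence on the full sample size comes from in this sketch.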