论文标题
计算有效的稀疏集群
Computationally efficient sparse clustering
论文作者
论文摘要
当中心的均值稀疏并且其尺寸可能大于样本量时,我们研究聚类的统计和计算限制。我们的理论分析着重于$ x_i =z_iθ+ \ varepsilon_i,〜z_i \ in \ { - 1,1 \},〜\ \ varepsilon_i \ varrepsilon_i \ themsim \ Mathcal \ Mathcal {n}(n}(n}(0,i),它具有两种cluspers centers centers $ ch $θ$ $θ$。我们提供了基于稀疏PCA的新稀疏聚类算法的有限样本分析,并表明它在制度$ \ |θ\ |中实现了最小值最佳分类率。 \ rightarrow \ infty $。 我们的结果要求稀疏度比样本量的平方根慢。使用最新的计算下限框架(低度的可能性比率),我们提供了证据表明,对于任何多项式时间聚类算法都必须在BBP阈值以下取得成功。这是根据减少和统计查询下限的现有证据的补充。与这些现有结果相比,我们涵盖了一组更广泛的参数制度,并对所需的运行时以及可实现的错误分类错误提供了更精确的理解。我们的结果表明,基于低度多项式的大型测试甚至无法解决弱测试任务。
We study statistical and computational limits of clustering when the means of the centres are sparse and their dimension is possibly much larger than the sample size. Our theoretical analysis focuses on the model $X_i = z_i θ+ \varepsilon_i, ~z_i \in \{-1,1\}, ~\varepsilon_i \thicksim \mathcal{N}(0,I)$, which has two clusters with centres $θ$ and $-θ$. We provide a finite sample analysis of a new sparse clustering algorithm based on sparse PCA and show that it achieves the minimax optimal misclustering rate in the regime $\|θ\| \rightarrow \infty$. Our results require the sparsity to grow slower than the square root of the sample size. Using a recent framework for computational lower bounds -- the low-degree likelihood ratio -- we give evidence that this condition is necessary for any polynomial-time clustering algorithm to succeed below the BBP threshold. This complements existing evidence based on reductions and statistical query lower bounds. Compared to these existing results, we cover a wider set of parameter regimes and give a more precise understanding of the runtime required and the misclustering error achievable. Our results imply that a large class of tests based on low-degree polynomials fail to solve even the weak testing task.