论文标题
可扩展的私人聚类通过层次分离的树
Scalable Differentially Private Clustering via Hierarchically Separated Trees
论文作者
论文摘要
我们在$ d $ dimensional Euclidean Space中研究私人$ k $ -Median和$ k $ -Means聚类问题。通过利用树的嵌入,我们提供了一种有效且易于实现的算法,该算法具有与最先进的非私人方法的经验竞争。我们证明我们的方法计算一个最多$ o(d^{3/2} \ log n)\ cdot opt + o(k d^2 \ log^2 n /ε^2)$的解决方案,其中$ε$是隐私保证。 (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points.我们还表明,在大规模分布式计算环境中,我们的方法与并行化。特别是我们表明,我们的私人算法可以在sublinear内存制度中的对数MPC回合中实现。最后,我们通过经验评估来补充理论分析,证明了该算法与其他隐私聚类基线相比的效率和准确性。
We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / ε^2)$, where $ε$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other privacy clustering baselines.