Title

Sparse Coresets for SVD on Infinite Streams

Authors

Vladimir Braverman, Dan Feldman, Harry Lang, Daniela Rus, Adiel Statman

Abstract


In streaming Singular Value Decomposition (SVD), $d$-dimensional rows of a possibly infinite matrix arrive sequentially as points in $\mathbb{R}^d$. An $ε$-coreset is a (much smaller) matrix whose sum of squared distances of the rows to any hyperplane approximates that of the original matrix to a $1 \pm ε$ factor. Our main result is that we can maintain an $ε$-coreset while storing only $O(d \log^2 d / ε^2)$ rows. Known lower bounds of $Ω(d / ε^2)$ rows show that this is nearly optimal. Moreover, each row of our coreset is a weighted subset of the input rows. This is highly desirable since it: (1) preserves sparsity; (2) is easily interpretable; (3) avoids precision errors; (4) applies to problems with constraints on the input. Previous streaming results for SVD that return a subset of the input required storing $Ω(d \log^3 n / ε^2)$ rows, where $n$ is the number of rows seen so far. Our algorithm, with storage independent of $n$, is the first result that uses finite memory on infinite streams. We support our findings with experiments on the Wikipedia dataset benchmarked against state-of-the-art algorithms.
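The $ε$-coreset guarantee in the abstract can be made concrete with a minimal sketch. This is not the paper's algorithm, only an illustration of the definition: the cost of a matrix with respect to a hyperplane through the origin with unit normal $u$ is $\sum_i \langle a_i, u \rangle^2$ (the sum of squared distances of the rows to that hyperplane), and a weighted subset is an $ε$-coreset if its weighted cost matches the full cost to a $1 \pm ε$ factor for every $u$. The toy matrix below has repeated rows, so keeping one copy per distinct row with weight equal to its multiplicity gives an exact ($ε = 0$) coreset; all names here are hypothetical.

```python
# Illustration of the eps-coreset definition (not the paper's construction).
# cost(rows, weights, u) = weighted sum of squared distances of the rows to
# the hyperplane {x : <x, u> = 0}, where u is a unit normal vector.
import math
import random

def cost(rows, weights, u):
    """Weighted sum of squared distances of rows to the hyperplane {x: <x,u>=0}."""
    return sum(w * sum(a * b for a, b in zip(r, u)) ** 2
               for r, w in zip(rows, weights))

# Toy input matrix A in R^2 with repeated rows, all with unit weight.
A = [(1.0, 2.0), (1.0, 2.0), (3.0, -1.0), (3.0, -1.0), (3.0, -1.0)]
ones = [1.0] * len(A)

# Coreset C: one copy of each distinct row, weighted by its multiplicity.
# This stores 2 rows instead of 5 and reproduces the cost exactly (eps = 0).
C = [(1.0, 2.0), (3.0, -1.0)]
w = [2.0, 3.0]

random.seed(0)
for _ in range(5):
    theta = random.uniform(0.0, 2.0 * math.pi)
    u = (math.cos(theta), math.sin(theta))  # random unit normal in R^2
    assert math.isclose(cost(A, ones, u), cost(C, w, u))
```

For a general input, exact duplication is of course unavailable; the paper's contribution is maintaining such a weighted row subset with only $O(d \log^2 d / ε^2)$ rows on an infinite stream, at the price of a $1 \pm ε$ approximation rather than exact equality.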
