Paper Title

Sketching Transformed Matrices with Applications to Natural Language Processing

Paper Authors

Yingyu Liang, Zhao Song, Mengdi Wang, Lin F. Yang, Xin Yang

Abstract

Suppose we are given a large matrix $A=(a_{i,j})$ that cannot be stored in memory but resides on disk or is presented as a data stream. However, we need to compute a matrix decomposition of the entry-wise transformed matrix $f(A):=(f(a_{i,j}))$ for some function $f$. Is it possible to do this in a space-efficient way? Many machine learning applications indeed need to deal with such large transformed matrices; for example, word embedding methods in NLP need to work with the pointwise mutual information (PMI) matrix, while the entry-wise transformation makes it difficult to apply known linear-algebraic tools. Existing approaches to this problem either need to store the whole matrix and perform the entry-wise transformation afterwards, which is space-consuming or infeasible, or need to redesign the learning method, which is application-specific and requires substantial remodeling. In this paper, we first propose a space-efficient sketching algorithm for computing the product of a given small matrix with the transformed matrix. It works for a general family of transformations with provable small error bounds and can thus be used as a primitive in downstream learning tasks. We then apply this primitive to a concrete application: low-rank approximation. We show that our approach obtains small error and is efficient in both space and time. We complement our theoretical results with experiments on synthetic and real data.
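To make the problem setup concrete, here is a minimal sketch (not the paper's actual algorithm) of how a product $S \cdot f(A)$ with a small sketch matrix $S$ can be accumulated one row of $A$ at a time, applying the entry-wise transform $f$ on the fly so that $f(A)$ is never materialized. The function names, dimensions, and the PMI-style transform below are illustrative assumptions.

```python
import math
import random

def stream_sketch_transformed(rows, S, f):
    """Accumulate S @ f(A) one row of A at a time, without storing f(A).

    Illustrative only (not the paper's algorithm): A has n rows; each row
    a_i is transformed entry-wise by f as it arrives and folded into the
    product with a small sketch matrix S (k x n, k << n), using
    S @ f(A) = sum_i outer(S[:, i], f(a_i)).
    """
    k = len(S)
    result = None
    for i, row in enumerate(rows):
        fr = [f(x) for x in row]               # entry-wise transform of one row
        if result is None:
            result = [[0.0] * len(fr) for _ in range(k)]
        for r in range(k):                     # add outer(S[:, i], f(a_i))
            s_ri = S[r][i]
            for c, v in enumerate(fr):
                result[r][c] += s_ri * v
    return result

# Toy demo: f is a PMI-style transform log(max(x, 1)).
random.seed(0)
n, d, k = 20, 5, 3
A = [[random.randint(0, 9) for _ in range(d)] for _ in range(n)]   # pretend A is huge
S = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]     # small sketch matrix
f = lambda x: math.log(max(x, 1))

sketched = stream_sketch_transformed(iter(A), S, f)

# Same product computed the expensive way, materializing f(A):
fA = [[f(x) for x in row] for row in A]
exact = [[sum(S[r][i] * fA[i][c] for i in range(n)) for c in range(d)]
         for r in range(k)]
assert all(abs(sketched[r][c] - exact[r][c]) < 1e-9
           for r in range(k) for c in range(d))
```

The streaming accumulation is exactly $S \cdot f(A)$ here; the point is only that it needs $O(k \cdot d)$ working memory rather than storing all of $f(A)$. The paper's contribution is choosing $S$ and the sketching scheme so that such products provably support downstream tasks like low-rank approximation.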
