Paper Title

Learning the mapping $\mathbf{x}\mapsto \sum_{i=1}^d x_i^2$: the cost of finding the needle in a haystack

Paper Authors

Jiefu Zhang, Leonardo Zepeda-Núñez, Yuan Yao, Lin Lin

Abstract

The task of using machine learning to approximate the mapping $\mathbf{x}\mapsto\sum_{i=1}^d x_i^2$ with $x_i\in[-1,1]$ seems to be a trivial one. Given the knowledge of the separable structure of the function, one can design a sparse network to represent the function very accurately, or even exactly. When such structural information is not available, and we may only use a dense neural network, the optimization procedure to find the sparse network embedded in the dense network is similar to finding the needle in a haystack, using a given number of samples of the function. We demonstrate that the cost (measured by sample complexity) of finding the needle is directly related to the Barron norm of the function. While only a small number of samples is needed to train a sparse network, the dense network trained with the same number of samples exhibits large test loss and a large generalization gap. In order to control the size of the generalization gap, we find that the use of explicit regularization becomes increasingly more important as $d$ increases. The numerically observed sample complexity with explicit regularization scales as $\mathcal{O}(d^{2.5})$, which is in fact better than the theoretically predicted sample complexity that scales as $\mathcal{O}(d^{4})$. Without explicit regularization (also called implicit regularization), the numerically observed sample complexity is significantly higher and is close to $\mathcal{O}(d^{4.5})$.
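The "sparse network" mentioned in the abstract can be illustrated with a small sketch. The snippet below is a minimal illustration only, not the paper's actual architecture: the identity first layer, squared activation, and all-ones readout are assumptions chosen to make the separable structure explicit. It shows how a network with only $d$ nonzero first-layer weights represents $f(\mathbf{x})=\sum_{i=1}^d x_i^2$ exactly, i.e., the "needle" that a dense network must instead locate from samples.

```python
import numpy as np

# Minimal sketch (not the paper's exact construction): with a squared
# activation, a sparse two-layer network reproduces f(x) = sum_i x_i^2
# exactly, while a dense network has to recover this structure from data.
d = 8
rng = np.random.default_rng(0)

def sparse_net(x):
    # First layer: identity weights, i.e. one nonzero entry per hidden unit
    # (d nonzeros out of d*d possible). Activation: elementwise square.
    # Second layer: all-ones readout summing the squared features.
    W1 = np.eye(d)
    a = np.ones(d)
    return a @ (W1 @ x) ** 2

x = rng.uniform(-1.0, 1.0, size=d)   # inputs drawn from [-1, 1]^d
print(sparse_net(x), np.sum(x ** 2))  # identical up to floating point
```

A dense network of the same width has $O(d^2)$ first-layer weights; training it from samples amounts to searching for this sparse configuration inside the dense parameterization, which is the "haystack" whose sample cost the paper quantifies.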
