Paper Title

A Corrective View of Neural Networks: Representation, Memorization and Learning

Paper Authors

Guy Bresler, Dheeraj Nagaraj

Paper Abstract

We develop a corrective mechanism for neural network approximation: the total available non-linear units are divided into multiple groups and the first group approximates the function under consideration, the second group approximates the error in approximation produced by the first group and corrects it, the third group approximates the error produced by the first and second groups together, and so on. This technique yields several new representation and learning results for neural networks. First, we show that two-layer neural networks in the random features regime (RF) can memorize arbitrary labels for arbitrary points under a Euclidean distance separation condition using $\tilde{O}(n)$ ReLUs, which is optimal in $n$ up to logarithmic factors. Next, we give a powerful representation result for two-layer neural networks with ReLUs and smoothed ReLUs which can achieve a squared error of at most $ε$ with $O(C(a,d)ε^{-1/(a+1)})$ units for $a \in \mathbb{N}\cup\{0\}$ when the function is smooth enough (roughly, when it has $Θ(ad)$ bounded derivatives). In certain cases $d$ can be replaced with an effective dimension $q \ll d$. Previous results of this type implement Taylor series approximation using deep architectures. We also consider three-layer neural networks and show that the corrective mechanism yields faster representation rates for smooth radial functions. Lastly, we obtain the first $O(\mathrm{subpoly}(1/ε))$ upper bound on the number of neurons required for a two-layer network to learn low-degree polynomials up to squared error $ε$ via gradient descent. Even though deep networks can express these polynomials with $O(\mathrm{polylog}(1/ε))$ neurons, the best known learning bounds for this problem require $\mathrm{poly}(1/ε)$ neurons.
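To make the corrective mechanism concrete, here is a minimal NumPy sketch (not the paper's actual construction or analysis): successive groups of fixed random ReLU features are introduced one at a time, and each group's output weights are fit, by ridge regression, to the residual left by all previous groups. The function names, group sizes, and hyperparameters below are illustrative assumptions, not part of the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def corrective_rf_fit(X, y, n_groups=4, units_per_group=64, ridge=1e-6, seed=0):
    """Sketch of the corrective idea in the random-features regime:
    each group of ReLU units corrects the residual of the groups before it."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    residual = y.copy()
    groups = []  # list of (W, b, coef) triples, one per group
    for _ in range(n_groups):
        # first-layer weights are random and stay fixed (random-features regime)
        W = rng.standard_normal((d, units_per_group)) / np.sqrt(d)
        b = rng.standard_normal(units_per_group)
        Phi = relu(X @ W + b)  # n x units_per_group feature matrix
        # fit only the output weights of this group to the current residual
        coef = np.linalg.solve(
            Phi.T @ Phi + ridge * np.eye(units_per_group), Phi.T @ residual
        )
        residual = residual - Phi @ coef  # the next group approximates what is left
        groups.append((W, b, coef))
    return groups

def corrective_rf_predict(groups, X):
    """The final approximation is the sum of all groups' corrections."""
    return sum(relu(X @ W + b) @ coef for W, b, coef in groups)

# tiny usage example on synthetic data
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 5))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2
    groups = corrective_rf_fit(X, y)
    err = np.mean((corrective_rf_predict(groups, X) - y) ** 2)
    print(f"training squared error after {len(groups)} corrective groups: {err:.4f}")
```

In this toy version the training squared error is non-increasing across groups, since each group only has to fit whatever error the earlier groups left behind; the paper's results quantify how fast such corrections can shrink the error for the function classes described above.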
