Paper Title
The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich Regimes
Paper Authors
Paper Abstract
For small training set sizes $P$, the generalization error of wide neural networks is well-approximated by the error of an infinite-width neural network (NN), in either the kernel or mean-field/feature-learning regime. However, after a critical sample size $P^*$, we empirically find that the generalization of finite-width networks becomes worse than that of the infinite-width network. In this work, we empirically study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$. We find that finite-size effects can become relevant at very small dataset sizes, on the order of $P^* \sim \sqrt{N}$, for polynomial regression with ReLU networks. We discuss the source of these effects using an argument based on the variance of the NN's final neural tangent kernel (NTK). This transition can be pushed to larger $P$ by enhancing feature learning or by ensemble-averaging the networks. We find that the learning curve for regression with the final NTK is an accurate approximation of the NN learning curve. Using this, we provide a toy model which also exhibits $P^* \sim \sqrt{N}$ scaling and has $P$-dependent benefits from feature learning.
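To make the "regression with the final NTK" idea concrete, below is a minimal sketch (not the authors' code) of kernel regression with a finite-width ReLU network's empirical NTK. The width $N$, sample size $P$, ridge term, and 1-D polynomial target are illustrative assumptions; the kernel is evaluated at whatever parameters are passed in (e.g., initialization, or the parameters of a trained network if one wants the final NTK as in the abstract).

```python
# Sketch: kernel (ridge) regression with the empirical NTK of a finite-width
# two-layer ReLU network, as a proxy for the network's learned function.
# All hyperparameters (P, N, reg, target) are illustrative, not from the paper.
import jax
import jax.numpy as jnp


def init_params(key, width, d_in=1):
    k1, k2 = jax.random.split(key)
    return {
        "W1": jax.random.normal(k1, (d_in, width)) / jnp.sqrt(d_in),
        "W2": jax.random.normal(k2, (width, 1)) / jnp.sqrt(width),
    }


def mlp(params, x):
    # Two-layer ReLU network with NTK-style 1/sqrt(width) scaling.
    h = jax.nn.relu(x @ params["W1"])
    return (h @ params["W2"]).squeeze(-1)


def empirical_ntk(params, x1, x2):
    # NTK(x1, x2) = J(x1) J(x2)^T, where J is the Jacobian of the network
    # outputs with respect to all parameters, flattened per example.
    def flat_jac(x):
        jac = jax.jacobian(mlp)(params, x)  # pytree of per-parameter Jacobians
        leaves = [j.reshape(x.shape[0], -1) for j in jax.tree_util.tree_leaves(jac)]
        return jnp.concatenate(leaves, axis=1)

    return flat_jac(x1) @ flat_jac(x2).T


def ntk_regression_predict(params, x_train, y_train, x_test, reg=1e-6):
    # Kernel ridge regression with the network's empirical NTK.
    K = empirical_ntk(params, x_train, x_train)
    K_star = empirical_ntk(params, x_test, x_train)
    alpha = jnp.linalg.solve(K + reg * jnp.eye(K.shape[0]), y_train)
    return K_star @ alpha


if __name__ == "__main__":
    key = jax.random.PRNGKey(0)
    P, N = 32, 256  # sample size and width (illustrative choices)
    x_train = jax.random.uniform(key, (P, 1), minval=-1.0, maxval=1.0)
    y_train = 3 * x_train[:, 0] ** 2 - 1  # hypothetical polynomial target
    x_test = jnp.linspace(-1, 1, 200).reshape(-1, 1)

    params = init_params(key, width=N)  # could instead be trained parameters
    y_pred = ntk_regression_predict(params, x_train, y_train, x_test)
    print("NTK-regression predictions (first 5):", y_pred[:5])
```

Sweeping `P` and `N` in a loop of this kind (and averaging over random seeds, or over an ensemble of networks) is one way to trace out learning curves and look for the kind of finite-width departure the abstract describes; the specific experimental protocol here is an assumption, not the paper's.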