Paper Title
Implicit Bias of Gradient Descent for Mean Squared Error Regression with Two-Layer Wide Neural Networks
Paper Authors
Paper Abstract
We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{-1/2}$ of the function which fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative weighted by a curvature penalty that depends on the probability distribution that is used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and hence the solution function is the natural cubic spline interpolation of the training data. For stochastic gradient descent we obtain the same implicit bias result. We obtain a similar result for different activation functions. For multivariate regression we show an analogous result, whereby the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.
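As an illustrative sketch (our notation, not the paper's exact statement): writing $f^{(0)}$ for the network function at initialization, $\rho$ for the curvature penalty induced by the parameter initialization, and $(x_i, y_i)_{i=1}^{N}$ for the training data, the univariate implicit-bias result described above corresponds to a variational problem of the form
\[
  f^{*} \;=\; \operatorname*{arg\,min}_{f \,:\, f(x_i) = y_i,\; i = 1, \dots, N} \;
  \int \rho(x)\,\bigl(f''(x) - (f^{(0)})''(x)\bigr)^{2}\, dx ,
\]
with the trained width-$n$ network lying within $O(n^{-1/2})$ of $f^{*}$; whether $\rho$ enters as a multiplicative weight or as its reciprocal follows the paper's convention for the curvature penalty. For constant $\rho$ the objective reduces to $\int \bigl(f''(x) - (f^{(0)})''(x)\bigr)^{2}\, dx$, whose constrained minimizer differs from $f^{(0)}$ by a natural cubic spline interpolant, consistent with the cubic-spline statement in the abstract.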