Paper Title
Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent
Paper Authors
Paper Abstract
In the vanishing learning rate regime, stochastic gradient descent (SGD) is now relatively well understood. In this work, we propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime. The focus is on deriving exactly solvable results and discussing their implications. The main contributions of this work are to derive the stationary distribution for discrete-time SGD in a quadratic loss function with and without momentum; in particular, one implication of our result is that the fluctuation caused by discrete-time dynamics takes a distorted shape and is dramatically larger than a continuous-time theory could predict. Examples of applications of the proposed theory considered in this work include the approximation error of variants of SGD, the effect of minibatch noise, the optimal Bayesian inference, the escape rate from a sharp minimum, and the stationary covariance of a few second-order methods including damped Newton's method, natural gradient descent, and Adam.
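
To make the abstract's central claim concrete, here is a minimal sketch (illustrative, not the paper's code or notation) that simulates discrete-time SGD on a one-dimensional quadratic loss L(x) = a*x^2/2 with additive Gaussian gradient noise. For the linear recursion x_{t+1} = x_t - lr*(a*x_t + eps_t) with eps_t ~ N(0, sigma^2), the exact stationary variance is lr*sigma^2 / (a*(2 - lr*a)), whereas the continuous-time (SDE/Ornstein-Uhlenbeck) approximation predicts lr*sigma^2 / (2*a). The parameter names a, lr, and sigma are illustrative choices.

```python
# Minimal sketch: discrete-time SGD on a 1D quadratic loss L(x) = a*x^2/2
# with additive Gaussian gradient noise.
# Update rule: x_{t+1} = x_t - lr * (a*x_t + eps_t), eps_t ~ N(0, sigma^2).
import numpy as np

rng = np.random.default_rng(0)
a, lr, sigma = 1.0, 1.5, 1.0      # lr*a close to 2 makes the gap dramatic
n_steps, burn_in = 200_000, 10_000

x, samples = 0.0, []
for t in range(n_steps):
    grad = a * x + sigma * rng.standard_normal()  # noisy gradient of L
    x -= lr * grad
    if t >= burn_in:                              # discard transient
        samples.append(x)

empirical = np.var(samples)
discrete = lr * sigma**2 / (a * (2 - lr * a))  # exact discrete-time variance
continuous = lr * sigma**2 / (2 * a)           # continuous-time (SDE) prediction
print(f"empirical variance:        {empirical:.3f}")
print(f"discrete-time prediction:  {discrete:.3f}")
print(f"continuous-time prediction:{continuous:.3f}")
```

With lr*a = 1.5, the exact discrete-time variance (3.0) is four times the continuous-time prediction (0.75), and it diverges as lr*a approaches 2, illustrating the abstract's point that finite-learning-rate fluctuations can be dramatically larger than a continuous-time theory would predict.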