Paper Title
Dynamic of Stochastic Gradient Descent with State-Dependent Noise
Paper Authors
Paper Abstract
Stochastic gradient descent (SGD) and its variants are the mainstream methods for training deep neural networks. Since neural networks are non-convex, a growing body of work studies the dynamic behavior of SGD and its impact on generalization, especially the efficiency of escaping from local minima. However, these works make the over-simplified assumption that the covariance of the noise in SGD is constant (or can be upper bounded by a constant), although it is in fact state-dependent. In this work, we conduct a formal study of the dynamic behavior of SGD with state-dependent noise. Specifically, we show that the covariance of the SGD noise in a local region around a local minimum is a quadratic function of the state. We therefore propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD. We prove that the power-law dynamic can escape from sharp minima exponentially faster than from flat minima, whereas previously studied dynamics can escape from sharp minima only polynomially faster than from flat minima. Our experiments confirm our theoretical results. Inspired by the theory, we propose to inject additional state-dependent noise into (large-batch) SGD to further improve its generalization ability. Experiments verify that our method is effective.
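The abstract does not specify how the additional state-dependent noise is constructed. Below is a minimal PyTorch sketch of the general idea it describes: an SGD update that injects noise whose variance is a quadratic function of the current parameter value. The function name, the hyperparameters `sigma0`/`sigma1`, and the choice of measuring the state relative to the origin are illustrative assumptions, not the authors' implementation.

```python
import torch

def sgd_step_with_state_dependent_noise(params, lr=0.1, sigma0=1e-4, sigma1=1e-3):
    """One (large-batch) SGD step with extra injected noise whose variance is a
    quadratic function of the state (illustrative sketch, not the paper's code).

    Assumed noise variance per parameter: sigma0 + sigma1 * p**2.
    """
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            # Hypothetical quadratic state-dependent standard deviation.
            noise_std = torch.sqrt(sigma0 + sigma1 * p.pow(2))
            noise = noise_std * torch.randn_like(p)
            # Standard SGD step plus the state-dependent noise term.
            p.add_(-(lr * p.grad + noise))

# Usage sketch: call the step after a normal backward pass.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(256, 10), torch.randn(256, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
sgd_step_with_state_dependent_noise(model.parameters())
```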