Paper Title
A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima
Paper Authors
Paper Abstract
Stochastic Gradient Descent (SGD) and its variants are the mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well. However, it is mathematically unclear how deep learning can select a flat minimum among so many minima. To answer the question quantitatively, we develop a density diffusion theory (DDT) to reveal how minima selection quantitatively depends on the minima sharpness and the hyperparameters. To the best of our knowledge, we are the first to theoretically and empirically prove that, benefiting from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima. We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima, in terms of the ratio of batch size to learning rate. Thus, large-batch training cannot efficiently search for flat minima within a realistic computational time.
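To make the abstract's central contrast concrete, below is a minimal 1-D toy sketch in Python. It is not the authors' code and does not reproduce the paper's theorem; the dynamics and all parameter values (learning rate, barrier depth, noise scales) are illustrative assumptions. Noisy gradient descent escapes a quadratic valley of fixed depth, with noise variance either scaling with the curvature h (mimicking the Hessian-dependent covariance of stochastic gradient noise) or held fixed (mimicking GD with injected white noise).

```python
# Toy 1-D sketch (not the paper's code or exact theorem): compare how long noisy
# gradient descent needs to escape a quadratic valley of fixed depth when the
# noise variance scales with the curvature h ("Hessian-dependent") versus when
# it is fixed isotropic white noise. All numerical values are illustrative
# assumptions chosen so the simulation finishes quickly.
import numpy as np

rng = np.random.default_rng(0)

def mean_escape_iters(h, eta, noise_var, barrier=0.4, trials=200, max_iters=300_000):
    """Mean iterations until theta leaves the valley {theta : 0.5*h*theta**2 < barrier}
    under the update  theta <- theta - eta*h*theta + eta*xi,  xi ~ N(0, noise_var)."""
    boundary = np.sqrt(2.0 * barrier / h)      # loss exceeds `barrier` beyond this point
    theta = np.zeros(trials)
    escape_t = np.full(trials, max_iters)      # censored at max_iters if never escaped
    alive = np.ones(trials, dtype=bool)
    for t in range(1, max_iters + 1):
        n = int(alive.sum())
        if n == 0:
            break
        xi = rng.normal(0.0, np.sqrt(noise_var), size=n)
        theta[alive] = theta[alive] - eta * h * theta[alive] + eta * xi
        escaped_now = alive & (np.abs(theta) >= boundary)
        escape_t[escaped_now] = t
        alive &= ~escaped_now
    return float(escape_t.mean())

eta = 0.1
for h in (4.0, 1.0):                           # sharp valley vs. flat valley, same depth
    curv_noise = mean_escape_iters(h, eta, noise_var=1.0 * h)  # variance proportional to h
    white_noise = mean_escape_iters(h, eta, noise_var=2.0)     # fixed injected white noise
    print(f"h = {h}: curvature-scaled noise ~ {curv_noise:8.0f} iters, "
          f"white noise ~ {white_noise:8.0f} iters")
```

Under these illustrative settings, curvature-scaled noise escapes the sharp valley orders of magnitude faster than the flat one, while the white-noise escape time changes only mildly with sharpness; this mirrors the qualitative exponential-versus-polynomial contrast stated in the abstract, whose exact exponents in the paper additionally involve the batch size and learning rate.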