Paper Title
Adaptive Inertia: Disentangling the Effects of Adaptive Learning Rate and Momentum
Paper Authors
Paper Abstract
Adaptive Moment Estimation (Adam), which combines Adaptive Learning Rate and Momentum, is arguably the most popular stochastic optimizer for accelerating the training of deep neural networks. However, it is empirically known that Adam often generalizes worse than Stochastic Gradient Descent (SGD). This paper aims to unveil the mystery of this behavior within the diffusion theoretical framework. Specifically, we disentangle the effects of the Adaptive Learning Rate and the Momentum of the Adam dynamics on saddle-point escaping and flat minima selection. We prove that Adaptive Learning Rate can escape saddle points efficiently, but cannot select flat minima as SGD does. In contrast, Momentum provides a drift effect that helps the training process pass through saddle points, and barely affects flat minima selection. This partly explains why SGD (with Momentum) generalizes better, while Adam generalizes worse but converges faster. Furthermore, motivated by this analysis, we design a novel adaptive optimization framework named Adaptive Inertia, which uses parameter-wise adaptive inertia to accelerate training and provably favors flat minima as much as SGD does. Our extensive experiments demonstrate that the proposed adaptive inertia method can generalize significantly better than SGD and conventional adaptive gradient methods.
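The abstract only names the idea of "parameter-wise adaptive inertia", so the following is a minimal NumPy sketch of one way such an update could look, not the paper's official algorithm: each coordinate's momentum coefficient is derived from its (bias-corrected) second-moment estimate, giving low-noise coordinates more inertia, while the step size itself is not rescaled by the second moment as in Adam. The function name `adaptive_inertia_step`, the hyperparameters `beta0`, `beta2`, `eps`, and the toy quadratic problem are illustrative assumptions.

```python
import numpy as np

def adaptive_inertia_step(theta, grad, state, lr=1e-3, beta0=0.1, beta2=0.99, eps=1e-3):
    """One step of an illustrative parameter-wise adaptive-inertia update (a sketch,
    not the paper's exact algorithm). Coordinates whose squared-gradient estimate is
    small relative to the mean get a momentum coefficient close to 1 (more inertia);
    coordinates with large squared gradients get less inertia."""
    state["t"] += 1
    t = state["t"]

    # Exponential moving average of squared gradients, with Adam-style bias correction.
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad ** 2
    v_hat = state["v"] / (1.0 - beta2 ** t)

    # Parameter-wise inertia: normalize the second moment by its mean and map it to a
    # momentum coefficient in [0, 1 - eps].
    v_ratio = v_hat / (np.mean(v_hat) + 1e-12)
    beta1 = np.clip(1.0 - beta0 * v_ratio, 0.0, 1.0 - eps)

    # Momentum buffer with per-parameter coefficients; approximate bias correction via
    # the running product of the per-parameter beta1.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["beta1_prod"] *= beta1
    m_hat = state["m"] / (1.0 - state["beta1_prod"] + 1e-12)

    # Unlike Adam, the step is NOT divided by sqrt(v_hat): the adaptivity lives in the
    # inertia, so the SGD-like preference for flat minima is meant to be preserved.
    return theta - lr * m_hat, state


# Usage example on an assumed ill-conditioned toy quadratic f(theta) = 0.5 * sum(h * theta**2).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = np.array([100.0, 1.0, 0.01])           # toy curvatures (assumed)
    theta = np.ones(3)
    state = {"t": 0, "m": np.zeros(3), "v": np.zeros(3), "beta1_prod": np.ones(3)}

    for _ in range(2000):
        grad = h * theta + 0.01 * rng.standard_normal(3)   # noisy gradient
        theta, state = adaptive_inertia_step(theta, grad, state, lr=1e-2)

    print("final theta:", theta)
```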