Paper Title

Adam Can Converge Without Any Modification On Update Rules

Paper Authors

Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, Zhi-Quan Luo

Paper Abstract

Ever since Reddi et al. (2018) pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and works well in practice. Why is there a gap between theory and practice? We point out that there is a mismatch between the settings of theory and practice: Reddi et al. (2018) pick the problem after picking the hyperparameters of Adam, i.e., $(β_1, β_2)$, while practical applications often fix the problem first and then tune $(β_1, β_2)$. Based on this observation, we conjecture that the empirical convergence can be theoretically justified only if we change the order of picking the problem and the hyperparameters. In this work, we confirm this conjecture. We prove that, when $β_2$ is large and $β_1 < \sqrt{β_2} < 1$, Adam converges to a neighborhood of critical points. The size of the neighborhood is proportional to the variance of the stochastic gradients. Under an extra condition (the strong growth condition), Adam converges to critical points. It is worth mentioning that our results cover a wide range of hyperparameters: as $β_2$ increases, our convergence result can cover any $β_1 \in [0,1)$, including $β_1=0.9$, the default setting in deep learning libraries. To our knowledge, this is the first result showing that Adam can converge without any modification on its update rules. Further, our analysis does not require assumptions of bounded gradients or bounded 2nd-order momentum. When $β_2$ is small, we further point out a large region of $(β_1,β_2)$ where Adam can diverge to infinity. Our divergence result considers the same setting as our convergence result, indicating a phase transition from divergence to convergence when increasing $β_2$. These positive and negative results can provide suggestions on how to tune Adam hyperparameters.
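To make the discussion concrete, below is a minimal sketch of the vanilla Adam update (following the standard Kingma & Ba formulation, with no modification to the rule) together with a check of the $β_1 < \sqrt{β_2} < 1$ hyperparameter condition highlighted in the abstract. The function names, the helper `in_convergence_region`, and the example values are illustrative assumptions, not code from the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One vanilla Adam update (Kingma & Ba), with no modification to the rule.

    t is the 1-indexed step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad        # 1st-order momentum
    v = beta2 * v + (1 - beta2) * grad ** 2   # 2nd-order momentum
    m_hat = m / (1 - beta1 ** t)              # bias-corrected momentum
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

def in_convergence_region(beta1, beta2):
    """Check the condition beta1 < sqrt(beta2) < 1 from the convergence result (illustrative helper)."""
    return beta1 < np.sqrt(beta2) < 1.0

# The default library setting (0.9, 0.999) satisfies the condition;
# a small beta2 such as 0.5 does not, which is the regime where divergence can occur.
print(in_convergence_region(0.9, 0.999))  # True
print(in_convergence_region(0.9, 0.5))    # False
```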
