Paper Title

Gradient descent with momentum --- to accelerate or to super-accelerate?

Authors

Goran Nakerst, John Brennan, Masudul Haque

Abstract

We consider gradient descent with `momentum', a widely used method for loss function minimization in machine learning. This method is often used with `Nesterov acceleration', meaning that the gradient is evaluated not at the current position in parameter space, but at the estimated position after one step. In this work, we show that the algorithm can be improved by extending this `acceleration' --- by using the gradient at an estimated position several steps ahead rather than just one step ahead. How far one looks ahead in this `super-acceleration' algorithm is determined by a new hyperparameter. Considering a one-parameter quadratic loss function, the optimal value of the super-acceleration can be exactly calculated and analytically estimated. We show explicitly that super-accelerating the momentum algorithm is beneficial, not only for this idealized problem, but also for several synthetic loss landscapes and for the MNIST classification task with neural networks. Super-acceleration is also easy to incorporate into adaptive algorithms like RMSProp or Adam, and is shown to improve these algorithms.
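To make the abstract's description concrete, the sketch below shows one way a multi-step look-ahead could be wired into a momentum update for a one-parameter quadratic loss, the idealized problem mentioned above. The exact update rule and the hyperparameter name `sigma` are illustrative assumptions, not the paper's definitive implementation; with `sigma = 0` the sketch reduces to classical momentum and with `sigma = 1` to the usual Nesterov-style one-step look-ahead.

```python
import numpy as np

def super_accelerated_momentum(grad, theta0, lr=0.01, mu=0.9, sigma=3.0, n_steps=500):
    """Momentum gradient descent with a look-ahead ('super-acceleration') factor.

    sigma = 0 recovers classical momentum, sigma = 1 a Nesterov-style
    one-step look-ahead, and sigma > 1 looks several steps ahead. The precise
    update form and the name `sigma` are assumptions made for illustration.
    """
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        # Evaluate the gradient at the estimated position sigma steps ahead,
        # instead of at the current position theta.
        lookahead = theta + sigma * mu * v
        v = mu * v - lr * grad(lookahead)
        theta = theta + v
    return theta

# Toy usage: the one-parameter quadratic loss L(theta) = 0.5 * k * theta**2,
# whose gradient is k * theta.
k = 5.0
theta_min = super_accelerated_momentum(lambda th: k * th, theta0=[1.0])
print(theta_min)  # should be close to 0
```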
