Paper Title
Tackling benign nonconvexity with smoothing and stochastic gradients
Paper Authors
Paper Abstract
Non-convex optimization problems are ubiquitous in machine learning, especially in deep learning. While such complex problems can often be optimized successfully in practice with stochastic gradient descent (SGD), theoretical analysis cannot adequately explain this success. In particular, standard analyses do not show global convergence of SGD on non-convex functions; they only show convergence to stationary points (which may be local minima or saddle points). We identify a broad class of non-convex functions for which we can show that perturbed SGD (gradient descent perturbed by stochastic noise, covering SGD as a special case) converges to a global minimum (or a neighborhood thereof), in contrast to gradient descent without noise, which can get stuck in local minima far from a global solution. For example, on non-convex functions that are relatively close to a convex-like (strongly convex or PL) function, we show that SGD can converge linearly to a global optimum.
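
The short Python sketch below is only a toy illustration of the phenomenon the abstract describes, not the paper's algorithm, analysis, or experiments. It compares plain gradient descent with gradient descent whose gradients are perturbed by Gaussian noise, on an assumed one-dimensional objective that is a strongly convex quadratic plus a small bounded oscillation (so it has many shallow spurious local minima but a unique global minimum at x = 0). The objective, step size, noise level, and the two-phase schedule are all illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

# Toy objective (an assumption for illustration):
# f(x) = 0.5 * x**2 + 0.2 * (1 - cos(10 * x))
# A strongly convex quadratic plus a small bounded oscillation: many spurious
# local minima, unique global minimum at x = 0 with f(0) = 0.
def f(x):
    return 0.5 * x**2 + 0.2 * (1.0 - np.cos(10.0 * x))

def grad(x):
    return x + 2.0 * np.sin(10.0 * x)

def descend(x0, steps=5000, lr=0.01, noise_std=0.0):
    """Gradient descent on f; if noise_std > 0, each gradient is perturbed by
    Gaussian noise, a stand-in for the stochastic-gradient noise in the abstract."""
    x = x0
    for _ in range(steps):
        g = grad(x) + noise_std * rng.standard_normal()
        x -= lr * g
    return x

x0 = 3.0
x_gd = descend(x0)                                 # plain GD: converges to a spurious local minimum
x_pgd = descend(x0, noise_std=4.0)                 # noisy phase: hops over the shallow barriers
x_pgd = descend(x_pgd, steps=500, noise_std=0.0)   # short noiseless phase to settle into a minimizer

print(f"plain GD     : x = {x_gd:6.3f}, f(x) = {f(x_gd):.3f}")    # x around 1.78, f around 1.7
print(f"perturbed GD : x = {x_pgd:6.3f}, f(x) = {f(x_pgd):.3f}")  # typically |x| < 0.7, f < 0.2

With these (assumed) settings, the noiseless run stays stuck in the basin it first reaches, while the injected noise typically lets the iterate escape the shallow basins and end up in a small neighborhood of the global minimum, mirroring the contrast between gradient descent and perturbed SGD stated in the abstract.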