Paper Title

PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization

Authors

Zhize Li, Hongyan Bao, Xiangliang Zhang, Peter Richtárik

Abstract

In this paper, we propose a novel stochastic gradient estimator -- ProbAbilistic Gradient Estimator (PAGE) -- for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p_t$, or reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability $1-p_t$. We give a simple formula for the optimal choice of $p_t$. Moreover, we prove the first tight lower bound $\Omega(n+\frac{\sqrt{n}}{\epsilon^2})$ for nonconvex finite-sum problems, which also leads to a tight lower bound $\Omega(b+\frac{\sqrt{b}}{\epsilon^2})$ for nonconvex online problems, where $b := \min\{\frac{\sigma^2}{\epsilon^2}, n\}$. Then, we show that PAGE obtains the optimal convergence results $O(n+\frac{\sqrt{n}}{\epsilon^2})$ (finite-sum) and $O(b+\frac{\sqrt{b}}{\epsilon^2})$ (online), matching our lower bounds for both nonconvex finite-sum and online problems. Besides, we also show that for nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition, PAGE automatically switches to a faster linear convergence rate $O(\cdot\log \frac{1}{\epsilon})$. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch, showing that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating the optimal theoretical results and confirming the practical superiority of PAGE.
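
To make the update rule described in the abstract concrete, below is a minimal NumPy sketch of a PAGE-style optimizer for finite-sum problems. It is an illustrative sketch, not the authors' implementation: the function `page_optimize`, the per-component gradient oracle `grad_i`, and all parameter names are hypothetical, and the recomputation probability `p` is passed in as a constant rather than derived from the paper's formula for the optimal $p_t$.

```python
import numpy as np

def page_optimize(grad_i, n, x0, eta, b, b_prime, p, num_iters, rng=None):
    """Minimal PAGE-style sketch (hypothetical helper, not the authors' code).

    grad_i(i, x): gradient of the i-th component function at point x.
    With probability p the minibatch gradient (size b) is recomputed;
    otherwise the previous estimate is reused and corrected with gradient
    differences on a much smaller minibatch (size b_prime).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)

    # Initial gradient estimate from a (large) minibatch of size b.
    idx = rng.choice(n, size=b, replace=False)
    g = np.mean([grad_i(i, x) for i in idx], axis=0)

    for _ in range(num_iters):
        x_prev, x = x, x - eta * g  # gradient step with the current estimate
        if rng.random() < p:
            # Vanilla minibatch SGD branch: recompute the gradient estimate.
            idx = rng.choice(n, size=b, replace=False)
            g = np.mean([grad_i(i, x) for i in idx], axis=0)
        else:
            # Cheap branch: reuse g, corrected by a small gradient difference.
            idx = rng.choice(n, size=b_prime, replace=False)
            g = g + np.mean([grad_i(i, x) - grad_i(i, x_prev) for i in idx],
                            axis=0)
    return x
```

The design point mirrors the abstract: the expensive minibatch gradient of size `b` is recomputed only with probability `p`, while the common case updates the running estimate using gradient differences on a much smaller minibatch of size `b_prime`.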
