Paper Title
On the Convergence of SGD with Biased Gradients
Paper Authors
Paper Abstract
We analyze the complexity of biased stochastic gradient methods (SGD), where individual updates are corrupted by deterministic, i.e., biased, error terms. We derive convergence results for smooth (non-convex) functions and give improved rates under the Polyak-Łojasiewicz condition. We quantify how the magnitude of the bias impacts the attainable accuracy and the convergence rates (sometimes leading to divergence). Our framework covers many applications where only biased gradient updates are available, or where they are preferred over unbiased ones for performance reasons. For instance, in the domain of distributed learning, biased gradient compression techniques such as top-k compression have been proposed as a tool to alleviate the communication bottleneck, and in derivative-free optimization, only biased gradient estimators can be queried. We discuss a few guiding examples that show the broad applicability of our analysis.
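To make the setting concrete, here is a minimal Python sketch of SGD whose update direction is produced by top-k gradient compression, one of the biased-gradient examples named in the abstract. The function names (top_k, biased_sgd, noisy_grad), the toy least-squares objective, and the step-size choice are illustrative assumptions for this sketch, not code or parameters from the paper.

```python
import numpy as np

def top_k(g, k):
    """Keep the k largest-magnitude entries of g and zero out the rest.
    This compressor is biased: the compressed vector is a deterministic
    distortion of g, not an unbiased estimate of it."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def biased_sgd(grad_fn, x0, k, lr, num_steps=500):
    """SGD where every update uses a top-k compressed, hence biased,
    (stochastic) gradient."""
    x = x0.copy()
    for _ in range(num_steps):
        g = grad_fn(x)              # possibly noisy gradient estimate
        x = x - lr * top_k(g, k)    # biased update direction
    return x

# Toy smooth problem: f(x) = 0.5 * ||A x - b||^2 with additive gradient noise
# (an illustrative assumption, not an experiment from the paper).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)

def noisy_grad(x):
    return A.T @ (A @ x - b) + 0.01 * rng.standard_normal(10)

L = np.linalg.norm(A, 2) ** 2                        # smoothness constant of f
x_out = biased_sgd(noisy_grad, x0=np.zeros(10), k=3, lr=1.0 / L)
print("final objective:", 0.5 * np.linalg.norm(A @ x_out - b) ** 2)
```

With k equal to the full dimension the update reduces to plain SGD; shrinking k saves communication but increases the bias of the update, which is the kind of accuracy and convergence-rate trade-off the abstract refers to.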