Paper Title

Nested Gradient Codes for Straggler Mitigation in Distributed Machine Learning

Paper Authors

Luis Maßny, Christoph Hofmeister, Maximilian Egger, Rawad Bitar, Antonia Wachter-Zeh

Paper Abstract

We consider distributed learning in the presence of slow and unresponsive worker nodes, referred to as stragglers. To mitigate the effect of stragglers, gradient coding redundantly assigns partial computations to the workers such that the overall result can be recovered from only the non-straggling workers. Gradient codes are designed to tolerate a fixed number of stragglers. Since the number of stragglers in practice is random and unknown a priori, tolerating a fixed number of stragglers can yield a sub-optimal computation load and can result in higher latency. We propose a gradient coding scheme that can tolerate a flexible number of stragglers by carefully concatenating gradient codes for different straggler tolerances. Through proper task scheduling and a small amount of additional signaling, our scheme adapts the computation load of the workers to the actual number of stragglers. We analyze the latency of our proposed scheme and show that it has a significantly lower latency than gradient codes.
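To make the underlying idea of gradient coding concrete, the sketch below simulates the classic fractional repetition gradient code (Tandon et al.), not the nested construction proposed in this paper: each partition set is replicated across s+1 workers, each worker returns the sum of its partial gradients, and the full gradient is recovered from any n - s non-straggling workers. The worker count, straggler set, and toy gradients are illustrative assumptions.

```python
# Minimal sketch of a fractional repetition gradient code (Tandon et al.);
# NOT the nested scheme of this paper. All parameters below are toy assumptions.
import numpy as np

n = 6          # number of workers
s = 2          # stragglers to tolerate; (s + 1) must divide n
k = n          # number of data partitions

rng = np.random.default_rng(0)
partial_grads = rng.normal(size=(k, 4))   # toy per-partition gradients (dimension 4)
full_gradient = partial_grads.sum(axis=0)

# Task assignment: split workers into s+1 groups; within each group the k
# partitions are divided disjointly, so every partition set is replicated
# once per group (s+1 copies in total).
group_size = n // (s + 1)
parts_per_worker = k // group_size
assignment = {}  # worker id -> list of partition indices
for g in range(s + 1):
    for j in range(group_size):
        worker = g * group_size + j
        assignment[worker] = list(range(j * parts_per_worker, (j + 1) * parts_per_worker))

# Each worker returns the sum of the gradients of its assigned partitions.
responses = {w: partial_grads[parts].sum(axis=0) for w, parts in assignment.items()}

# Simulate up to s stragglers whose responses never arrive.
stragglers = {0, 4}
received = {w: r for w, r in responses.items() if w not in stragglers}

# Decoding: for every partition set (position j within a group), at least one
# of its s+1 replica workers is non-straggling; sum one response per set.
decoded = np.zeros_like(full_gradient)
for j in range(group_size):
    replicas = [g * group_size + j for g in range(s + 1)]
    survivor = next(w for w in replicas if w in received)
    decoded += received[survivor]

assert np.allclose(decoded, full_gradient)
print("recovered full gradient despite stragglers", stragglers)
```

This fixed-tolerance code always assigns s+1 partitions per worker regardless of how many stragglers actually occur; the paper's contribution is to concatenate such codes for different straggler tolerances so the computation load adapts to the realized number of stragglers.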
