Paper Title
Berrut Approximated Coded Computing: Straggler Resistance Beyond Polynomial Computing
Paper Authors
Paper Abstract
One of the major challenges in using distributed learning to train complicated models with large data sets is dealing with the straggler effect. As a solution, coded computing has recently been proposed to efficiently add redundancy to the computation tasks. In this technique, coding is applied across the data set and the computation is carried out on the coded data, so that the results of any subset of worker nodes of a certain size suffice to recover the final result. These approaches face three major challenges: (1) they are limited to polynomial function computations; (2) the number of servers that must be waited for grows with the product of the data set size and the model complexity (the degree of the polynomial), which can be prohibitively large; and (3) they are not numerically stable for computation over the reals. In this paper, we propose Berrut Approximated Coded Computing (BACC) as an alternative approach that is not limited to polynomial function computation. In addition, the master node can approximately compute the final result using the outcomes of any arbitrary subset of available worker nodes. The approximation scheme is proven to be numerically stable and to have low computational complexity. Furthermore, the accuracy of the approximation is established theoretically and verified by simulation results in various settings, such as distributed learning problems. In particular, BACC is used to train a deep neural network on a cluster of servers, where it outperforms repetitive computation (repetition coding) in terms of the rate of convergence.
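The abstract describes the scheme only at a high level. As a rough illustration of the interpolation primitive BACC builds on, the following minimal Python/NumPy sketch implements Berrut's first-form barycentric rational interpolant and a toy encode/compute/decode round. The function name, the Chebyshev-type evaluation points, the sizes K and N, and the choice f = tanh are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def berrut_interpolate(nodes, values, targets):
    """Berrut's first-form barycentric rational interpolant.

    r(x) = sum_j [w_j / (x - x_j)] y_j  /  sum_j [w_j / (x - x_j)],
    with weights w_j = (-1)^j. It interpolates any data on any set of
    distinct ordered real nodes, has no real poles, and is numerically
    stable to evaluate.
    """
    nodes = np.asarray(nodes, dtype=float)
    values = np.asarray(values, dtype=float)
    targets = np.atleast_1d(np.asarray(targets, dtype=float))
    weights = (-1.0) ** np.arange(len(nodes))          # Berrut weights w_j = (-1)^j
    out = np.empty(targets.shape + values.shape[1:])
    for i, x in enumerate(targets):
        diff = x - nodes
        hit = np.isclose(diff, 0.0)
        if hit.any():                                  # target coincides with a node:
            out[i] = values[np.argmax(hit)]            # interpolation is exact there
            continue
        coeff = weights / diff
        out[i] = (coeff @ values) / coeff.sum()
    return out

# Toy end-to-end run: K data chunks, N workers, some of them stragglers.
rng = np.random.default_rng(0)
K, N = 8, 20
alpha = np.cos((2 * np.arange(K) + 1) * np.pi / (2 * K))  # Chebyshev-type points (assumed)
z = np.cos((2 * np.arange(N) + 1) * np.pi / (2 * N))

X = rng.standard_normal((K, 5))                  # K data chunks of dimension 5
coded = berrut_interpolate(alpha, X, z)          # encoding: one coded chunk per worker

f = np.tanh                                      # a non-polynomial computation (assumed)
returned = np.sort(rng.choice(N, size=14, replace=False))  # any 14 workers respond
results = f(coded[returned])                     # outcomes of the non-stragglers

approx = berrut_interpolate(z[returned], results, alpha)   # decode at the original points
print(np.max(np.abs(approx - f(X))))             # approximation error for f(X_k)
```

Because the interpolant is defined for any set of distinct nodes, the decoder can use whichever workers happen to respond, which is what gives the scheme its straggler resistance; in this sketch the approximation quality depends on K, N, the number of returned results, and the smoothness of f.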