Paper Title
Aergia: Leveraging Heterogeneity in Federated Learning Systems
Paper Authors
Paper Abstract
Federated Learning (FL) is a popular approach for distributed deep learning that prevents the pooling of large amounts of data in a central server. FL relies on clients to update a global model using their local datasets. Classical FL algorithms use a central federator that, for each training round, waits for all clients to send their model updates before aggregating them. In practical deployments, clients might have different computing powers and network capabilities, which might lead slow clients to become performance bottlenecks. Previous works have suggested using a deadline for each learning round so that the federator ignores the late updates of slow clients, or so that clients send partially trained models before the deadline. To speed up the training process, we instead propose Aergia, a novel approach where slow clients (i) freeze the part of their model that is the most computationally intensive to train; (ii) train the unfrozen part of their model; and (iii) offload the training of the frozen part of their model to a faster client that trains it using its own dataset. The offloading decisions are orchestrated by the federator based on the training speed that clients report and on the similarities between their datasets, which are privately evaluated thanks to a trusted execution environment. We show through extensive experiments that Aergia maintains high accuracy and significantly reduces the training time under heterogeneous settings by up to 27% and 53% compared to FedAvg and TiFL, respectively.
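To make the freeze-and-offload idea in the abstract concrete, the following is a minimal PyTorch-style sketch of the slow client's local side only: freezing the compute-intensive early layers and training just the remaining head. The SmallCNN model, the choice of which layers count as "expensive", and the freeze_expensive_part helper are illustrative assumptions, not the paper's actual implementation; the offloading of the frozen part to a faster client and the federator's orchestration are not shown.

# Illustrative sketch only, under the assumptions stated above.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Early feature extractor: assumed to be the most expensive part to train.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        # Late classifier head: cheap enough for the slow client to keep training.
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

def freeze_expensive_part(model: SmallCNN) -> None:
    # The slow client freezes the feature extractor; only the head keeps gradients.
    for p in model.features.parameters():
        p.requires_grad_(False)

model = SmallCNN()
freeze_expensive_part(model)

# Optimize only the unfrozen parameters; in Aergia, training of the frozen
# part would be offloaded to a faster client with a similar dataset.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.01
)

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

In this sketch, freezing is expressed purely through requires_grad, so the frozen parameters stay in the local model unchanged until the federator merges back the version trained remotely; how that merge is performed is part of the paper's protocol and is not reproduced here.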