Paper Title

Srifty: Swift and Thrifty Distributed Training on the Cloud

Paper Authors

Liang Luo, Peter West, Arvind Krishnamurthy, Luis Ceze

Paper Abstract

Finding the best VM configuration is key to achieving lower cost and higher throughput, two primary concerns in cloud-based distributed neural network (NN) training today. Optimal VM selection that meets user constraints requires efficiently navigating a large search space while controlling for the performance variance associated with sharing cloud instances and networks. In this work, we characterize this variance in the context of distributed NN training and present results of a comprehensive throughput and cost-efficiency study we conducted across a wide array of instances to prune the search space for the optimal VM. Using insights from these studies, we built Srifty, a system that combines runtime profiling with learned performance models to accurately predict training performance and find the best VM choice that satisfies user constraints, potentially leveraging both heterogeneous setups and spot instances. We integrated Srifty with PyTorch and evaluated it on Amazon EC2. We conducted a large-scale generalization study of Srifty across more than 2K training setups on EC2. Our results show that Srifty achieves an iteration latency prediction error of 8%, and its VM instance recommendations offer significant throughput gain and cost reduction while satisfying user constraints compared to existing solutions in complex, real-world scenarios.
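The abstract describes Srifty's selection step only at a high level. As a rough illustration, below is a minimal, hypothetical Python sketch of the constrained VM-selection idea: given per-iteration latency predictions for candidate cluster configurations (which Srifty would obtain from runtime profiling plus a learned performance model) and their hourly prices, pick the most cost-efficient configuration that still meets a user throughput constraint. All instance names, prices, and latencies here are invented placeholders, not the paper's actual method or measured EC2 data.

```python
# Hypothetical sketch of constrained VM selection: choose the candidate
# cluster with the lowest cost per training sample among those meeting a
# user-specified throughput floor. Values below are illustrative only.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str                   # cluster description (hypothetical)
    price_per_hour: float       # cluster price in $/hour (placeholder)
    predicted_latency_s: float  # per-iteration latency from a learned model

def throughput(c: Candidate, batch_size: int) -> float:
    """Samples/second implied by the predicted iteration latency."""
    return batch_size / c.predicted_latency_s

def cost_per_sample(c: Candidate, batch_size: int) -> float:
    """Dollars per training sample at the predicted throughput."""
    return (c.price_per_hour / 3600.0) / throughput(c, batch_size)

def select(candidates, batch_size, min_throughput):
    """Cheapest-per-sample candidate satisfying the throughput constraint."""
    feasible = [c for c in candidates
                if throughput(c, batch_size) >= min_throughput]
    if not feasible:
        return None
    return min(feasible, key=lambda c: cost_per_sample(c, batch_size))

if __name__ == "__main__":
    # Placeholder candidates; in Srifty these latency predictions would come
    # from runtime profiling combined with the learned performance model.
    candidates = [
        Candidate("4x gpu-small", 12.0, 0.50),
        Candidate("2x gpu-large", 18.0, 0.30),
        Candidate("8x gpu-spot",   9.0, 0.55),  # spot pricing, hypothetical
    ]
    print(select(candidates, batch_size=256, min_throughput=400.0))
```

The same structure extends to the heterogeneous and spot-instance settings the abstract mentions, by adding mixed-instance configurations and spot prices to the candidate list.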
