A Simulation Platform for Multi-tenant Machine Learning Services on Thousands of GPUs

Paper title

A Simulation Platform for Multi-tenant Machine Learning Services on Thousands of GPUs

Authors

Ruofan Liang, Bingsheng He, Shengen Yan, Peng Sun

Abstract

Multi-tenant machine learning services have become emerging data-intensive workloads in data centers with heavy usage of GPU resources. Due to the large scale, many tuning parameters and heavy resource usage, it is usually impractical to evaluate and benchmark those machine learning services on real clusters. In this demonstration, we present AnalySIM, a cluster simulator that allows efficient design explorations for multi-tenant machine learning services. Specifically, by trace-driven cluster workload simulation, AnalySIM can easily test and analyze various scheduling policies in a number of performance metrics such as GPU resource utilization. AnalySIM simulates the cluster computational resource based on both physical topology and logical partition. The tool has been used in SenseTime to understand the impact of different scheduling policies with the trace from a real production cluster of over 1000 GPUs. We find that preemption and migration are able to significantly reduce average job completion time and mitigate the resource fragmentation problem.
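
To make the idea of trace-driven scheduling evaluation concrete, the following is a minimal sketch of replaying a job trace against a fixed GPU pool and comparing two policies on average job completion time and GPU utilization. It is not AnalySIM's implementation: the `Job` fields, the `simulate` function, the two policies, and the three-job trace are all hypothetical, and the sketch ignores physical topology, logical partitions, preemption, and migration.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Job:
    """One record of a hypothetical job trace."""
    job_id: int
    arrival: float   # submission time
    gpus: int        # number of GPUs requested
    duration: float  # run time once started


# A policy looks at the waiting queue and the free GPU count and returns
# the index of the job to start next, or None to start nothing.
Policy = Callable[[List[Job], int], Optional[int]]


def fifo(queue: List[Job], free: int) -> Optional[int]:
    """Strict FIFO: only the head of the queue may start."""
    return 0 if queue and queue[0].gpus <= free else None


def shortest_fit(queue: List[Job], free: int) -> Optional[int]:
    """Start the shortest waiting job that fits in the free GPUs."""
    fits = [k for k, job in enumerate(queue) if job.gpus <= free]
    return min(fits, key=lambda k: queue[k].duration) if fits else None


def simulate(trace: List[Job], total_gpus: int, pick: Policy) -> dict:
    """Replay the trace event by event and report two summary metrics."""
    if not trace:
        raise ValueError("trace is empty")
    if any(job.gpus > total_gpus for job in trace):
        raise ValueError("a job requests more GPUs than the cluster has")

    pending = sorted(trace, key=lambda job: job.arrival)
    queue: List[Job] = []
    running: list = []          # min-heap of (finish_time, job_id, gpus)
    finish = {}                 # job_id -> completion time
    free, now, i = total_gpus, 0.0, 0
    busy_gpu_time = 0.0

    while i < len(pending) or queue or running:
        # Admit every job that has arrived by the current time.
        while i < len(pending) and pending[i].arrival <= now:
            queue.append(pending[i])
            i += 1
        # Let the policy start jobs until it declines.
        while (idx := pick(queue, free)) is not None:
            job = queue.pop(idx)
            free -= job.gpus
            busy_gpu_time += job.gpus * job.duration
            heapq.heappush(running, (now + job.duration, job.job_id, job.gpus))
        # Jump to the next event: a completion or an arrival.
        events = []
        if running:
            events.append(running[0][0])
        if i < len(pending):
            events.append(pending[i].arrival)
        now = min(events)
        while running and running[0][0] <= now:
            done_at, job_id, gpus = heapq.heappop(running)
            free += gpus
            finish[job_id] = done_at

    makespan = max(finish.values()) - pending[0].arrival
    avg_jct = sum(finish[j.job_id] - j.arrival for j in trace) / len(trace)
    utilization = busy_gpu_time / (total_gpus * makespan) if makespan > 0 else 0.0
    return {"avg_jct": avg_jct, "utilization": utilization}


# Hypothetical trace: two full-cluster jobs followed by one short 1-GPU job.
TRACE = [Job(0, 0.0, 4, 10.0), Job(1, 1.0, 4, 10.0), Job(2, 2.0, 1, 1.0)]
```

Calling `simulate(TRACE, 4, fifo)` and `simulate(TRACE, 4, shortest_fit)` replays the same trace under both policies, so the resulting metrics can be compared directly. Passing the policy as a function keeps the replay loop unchanged when a new policy is tried.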

Paper title