Paper Title
Heterogeneous Acceleration Pipeline for Recommendation System Training
Paper Authors
Abstract
Recommendation models rely on deep learning networks and large embedding tables, making training both compute- and memory-intensive. These models are typically trained in hybrid CPU-GPU or GPU-only configurations. The hybrid mode combines the GPU's neural network acceleration with the CPU's memory capacity for storing and serving embedding tables, but it can incur significant CPU-to-GPU transfer time. In contrast, the GPU-only mode stores embedding tables in High Bandwidth Memory (HBM) across multiple GPUs. However, this approach is expensive and presents scaling concerns. This paper introduces Hotline, a heterogeneous acceleration pipeline that addresses these concerns. Hotline develops a data-aware and model-aware scheduling pipeline by leveraging the insight that only a few embedding entries are frequently accessed (popular). This approach uses CPU main memory for non-popular embeddings and the GPUs' HBM for popular embeddings. To achieve this, the Hotline accelerator fragments a mini-batch into popular and non-popular micro-batches. It gathers the necessary working parameters for non-popular micro-batches from the CPU while the GPUs execute the popular micro-batches. The hardware accelerator dynamically coordinates the execution of popular embeddings on GPUs and non-popular embeddings from the CPU's main memory. Evaluations on real-world datasets and models confirm Hotline's effectiveness, reducing average end-to-end training time by 2.2x compared to an Intel-optimized CPU-GPU DLRM baseline.
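The fragmentation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the data layout (each sample as a list of embedding-entry IDs) and the rule that a sample is "popular" only if all of its lookups hit GPU-resident entries are assumptions made for illustration.

```python
def fragment_minibatch(minibatch, popular_ids):
    """Split a mini-batch into popular and non-popular micro-batches.

    minibatch   : list of samples, each a list of embedding-entry IDs
                  (assumed layout, for illustration only)
    popular_ids : set of frequently accessed entry IDs assumed to be
                  resident in GPU HBM
    """
    popular_mb, nonpopular_mb = [], []
    for sample in minibatch:
        if all(eid in popular_ids for eid in sample):
            # All lookups hit GPU-resident entries: execute directly on GPU.
            popular_mb.append(sample)
        else:
            # At least one lookup needs parameters gathered from CPU DRAM.
            nonpopular_mb.append(sample)
    return popular_mb, nonpopular_mb

# Toy usage: suppose entries 0-9 are "popular" and live in GPU HBM.
popular = set(range(10))
batch = [[1, 2, 3], [4, 120, 7], [8, 9], [500, 1]]
pop_mb, nonpop_mb = fragment_minibatch(batch, popular)
```

In this toy run, samples whose IDs all fall below 10 land in the popular micro-batch; the rest are routed to the non-popular micro-batch so their parameters can be gathered from CPU memory while the GPU stays busy.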