Paper Title


FlexSA: Flexible Systolic Array Architecture for Efficient Pruned DNN Model Training

Paper Authors

Sangkug Lym, Mattan Erez

Paper Abstract


Modern deep learning models have high memory and computation cost. To make them fast and memory-cost efficient, structured model pruning is commonly used. We find that pruning a model using a common training accelerator with large systolic arrays is extremely performance-inefficient. To make a systolic array efficient for pruning and training, we propose FlexSA, a flexible systolic array architecture. FlexSA dynamically reconfigures the systolic array structure and offers multiple sub-systolic operating modes, which are designed for energy- and memory bandwidth-efficient processing of tensors with different sizes and shapes. We also present a compilation heuristic for tiling matrix-multiplication-and-accumulation operations in a training workload to best utilize the resources of FlexSA. Based on our evaluation, FlexSA with the proposed compilation heuristic improves compute resource utilization of pruning and training modern CNN models by 37% compared to a conventional training accelerator with a large systolic array. FlexSA also improves on-chip data reuse by 1.7X, saving 28% energy compared to naive systolic array splitting.
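The under-utilization problem the abstract refers to can be illustrated with a back-of-the-envelope model: when structured pruning shrinks a layer's GEMM dimensions, a large monolithic systolic array spends many PE-cycles on padded edge tiles, while smaller sub-arrays waste less. The sketch below is not from the paper; the array sizes and GEMM dimensions are hypothetical, and the model ignores pipeline fill/drain, data reuse, and memory bandwidth effects.

```python
import math

def gemm_utilization(m, n, k, pe_rows, pe_cols):
    """Fraction of PE-cycles doing useful MACs when an (m x k) by (k x n) GEMM
    is tiled onto a pe_rows x pe_cols systolic array, with edge tiles padded."""
    row_tiles = math.ceil(m / pe_rows)
    col_tiles = math.ceil(n / pe_cols)
    useful_macs = m * n * k
    issued_pe_cycles = row_tiles * pe_rows * col_tiles * pe_cols * k
    return useful_macs / issued_pe_cycles

# Hypothetical pruned convolution layer lowered to a GEMM:
# output channels pruned from 512 down to 300, 14x14 output map, 1152 reduction dim.
m, n, k = 300, 196, 1152

print(f"One 128x128 array   : {gemm_utilization(m, n, k, 128, 128):.1%} utilization")
print(f"Four 64x64 sub-arrays: {gemm_utilization(m, n, k, 64, 64):.1%} utilization")
```

With these example numbers the single large array keeps only about 60% of its PEs busy, while the same GEMM mapped onto smaller sub-arrays reaches roughly 72%; FlexSA's reconfigurable sub-systolic modes and tiling heuristic target exactly this gap while, unlike naive splitting, preserving on-chip data reuse.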
