DLFusion: An Auto-Tuning Compiler for Layer Fusion on Deep Neural Network Accelerator

Paper Title

DLFusion: An Auto-Tuning Compiler for Layer Fusion on Deep Neural Network Accelerator

Authors

Zihan Liu, Jingwen Leng, Quan Chen, Chao Li, Wenli Zheng, Li Li, Minyi Guo

Abstract

Many hardware vendors have introduced specialized deep neural network (DNN) accelerators owing to their superior performance and efficiency. As such, how to generate and optimize code for these hardware accelerators has become an important yet less explored problem. In this paper, we perform a compiler-stage optimization study using a novel and representative Cambricon DNN accelerator and demonstrate that code optimization knobs play an important role in unleashing the hardware's computational horsepower. However, even the two code optimization knobs studied here, namely the number of cores and the layer fusion scheme, present an enormous search space that rules out a naive brute-force search. This work introduces a joint, auto-tuning optimization framework to address this challenge. We first use a set of synthesized DNN layers to study the interplay between hardware performance and layer characteristics. Based on the insights, we extract the operation count and feature map channel size as each layer's characteristics and derive a joint optimization strategy to decide the performance-optimal core number and fusion scheme. We evaluate the proposed approach on a set of representative DNN models and show that it achieves a minimum speedup of 3.6x and a maximum of 7.9x over a baseline with no optimization. We also show that the achieved speedup is close to the oracle case, which is based on a reduced brute-force search, while requiring much less search time.
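
To give a rough sense of why the two knobs named in the abstract rule out brute-force search, the sketch below counts configurations under simplifying assumptions that are not taken from the paper: the network is a linear chain of layers, a fusion scheme is a partition of that chain into contiguous blocks, and each fused block independently picks one of a fixed number of core-count options. The function names and the example sizes are hypothetical.

```python
from itertools import product


def count_configs_closed_form(num_layers: int, core_options: int) -> int:
    """Count (fusion scheme, core count) configurations for a linear chain.

    Assumption (illustrative only): each of the num_layers - 1 boundaries
    between adjacent layers is either a cut or fused, and every resulting
    block chooses one of core_options core counts. Summing over all cut
    patterns gives core_options * (core_options + 1) ** (num_layers - 1).
    """
    if num_layers < 1 or core_options < 1:
        raise ValueError("num_layers and core_options must be positive")
    return core_options * (core_options + 1) ** (num_layers - 1)


def count_configs_brute_force(num_layers: int, core_options: int) -> int:
    """Enumerate every cut pattern explicitly to cross-check the closed form."""
    if num_layers < 1 or core_options < 1:
        raise ValueError("num_layers and core_options must be positive")
    total = 0
    for cuts in product((False, True), repeat=num_layers - 1):
        num_blocks = 1 + sum(cuts)  # each cut adds one more fused block
        total += core_options ** num_blocks
    return total


if __name__ == "__main__":
    # Hypothetical sizes: the count grows exponentially with network depth.
    for layers in (5, 10, 20):
        print(layers, count_configs_closed_form(layers, core_options=4))
```

Under these assumptions, a hypothetical 20-layer chain with 4 core-count options already has 4 × 5^19 configurations, roughly 7.6 × 10^13, and each one would need to be compiled and measured in a naive search.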
