Paper Title
Bit-Parallel Vector Composability for Neural Acceleration
Paper Authors
Paper Abstract
Conventional neural accelerators rely on isolated, self-sufficient functional units that perform an atomic operation while communicating their results through an operand delivery-aggregation logic. Each unit processes all the bits of its operands atomically and produces all the bits of the results in isolation. This paper explores a different design style, where each unit is responsible for only a slice of the bit-level operations, interleaving and combining the benefits of bit-level parallelism with the abundant data-level parallelism in deep neural networks. A dynamic collection of these units cooperates at runtime to collectively generate the bits of the results. Such cooperation requires extracting a new grouping among the bits, which is only possible if the operands and operations are vectorizable. The abundance of data-level parallelism and the mostly repeated execution patterns provide a unique opportunity to define and leverage this new dimension of Bit-Parallel Vector Composability. This design intersperses bit parallelism within data-level parallelism and dynamically interweaves the two together. As such, the building block of our neural accelerator is a Composable Vector Unit: a collection of Narrower-Bitwidth Vector Engines that are dynamically composed or decomposed at the bit granularity. Using six diverse CNN and LSTM deep networks, we evaluate this design style across four design points: with and without algorithmic bitwidth heterogeneity, and with and without the availability of high-bandwidth off-chip memory. Across these four design points, Bit-Parallel Vector Composability brings 1.4x to 3.5x speedup and 1.1x to 2.7x energy reduction. We also comprehensively compare our design style to the Nvidia RTX 2080 Ti GPU, which also supports INT-4 execution. The benefits range from 28.0x to 33.7x improvement in Performance-per-Watt.
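To make the composition idea concrete, the following is a minimal Python sketch, not the paper's hardware implementation, of how four hypothetical 4-bit narrow vector engines could cooperate via shift-and-add to produce an 8-bit by 8-bit vector dot product, or instead operate as independent 4-bit engines when decomposed. The function names and slice widths are illustrative assumptions.

```python
# Illustrative sketch of bit-parallel vector composability (assumed 4-bit slices,
# unsigned operands); the paper's actual microarchitecture may differ.
import numpy as np

def narrow_engine_dot(a_slices, b_slices):
    """Models one narrower-bitwidth vector engine: a dot product over 4-bit slices."""
    return int(np.dot(a_slices, b_slices))

def composed_dot(a, b):
    """Composes four 4-bit engines, with shift-and-add, into an 8-bit dot product."""
    a = np.asarray(a, dtype=np.uint32)
    b = np.asarray(b, dtype=np.uint32)
    a_lo, a_hi = a & 0xF, a >> 4          # split each operand into low/high 4-bit slices
    b_lo, b_hi = b & 0xF, b >> 4
    # Each partial dot product maps onto one narrow engine; shifts weight the slices.
    return ((narrow_engine_dot(a_hi, b_hi) << 8)
            + (narrow_engine_dot(a_hi, b_lo) << 4)
            + (narrow_engine_dot(a_lo, b_hi) << 4)
            + narrow_engine_dot(a_lo, b_lo))

# In decomposed mode, the same four engines could instead run four independent
# 4-bit dot products, which is how bitwidth heterogeneity would be exploited.
a = [200, 17, 3, 250]
b = [5, 99, 128, 7]
expected = int(np.dot(np.asarray(a, dtype=np.int64), np.asarray(b, dtype=np.int64)))
assert composed_dot(a, b) == expected
```

The sketch relies on the identity that an 8-bit value equals 16 times its high nibble plus its low nibble, so the full dot product decomposes into four narrower dot products weighted by powers of two; vectorizing each partial product across the data-level parallel dimension is what lets the narrow engines stay fully utilized.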