Paper Title

GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference

Paper Authors

Ali Hadi Zadeh, Isak Edo, Omar Mohamed Awad, Andreas Moshovos

Paper Abstract

Attention-based models have demonstrated remarkable success in various natural language understanding tasks. However, efficient execution remains a challenge for these models, which are memory-bound due to their massive number of parameters. We present GOBO, a model quantization technique that compresses the vast majority (typically 99.9%) of the 32-bit floating-point parameters of state-of-the-art BERT models and their variants to 3 bits while maintaining their accuracy. Unlike other quantization methods, GOBO does not require fine-tuning or retraining to compensate for the quantization error. We present two practical hardware applications of GOBO. In the first, GOBO reduces memory storage and traffic, and as a result inference latency and energy consumption. This GOBO memory compression mechanism is plug-in compatible with many architectures; we demonstrate it with the TPU, Eyeriss, and an architecture using Tensor Cores-like units. Second, we present a co-designed hardware architecture that also reduces computation. Uniquely, the GOBO architecture maintains most of the weights in 3b even during computation, a property that: (1) makes the processing elements area efficient, allowing us to pack more compute power per unit area, (2) replaces most multiply-accumulations with additions, and (3) reduces the off-chip traffic by amplifying on-chip memory capacity.
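
To make the abstract's central idea concrete, here is a minimal NumPy sketch of outlier-aware dictionary quantization in the spirit of GOBO: weights near the bulk of the distribution are mapped to 3-bit indices into a small shared dictionary of FP32 centroids, while the few outliers stay in FP32. The 3-sigma outlier rule and the Lloyd-style (k-means) centroid refinement are simplifying assumptions standing in for the paper's actual outlier-detection and centroid-selection procedures; the function names and parameters below are illustrative, not from the paper.

```python
import numpy as np

def gobo_quantize(weights, n_bits=3, outlier_threshold=3.0, iters=10):
    """Sketch: split weights into a Gaussian-like group (quantized to
    2**n_bits shared FP32 centroids) and FP32 outliers."""
    w = np.asarray(weights, dtype=np.float32).ravel()
    mu, sigma = w.mean(), w.std()

    # Assumption: weights beyond k standard deviations are outliers.
    outlier_mask = np.abs(w - mu) > outlier_threshold * sigma
    gaussian = w[~outlier_mask]

    # Pick 2**n_bits centroids; quantile init + Lloyd refinement is a
    # stand-in for the paper's centroid-selection method.
    n_bins = 2 ** n_bits
    centroids = np.quantile(gaussian, np.linspace(0.0, 1.0, n_bins))
    for _ in range(iters):
        idx = np.abs(gaussian[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_bins):
            members = gaussian[idx == k]
            if members.size:
                centroids[k] = members.mean()

    # Each Gaussian-group weight is now a 3-bit index into `centroids`.
    indices = np.abs(gaussian[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, indices.astype(np.uint8), outlier_mask, w[outlier_mask]

def gobo_dequantize(centroids, indices, outlier_mask, outliers):
    """Reconstruct the weight tensor from the compressed representation."""
    w = np.empty(outlier_mask.shape, dtype=np.float32)
    w[~outlier_mask] = centroids[indices]  # dictionary lookup
    w[outlier_mask] = outliers             # outliers kept in FP32
    return w
```

With n_bits=3, each Gaussian-group weight costs a 3-bit index instead of 32 bits, plus a dictionary of eight FP32 centroids per group and the sparse FP32 outlier list; this is the storage reduction the memory-compression application exploits, and keeping weights as small indices during computation is what enables the co-designed architecture's area-efficient processing elements.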
