Paper Title

GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference

Paper Authors

Ali Hadi Zadeh, Isak Edo, Omar Mohamed Awad, Andreas Moshovos

Paper Abstract

Attention-based models have demonstrated remarkable success in various natural language understanding tasks. However, efficient execution remains a challenge for these models, which are memory-bound due to their massive number of parameters. We present GOBO, a model quantization technique that compresses the vast majority (typically 99.9%) of the 32-bit floating-point parameters of state-of-the-art BERT models and their variants to 3 bits while maintaining their accuracy. Unlike other quantization methods, GOBO does not require fine-tuning or retraining to compensate for the quantization error. We present two practical hardware applications of GOBO. In the first, GOBO reduces memory storage and traffic, and as a result inference latency and energy consumption. This GOBO memory compression mechanism is plug-in compatible with many architectures; we demonstrate it with the TPU, Eyeriss, and an architecture using Tensor Cores-like units. Second, we present a co-designed hardware architecture that also reduces computation. Uniquely, the GOBO architecture maintains most of the weights in 3b even during computation, a property that: (1) makes the processing elements area efficient, allowing us to pack more compute power per unit area, (2) replaces most multiply-accumulations with additions, and (3) reduces the off-chip traffic by amplifying on-chip memory capacity.
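
To make the abstract's central idea concrete, here is a minimal NumPy sketch of outlier-aware dictionary quantization in the spirit of GOBO: weights near the bulk of the distribution are mapped to 3-bit indices into a small shared dictionary of FP32 centroids, while the few outliers stay in FP32. The 3-sigma outlier rule and the Lloyd-style (k-means) centroid refinement are simplifying assumptions standing in for the paper's actual outlier-detection and centroid-selection procedures; the function names and parameters below are illustrative, not from the paper.

```python
import numpy as np

def gobo_quantize(weights, n_bits=3, outlier_threshold=3.0, iters=10):
    """Sketch: split weights into a Gaussian-like group (quantized to
    2**n_bits shared FP32 centroids) and FP32 outliers."""
    w = np.asarray(weights, dtype=np.float32).ravel()
    mu, sigma = w.mean(), w.std()

    # Assumption: weights beyond k standard deviations are outliers.
    outlier_mask = np.abs(w - mu) > outlier_threshold * sigma
    gaussian = w[~outlier_mask]

    # Pick 2**n_bits centroids; quantile init + Lloyd refinement is a
    # stand-in for the paper's centroid-selection method.
    n_bins = 2 ** n_bits
    centroids = np.quantile(gaussian, np.linspace(0.0, 1.0, n_bins))
    for _ in range(iters):
        idx = np.abs(gaussian[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_bins):
            members = gaussian[idx == k]
            if members.size:
                centroids[k] = members.mean()

    # Each Gaussian-group weight is now a 3-bit index into `centroids`.
    indices = np.abs(gaussian[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids, indices.astype(np.uint8), outlier_mask, w[outlier_mask]

def gobo_dequantize(centroids, indices, outlier_mask, outliers):
    """Reconstruct the weight tensor from the compressed representation."""
    w = np.empty(outlier_mask.shape, dtype=np.float32)
    w[~outlier_mask] = centroids[indices]  # dictionary lookup
    w[outlier_mask] = outliers             # outliers kept in FP32
    return w
```

With n_bits=3, each Gaussian-group weight costs a 3-bit index instead of 32 bits, plus a dictionary of eight FP32 centroids per group and the sparse FP32 outlier list; this is the storage reduction the memory-compression application exploits, and keeping weights as small indices during computation is what enables the co-designed architecture's area-efficient processing elements.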
