Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models

Paper Title

Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models

Authors

Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Andreas Moshovos

Abstract

Ever larger and better Transformer models keep advancing state-of-the-art accuracy and capability for Natural Language Processing applications. These models demand more computational power, storage, and energy. Mokey reduces the footprint of state-of-the-art 32-bit or 16-bit floating-point Transformer models by quantizing all values to 4-bit indexes into dictionaries of representative 16-bit fixed-point centroids. Mokey does not need fine-tuning, an essential feature because the training resources or datasets are often not available to many. Exploiting the range of values that naturally occur in Transformer models, Mokey selects centroid values that also fit an exponential curve. This unique feature enables Mokey to replace the bulk of the original multiply-accumulate operations with narrow 3-bit fixed-point additions, resulting in an area- and energy-efficient hardware accelerator design. Over a set of state-of-the-art Transformer models, the Mokey accelerator delivers an order-of-magnitude improvement in energy efficiency over a Tensor Cores-based accelerator while improving performance by at least $4\times$ and as much as $15\times$, depending on the model and on-chip buffering capacity. Optionally, Mokey can be used as a memory compression assist for any other accelerator, transparently stashing wide floating-point or fixed-point activations or weights into narrow 4-bit indexes. Mokey proves superior to prior state-of-the-art quantization methods for Transformers.
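
The abstract does not spell out how exponentially spaced centroids turn multiplications into additions, so the following is only a minimal sketch of one way the idea could work, not the authors' implementation. It assumes a 4-bit index made of one sign bit and a 3-bit exponent, centroids of the form ±SCALE · BASE^e with no offset term, and floating-point arithmetic in place of 16-bit fixed point. The constants `BASE` and `SCALE` and the function names are hypothetical.

```python
import math
import random

BASE = 1.5      # hypothetical curve base, not a value from the paper
SCALE = 0.05    # hypothetical scale factor, not a value from the paper
EXP_MASK = 0b0111   # low 3 bits: exponent
SIGN_MASK = 0b1000  # top bit: sign

def centroid(index: int) -> float:
    """Decode a 4-bit index into its centroid value."""
    sign = -1.0 if index & SIGN_MASK else 1.0
    return sign * SCALE * BASE ** (index & EXP_MASK)

# 16-entry dictionary addressed by a 4-bit index.
DICTIONARY = [centroid(i) for i in range(16)]

def quantize(value: float) -> int:
    """Return the 4-bit index of the centroid nearest to value."""
    return min(range(16), key=lambda i: abs(DICTIONARY[i] - value))

def dot_by_exponent_add(w_idx, a_idx) -> float:
    """Dot product of two index vectors without per-element multiplies.

    Because BASE**i * BASE**j == BASE**(i + j), each product is identified
    by the sum of two 3-bit exponents. We count signed occurrences of each
    sum (0..14) and apply the 15 bucket weights once at the end.
    """
    counts = [0] * 15
    for wi, ai in zip(w_idx, a_idx):
        exp_sum = (wi & EXP_MASK) + (ai & EXP_MASK)   # narrow 3-bit addition
        sign = -1 if (wi ^ ai) & SIGN_MASK else 1      # XOR of the sign bits
        counts[exp_sum] += sign
    return SCALE * SCALE * sum(c * BASE ** s for s, c in enumerate(counts))

# Usage: quantize two random vectors and compare against the plain
# multiply-accumulate over the decoded centroids.
random.seed(0)
w_idx = [quantize(random.gauss(0.0, 0.2)) for _ in range(64)]
a_idx = [quantize(random.gauss(0.0, 0.2)) for _ in range(64)]
reference = sum(DICTIONARY[w] * DICTIONARY[a] for w, a in zip(w_idx, a_idx))
fast = dot_by_exponent_add(w_idx, a_idx)
print(math.isclose(fast, reference, rel_tol=1e-9, abs_tol=1e-9))
```

In this sketch the inner loop performs only a 3-bit addition, a sign XOR, and a counter update per element; the wide arithmetic is deferred to a single weighted sum over 15 buckets. How the paper fits the curve to real value distributions and handles outliers is not covered by the abstract and is left out here.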
