Paper Title
HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression
Paper Authors
Paper Abstract
Transformers have attained superior performance in natural language processing and computer vision. Their self-attention and feedforward layers are overparameterized, limiting inference speed and energy efficiency. Tensor decomposition is a promising technique to reduce parameter redundancy by leveraging tensor algebraic properties to express the parameters in a factorized form. Prior efforts used manual or heuristic factorization settings without hardware-aware customization, resulting in poor hardware efficiencies and large performance degradation. In this work, we propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions and automates the choice of tensorization shape and decomposition rank with hardware-aware co-optimization. We jointly investigate tensor contraction path optimizations and a fused Einsum mapping strategy to bridge the gap between theoretical benefits and real hardware efficiency improvement. Our two-stage knowledge distillation flow resolves the trainability bottleneck and thus significantly boosts the final accuracy of factorized Transformers. Overall, we experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss and achieve a better efficiency-accuracy Pareto frontier than hand-tuned and heuristic baselines.
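
The abstract's core ideas can be illustrated with a minimal sketch: a BERT-style feedforward weight is expressed in a factorized (tensor-train matrix) form, applied with a single fused einsum, and the einsum contraction path is optimized to reduce the actual FLOP count. The tensorization shapes (768 = 12x64, 3072 = 48x64) and rank (16) below are illustrative assumptions, not HEAT's searched settings; the paper automates exactly these choices with hardware-aware co-optimization.

```python
# Minimal sketch (assumed shapes/rank, not the paper's configuration):
# factorize a 768x3072 feedforward weight into two TT-matrix cores,
# run the layer as one einsum, and optimize the contraction path.
import numpy as np

batch, d_in, d_out = 32, 768, 3072
i1, i2 = 12, 64   # assumed tensorization of the input dim: 768 = 12*64
o1, o2 = 48, 64   # assumed tensorization of the output dim: 3072 = 48*64
rank = 16         # assumed decomposition rank

# Two cores replacing the dense weight W[(i1 i2), (o1 o2)].
core1 = 0.02 * np.random.randn(i1, o1, rank)   # (12, 48, r)
core2 = 0.02 * np.random.randn(rank, i2, o2)   # (r, 64, 64)

dense_params = d_in * d_out
factored_params = core1.size + core2.size
print(f"compression: {dense_params / factored_params:.1f}x "
      f"({dense_params} -> {factored_params} parameters)")

# Forward pass as a fused einsum over the factorized weight.
x = np.random.randn(batch, d_in).reshape(batch, i1, i2)
y = np.einsum('bij,iar,rjc->bac', x, core1, core2).reshape(batch, d_out)
print("output shape:", y.shape)

# Contraction path optimization: the order in which the three operands
# are contracted changes the FLOP count; einsum_path searches for a
# cheap ordering and reports the estimated speedup over a naive one.
path, report = np.einsum_path('bij,iar,rjc->bac', x, core1, core2,
                              optimize='optimal')
print(report)
```

With these illustrative numbers the factorized layer stores roughly 31x fewer parameters than the dense weight; whether that translates into real latency and energy savings depends on the contraction order and mapping, which is the gap the abstract says HEAT closes on hardware.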