Paper Title
HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression
Paper Authors
Paper Abstract
Transformers have attained superior performance in natural language processing and computer vision. Their self-attention and feedforward layers are overparameterized, limiting inference speed and energy efficiency. Tensor decomposition is a promising technique to reduce parameter redundancy by leveraging tensor algebraic properties to express the parameters in a factorized form. Prior efforts used manual or heuristic factorization settings without hardware-aware customization, resulting in poor hardware efficiencies and large performance degradation. In this work, we propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions and automates the choice of tensorization shape and decomposition rank with hardware-aware co-optimization. We jointly investigate tensor contraction path optimizations and a fused Einsum mapping strategy to bridge the gap between theoretical benefits and real hardware efficiency improvement. Our two-stage knowledge distillation flow resolves the trainability bottleneck and thus significantly boosts the final accuracy of factorized Transformers. Overall, we experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss and achieve a better efficiency-accuracy Pareto frontier than hand-tuned and heuristic baselines.
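
The abstract's core ideas can be illustrated with a minimal sketch: a BERT-style feedforward weight is expressed in a factorized (tensor-train matrix) form, applied with a single fused einsum, and the einsum contraction path is optimized to reduce the actual FLOP count. The tensorization shapes (768 = 12x64, 3072 = 48x64) and rank (16) below are illustrative assumptions, not HEAT's searched settings; the paper automates exactly these choices with hardware-aware co-optimization.

```python
# Minimal sketch (assumed shapes/rank, not the paper's configuration):
# factorize a 768x3072 feedforward weight into two TT-matrix cores,
# run the layer as one einsum, and optimize the contraction path.
import numpy as np

batch, d_in, d_out = 32, 768, 3072
i1, i2 = 12, 64   # assumed tensorization of the input dim: 768 = 12*64
o1, o2 = 48, 64   # assumed tensorization of the output dim: 3072 = 48*64
rank = 16         # assumed decomposition rank

# Two cores replacing the dense weight W[(i1 i2), (o1 o2)].
core1 = 0.02 * np.random.randn(i1, o1, rank)   # (12, 48, r)
core2 = 0.02 * np.random.randn(rank, i2, o2)   # (r, 64, 64)

dense_params = d_in * d_out
factored_params = core1.size + core2.size
print(f"compression: {dense_params / factored_params:.1f}x "
      f"({dense_params} -> {factored_params} parameters)")

# Forward pass as a fused einsum over the factorized weight.
x = np.random.randn(batch, d_in).reshape(batch, i1, i2)
y = np.einsum('bij,iar,rjc->bac', x, core1, core2).reshape(batch, d_out)
print("output shape:", y.shape)

# Contraction path optimization: the order in which the three operands
# are contracted changes the FLOP count; einsum_path searches for a
# cheap ordering and reports the estimated speedup over a naive one.
path, report = np.einsum_path('bij,iar,rjc->bac', x, core1, core2,
                              optimize='optimal')
print(report)
```

With these illustrative numbers the factorized layer stores roughly 31x fewer parameters than the dense weight; whether that translates into real latency and energy savings depends on the contraction order and mapping, which is the gap the abstract says HEAT closes on hardware.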