Paper Title

Real-Time Execution of Large-scale Language Models on Mobile

Paper Authors

Wei Niu, Zhenglun Kong, Geng Yuan, Weiwen Jiang, Jiexiong Guan, Caiwen Ding, Pu Zhao, Sijia Liu, Bin Ren, Yanzhi Wang

Paper Abstract

Pre-trained large-scale language models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. However, the limited weight storage and computational speed on hardware platforms have impeded the popularity of pre-trained models, especially in the era of edge computing. In this paper, we seek to find the best model structure of BERT for a given computation size to match specific devices. We propose the first compiler-aware neural architecture optimization framework. Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices, thus achieving real-time execution of large transformer-based models like BERT variants. We evaluate our model on several NLP tasks, achieving competitive results on well-known benchmarks with lower latency on mobile devices. Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base. Our overall framework achieves up to 7.8x speedup compared with TensorFlow-Lite with only minor accuracy loss.
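The compiler-aware search the abstract describes can be pictured as latency-constrained model selection: among BERT-like configurations, pick the most accurate one whose predicted on-device latency fits a budget. Below is a minimal, hypothetical Python sketch of that idea. The latency table, the accuracy proxy, and the tiny search space are all illustrative assumptions, not the authors' framework, which profiles compiler-optimized kernels on the target device and evaluates real BERT variants.

```python
# Minimal sketch (not the paper's code) of latency-constrained
# architecture selection over BERT-like configurations.
import itertools

# Hypothetical per-layer latency estimates (ms) on a target mobile device,
# keyed by hidden size; a real system would profile compiled kernels.
LAYER_LATENCY_MS = {256: 1.1, 512: 2.4, 768: 4.0}

def accuracy_proxy(num_layers: int, hidden: int) -> float:
    # Placeholder score that favors larger models; the real framework
    # would fine-tune and evaluate each candidate on the NLP task.
    return num_layers * hidden / (12 * 768)

def estimated_latency_ms(num_layers: int, hidden: int) -> float:
    # Assume latency grows linearly with depth for a fixed hidden size.
    return num_layers * LAYER_LATENCY_MS[hidden]

def search(budget_ms: float):
    """Exhaustively search a tiny space for the highest-scoring
    configuration whose estimated latency fits the budget."""
    best = None
    for num_layers, hidden in itertools.product(range(4, 13), LAYER_LATENCY_MS):
        if estimated_latency_ms(num_layers, hidden) <= budget_ms:
            cand = (accuracy_proxy(num_layers, hidden), num_layers, hidden)
            best = cand if best is None else max(best, cand)
    return best

if __name__ == "__main__":
    score, layers, hidden = search(budget_ms=20.0)
    print(f"best config under budget: {layers} layers, hidden={hidden}, proxy={score:.2f}")
```

Because the latency constraint is checked before a candidate is ever scored, any configuration the search returns meets the real-time budget by construction, which mirrors the abstract's guarantee that the identified model satisfies the device's resource and latency specifications.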
