Paper Title
TangoBERT: Reducing Inference Cost by using Cascaded Architecture
Paper Authors
Paper Abstract
The remarkable success of large transformer-based models such as BERT, RoBERTa and XLNet in many NLP tasks comes with a large increase in monetary and environmental cost due to their high computational load and energy consumption. In order to reduce this computational load at inference time, we present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient but less accurate first tier model, and only part of those instances are additionally processed by a less efficient but more accurate second tier model. The decision of whether to apply the second tier model is based on a confidence score produced by the first tier model. Our simple method has several appealing practical advantages compared to standard cascading approaches based on multi-layered transformer models. First, it enables higher speedup gains (lower average latency). Second, it takes advantage of batch size optimization for cascading, which increases the relative inference cost reduction. We report TangoBERT inference CPU speedup on four text classification GLUE tasks and on one reading comprehension task. Experimental results show that TangoBERT outperforms efficient early exit baseline models; on the SST-2 task, it achieves an accuracy of 93.9% with a CPU speedup of 8.2x.
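To make the confidence-gated cascade concrete, below is a minimal Python sketch of the inference loop the abstract describes: a fast first tier model classifies every instance, and only instances whose confidence falls below a threshold are re-routed to a slower, more accurate second tier model. The specific model checkpoints and the threshold value are illustrative assumptions, not the paper's actual configuration.

from transformers import pipeline

# Tier 1: small, fast model; Tier 2: large, accurate model.
# Both checkpoints are assumptions for illustration only.
fast_clf = pipeline("sentiment-analysis",
                    model="distilbert-base-uncased-finetuned-sst-2-english")
accurate_clf = pipeline("sentiment-analysis",
                        model="textattack/roberta-base-SST-2")

CONFIDENCE_THRESHOLD = 0.95  # assumed value; tuned on a dev set in practice

def cascaded_predict(texts):
    results = []    # (index, label) pairs resolved by tier 1
    deferred = []   # (index, text) pairs sent on to tier 2
    for i, out in enumerate(fast_clf(texts)):
        if out["score"] >= CONFIDENCE_THRESHOLD:
            results.append((i, out["label"]))
        else:
            deferred.append((i, texts[i]))
    # Only low-confidence instances pay the cost of the large model,
    # and they can be processed together as one batch.
    if deferred:
        idxs, hard_texts = zip(*deferred)
        for i, out in zip(idxs, accurate_clf(list(hard_texts))):
            results.append((i, out["label"]))
    results.sort()
    return [label for _, label in results]

Under this scheme, the overall speedup depends on how many instances the first tier resolves on its own; raising the threshold trades speed for accuracy, which is why the threshold would be tuned on held-out data.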