Paper Title
MixKD: Towards Efficient Distillation of Large-scale Language Models
Paper Authors
Paper Abstract
Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (both memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such big models. However, large-scale neural network systems are prone to memorize training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly. Moreover, the student model has few opportunities to request useful information from the teacher model when there is limited task-specific data available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher's behavior on the linear interpolation of example pairs as well. We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over the standard KD training, and outperforms several competitive baselines. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
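To make the abstract's description concrete, below is a minimal sketch of one MixKD-style training step, assuming PyTorch and hypothetical `student`, `teacher`, and `embed` modules (none of these names come from the paper). It illustrates the core idea stated above: interpolate pairs of examples, then have the student match the teacher's soft predictions on both the original and the interpolated inputs. This is an illustrative sketch under those assumptions, not the authors' reference implementation.

```python
# Illustrative MixKD-like step (not the authors' code). Assumes PyTorch models
# `student` and `teacher` that map embedded inputs to logits, and an `embed`
# module turning token ids into continuous vectors so that interpolation is
# well-defined for text.
import torch
import torch.nn.functional as F


def mixkd_step(student, teacher, embed, ids_a, ids_b, labels_a, labels_b,
               alpha=0.4, temperature=2.0, ce_weight=0.5):
    """One training step: KD on original examples plus KD on mixup-interpolated ones."""
    # Sample the interpolation coefficient from a Beta distribution, as in mixup.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()

    emb_a, emb_b = embed(ids_a), embed(ids_b)
    emb_mix = lam * emb_a + (1.0 - lam) * emb_b           # interpolated inputs

    with torch.no_grad():                                  # teacher is frozen
        t_logits_a = teacher(emb_a)
        t_logits_mix = teacher(emb_mix)

    s_logits_a = student(emb_a)
    s_logits_mix = student(emb_mix)

    def kd_loss(s_logits, t_logits):
        # Soft-label distillation: KL divergence between temperature-scaled distributions.
        return F.kl_div(
            F.log_softmax(s_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2

    # Supervised terms: cross-entropy on the original labels, plus the usual
    # mixup label interpolation for the mixed inputs.
    ce_orig = F.cross_entropy(s_logits_a, labels_a)
    ce_mix = (lam * F.cross_entropy(s_logits_mix, labels_a)
              + (1.0 - lam) * F.cross_entropy(s_logits_mix, labels_b))

    # Total loss: supervised terms plus distillation on both original and
    # interpolated examples.
    return (ce_weight * (ce_orig + ce_mix)
            + kd_loss(s_logits_a, t_logits_a)
            + kd_loss(s_logits_mix, t_logits_mix))
```

Interpolating at the embedding level (rather than over discrete tokens) is one way to keep the "linear interpolation of example pairs" well-defined for text inputs; the exact mixing point is an assumption of this sketch.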