Paper Title
Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems
Paper Authors
Paper Abstract
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% to 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
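The abstract describes distilling multi-billion-parameter teacher encoders into much smaller students (17M-170M parameters) but does not spell out the distillation objective. The sketch below is a minimal, generic illustration of teacher-student distillation for a classification head such as intent classification, assuming a standard soft-target setup (KL divergence on temperature-scaled logits blended with hard-label cross-entropy). The temperature, loss weighting, and the `student`/`teacher` modules are illustrative assumptions, not the paper's actual recipe.

```python
# Generic teacher-student distillation sketch (PyTorch).
# Assumptions: `student` and `teacher` are nn.Modules that map input_ids to
# class logits; hyperparameters are placeholders, not the paper's values.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL loss (teacher -> student) with hard-label cross-entropy."""
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


def train_step(student, teacher, batch, optimizer):
    """One distillation step: the frozen teacher provides soft targets for the student."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"])
    student_logits = student(batch["input_ids"])
    loss = distillation_loss(student_logits, teacher_logits, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, encoder distillation pipelines often also match intermediate hidden states or attention distributions rather than only output logits; the abstract does not specify which of these the authors use, so the above should be read only as a schematic of the general technique.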