Paper Title
Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems
Paper Authors
Paper Abstract
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% to 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
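The abstract describes distilling multi-billion-parameter teacher encoders into much smaller students (17M-170M parameters) but does not spell out the distillation objective. The sketch below is a minimal, generic illustration of teacher-student distillation for a classification head such as intent classification, assuming a standard soft-target setup (KL divergence on temperature-scaled logits blended with hard-label cross-entropy). The temperature, loss weighting, and the `student`/`teacher` modules are illustrative assumptions, not the paper's actual recipe.

```python
# Generic teacher-student distillation sketch (PyTorch).
# Assumptions: `student` and `teacher` are nn.Modules that map input_ids to
# class logits; hyperparameters are placeholders, not the paper's values.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL loss (teacher -> student) with hard-label cross-entropy."""
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss


def train_step(student, teacher, batch, optimizer):
    """One distillation step: the frozen teacher provides soft targets for the student."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"])
    student_logits = student(batch["input_ids"])
    loss = distillation_loss(student_logits, teacher_logits, batch["labels"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, encoder distillation pipelines often also match intermediate hidden states or attention distributions rather than only output logits; the abstract does not specify which of these the authors use, so the above should be read only as a schematic of the general technique.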