Paper Title

Two-stage Textual Knowledge Distillation for End-to-End Spoken Language Understanding

Paper Authors

Seongbin Kim, Gyuwan Kim, Seongjin Shin, Sangmin Lee

Paper Abstract

End-to-end approaches open a new way for more accurate and efficient spoken language understanding (SLU) systems by alleviating the drawbacks of traditional pipeline systems. Previous works exploit textual information for an SLU model via pre-training with automatic speech recognition or fine-tuning with knowledge distillation. To utilize textual information more effectively, this work proposes a two-stage textual knowledge distillation method that matches utterance-level representations and predicted logits of two modalities during pre-training and fine-tuning, sequentially. We use vq-wav2vec BERT as a speech encoder because it captures general and rich features. Furthermore, we improve the performance, especially in a low-resource scenario, with data augmentation methods by randomly masking spans of discrete audio tokens and contextualized hidden representations. Consequently, we push the state-of-the-art on the Fluent Speech Commands, achieving 99.7% test accuracy in the full dataset setting and 99.5% in the 10% subset setting. Throughout the ablation studies, we empirically verify that all used methods are crucial to the final performance, providing the best practice for spoken language understanding. Code is available at https://github.com/clovaai/textual-kd-slu.
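
The following is a minimal PyTorch sketch of the two distillation objectives and the span-masking augmentation described in the abstract. The function names, pooling strategy, distance function, temperature, and masking hyperparameters are illustrative assumptions, not the authors' released implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn.functional as F


def representation_kd_loss(speech_repr, text_repr):
    """Stage 1 (pre-training): match utterance-level representations.

    speech_repr: (batch, dim) pooled output of the speech encoder (vq-wav2vec BERT).
    text_repr:   (batch, dim) pooled output of a text teacher.
    The pooling strategy and the L2 distance are assumptions of this sketch.
    """
    return F.mse_loss(speech_repr, text_repr.detach())


def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Stage 2 (fine-tuning): match predicted logits of the two modalities.

    Standard soft-label distillation; the temperature value is illustrative.
    During fine-tuning this term would be combined with the usual
    cross-entropy on intent labels, e.g. loss = ce + alpha * kd.
    """
    soft_targets = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2


def mask_spans(tokens, mask_id, mask_prob=0.15, span_len=10):
    """Augmentation: randomly mask contiguous spans of discrete audio tokens.

    tokens: (batch, seq_len) LongTensor of vq-wav2vec codebook indices.
    mask_prob, span_len, and the span-sampling scheme are assumptions; the same
    idea applies to masking spans of contextualized hidden representations.
    """
    tokens = tokens.clone()
    batch, seq_len = tokens.shape
    for b in range(batch):
        num_spans = max(1, int(seq_len * mask_prob / span_len))
        for _ in range(num_spans):
            start = torch.randint(0, max(1, seq_len - span_len), (1,)).item()
            tokens[b, start:start + span_len] = mask_id
    return tokens
```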
