Paper Title

Transfer Learning Approaches for Streaming End-to-End Speech Recognition System

Authors

Vikas Joshi, Rui Zhao, Rupesh R. Mehta, Kshitiz Kumar, Jinyu Li

Abstract

Transfer learning (TL) is widely used in conventional hybrid automatic speech recognition (ASR) systems to transfer knowledge from a source language to a target language. TL can be applied to end-to-end (E2E) ASR systems such as recurrent neural network transducer (RNN-T) models by initializing the encoder and/or prediction network of the target language with pre-trained models from the source language. In hybrid ASR systems, transfer learning is typically done by initializing the target-language acoustic model (AM) with the source-language AM. In the RNN-T framework, several transfer learning strategies exist, depending on the choice of initialization model for the encoder and prediction networks. This paper presents a comparative study of four different TL methods for the RNN-T framework. We show a 17% relative word error rate reduction with different TL methods over a randomly initialized RNN-T model. We also study the impact of TL with training data ranging from 50 hours to 1000 hours and show the efficacy of TL for languages with small amounts of training data.
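The four TL strategies the abstract refers to correspond to the choice of initializing the encoder and/or the prediction network from the source-language model, with the remaining parameters initialized randomly. A minimal sketch in plain Python (parameters modeled as dicts of lists; all names and shapes are illustrative assumptions, not the authors' implementation):

```python
import random

def random_init(shape_spec):
    """Randomly initialize a parameter set (stand-in for real tensors)."""
    return {name: [random.gauss(0.0, 0.01) for _ in range(n)]
            for name, n in shape_spec.items()}

def init_rnnt(source_model, shape_spec, init_encoder, init_prediction):
    """Build a target-language RNN-T parameter dict.

    init_encoder / init_prediction select whether each sub-network is
    copied from the source-language model or randomly initialized.
    The joint network is always random here, since the target-language
    output vocabulary generally differs from the source.
    """
    target = {}
    target["encoder"] = (dict(source_model["encoder"]) if init_encoder
                         else random_init(shape_spec["encoder"]))
    target["prediction"] = (dict(source_model["prediction"]) if init_prediction
                            else random_init(shape_spec["prediction"]))
    target["joint"] = random_init(shape_spec["joint"])
    return target

# The four strategies compared in the paper map onto the
# (init_encoder, init_prediction) flag combinations:
strategies = {
    "random_baseline":        (False, False),
    "encoder_only":           (True,  False),
    "prediction_only":        (False, True),
    "encoder_and_prediction": (True,  True),
}
```

In a real system the same idea is usually realized by partially loading a source-language checkpoint into the target model before training on target-language data.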
