Paper Title

Improving RNN Transducer Based ASR with Auxiliary Tasks

Authors

Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, Geoffrey Zweig

Abstract

End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results compared to conventional hybrid speech recognizers. Specifically, the recurrent neural network transducer (RNN-T) has shown competitive ASR performance on various benchmarks. In this work, we examine ways in which RNN-T can achieve better ASR accuracy by performing auxiliary tasks. We propose (i) using the same auxiliary task as the primary RNN-T ASR task, and (ii) performing context-dependent graphemic state prediction as in conventional hybrid modeling. In transcribing social media videos with varying training data sizes, we first evaluate streaming ASR performance on three languages: Romanian, Turkish, and German. We find that both proposed methods provide consistent improvements. Next, we observe that both auxiliary tasks are effective in learning deep transformer encoders with the RNN-T criterion, achieving results (2.0%/4.2% WER on LibriSpeech test-clean/other) competitive with prior top-performing models.
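
To make the multi-task setup concrete, below is a minimal sketch (not the authors' implementation) of how auxiliary task (ii) can be combined with the primary RNN-T objective: an intermediate encoder layer feeds a frame-level classifier over context-dependent graphemic states, and its cross-entropy loss is added to the transducer loss with a tunable weight. All tensor shapes, the auxiliary weight of 0.3, and the dummy data are illustrative assumptions.

```python
# Hypothetical sketch: primary RNN-T loss + auxiliary frame-level cross-entropy
# on context-dependent graphemic-state targets from an intermediate encoder layer.
import torch
import torch.nn.functional as F
from torchaudio.functional import rnnt_loss

B, T, U, V = 2, 50, 10, 32          # batch, frames, target length, vocab (incl. blank)
NUM_CD_STATES = 128                 # assumed number of context-dependent graphemic states

# Joint-network outputs for the primary RNN-T task: (B, T, U + 1, V).
rnnt_logits = torch.randn(B, T, U + 1, V, requires_grad=True)
targets = torch.randint(1, V, (B, U), dtype=torch.int32)
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

# Auxiliary head applied to an intermediate encoder layer: per-frame logits over
# context-dependent graphemic states, with frame-level alignments as labels.
aux_logits = torch.randn(B, T, NUM_CD_STATES, requires_grad=True)
frame_state_labels = torch.randint(0, NUM_CD_STATES, (B, T))

primary = rnnt_loss(rnnt_logits, targets, logit_lengths, target_lengths, blank=0)
auxiliary = F.cross_entropy(aux_logits.reshape(-1, NUM_CD_STATES),
                            frame_state_labels.reshape(-1))

loss = primary + 0.3 * auxiliary    # weighted sum; the weight is a hyperparameter
loss.backward()
```

The auxiliary head and its loss are used only during training; at inference time only the primary RNN-T branch is kept, so the streaming decoder is unchanged.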
