Paper Title
Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture
Paper Authors
Abstract
Recently, the Transformer has gained success in the automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose a Transformer-based online CTC/attention E2E ASR architecture, which contains a chunk self-attention encoder (chunk-SAE) and a monotonic truncated attention (MTA) based self-attention decoder (SAD). Firstly, the chunk-SAE splits the speech into isolated chunks. To reduce the computational cost and improve the performance, we propose the state reuse chunk-SAE. Secondly, the MTA based SAD truncates the speech features monotonically and performs attention on the truncated features. To support online recognition, we integrate the state reuse chunk-SAE and the MTA based SAD into the online CTC/attention architecture. We evaluate the proposed online models on the HKUST Mandarin ASR benchmark and achieve a 23.66% character error rate (CER) with a 320 ms latency. Our online model yields as little as 0.19% absolute CER degradation compared with the offline baseline, and achieves a significant improvement over our prior work on Long Short-Term Memory (LSTM) based online E2E models.
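To make the chunking idea concrete, the following is a minimal sketch (not the authors' code) of how an utterance might be split into isolated chunks, and how a state reuse variant could prepend cached states from the previous chunk instead of recomputing them. The function names, chunk size, and left-context length are illustrative assumptions, not values from the paper.

```python
def split_into_chunks(frames, chunk_size):
    """Split a frame sequence into non-overlapping (isolated) chunks,
    as in the plain chunk-SAE sketch."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def chunks_with_state_reuse(frames, chunk_size, left_context):
    """For each chunk, prepend up to `left_context` cached frames from the
    previous chunk (the "state reuse" idea), so self-attention in the
    current chunk can see some history without reprocessing it."""
    outputs = []
    cache = []  # states carried over from the previous chunk
    for chunk in split_into_chunks(frames, chunk_size):
        outputs.append(cache + chunk)   # attention input: cache + current chunk
        cache = chunk[-left_context:]   # keep the chunk tail for the next step
    return outputs


if __name__ == "__main__":
    frames = list(range(10))  # stand-in for 10 acoustic frames
    print(split_into_chunks(frames, 4))        # isolated chunks
    print(chunks_with_state_reuse(frames, 4, 2))  # chunks with reused left states
```

In a real encoder the cached entries would be hidden states of the preceding chunk rather than raw frames, which is what saves recomputation; this sketch only shows the indexing pattern.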