Paper Title
Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture
Paper Authors
Abstract
Recently, the Transformer has gained success in the automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose a Transformer-based online CTC/attention E2E ASR architecture, which contains a chunk self-attention encoder (chunk-SAE) and a monotonic truncated attention (MTA) based self-attention decoder (SAD). Firstly, the chunk-SAE splits the speech into isolated chunks. To reduce the computational cost and improve the performance, we propose the state reuse chunk-SAE. Secondly, the MTA based SAD truncates the speech features monotonically and performs attention on the truncated features. To support online recognition, we integrate the state reuse chunk-SAE and the MTA based SAD into the online CTC/attention architecture. We evaluate the proposed online models on the HKUST Mandarin ASR benchmark and achieve a 23.66% character error rate (CER) with a 320 ms latency. Our online model yields as little as 0.19% absolute CER degradation compared with the offline baseline, and achieves a significant improvement over our prior work on Long Short-Term Memory (LSTM) based online E2E models.
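To make the chunking idea concrete, the following is a minimal sketch (not the authors' code) of how an utterance might be split into isolated chunks, and how a state reuse variant could prepend cached states from the previous chunk instead of recomputing them. The function names, chunk size, and left-context length are illustrative assumptions, not values from the paper.

```python
def split_into_chunks(frames, chunk_size):
    """Split a frame sequence into non-overlapping (isolated) chunks,
    as in the plain chunk-SAE sketch."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def chunks_with_state_reuse(frames, chunk_size, left_context):
    """For each chunk, prepend up to `left_context` cached frames from the
    previous chunk (the "state reuse" idea), so self-attention in the
    current chunk can see some history without reprocessing it."""
    outputs = []
    cache = []  # states carried over from the previous chunk
    for chunk in split_into_chunks(frames, chunk_size):
        outputs.append(cache + chunk)   # attention input: cache + current chunk
        cache = chunk[-left_context:]   # keep the chunk tail for the next step
    return outputs


if __name__ == "__main__":
    frames = list(range(10))  # stand-in for 10 acoustic frames
    print(split_into_chunks(frames, 4))        # isolated chunks
    print(chunks_with_state_reuse(frames, 4, 2))  # chunks with reused left states
```

In a real encoder the cached entries would be hidden states of the preceding chunk rather than raw frames, which is what saves recomputation; this sketch only shows the indexing pattern.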