Paper Title

Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition

Authors

Huang, Wenyong, Hu, Wenchao, Yeung, Yu Ting, Chen, Xiao

Abstract

Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR), and requires significantly less training time than RNN-based models. The original Transformer, with its encoder-decoder architecture, is only suitable for offline ASR: it relies on an attention mechanism to learn alignments, and encodes input audio bidirectionally. The high computation cost of Transformer decoding also limits its use in production streaming systems. To make Transformer suitable for streaming ASR, we explore the Transducer framework as a streamable way to learn alignments. For audio encoding, we apply a unidirectional Transformer with interleaved convolution layers. The interleaved convolution layers model future context, which is important for performance. To reduce computation cost, we gradually downsample the acoustic input, also with the interleaved convolution layers. Moreover, we limit the length of the history context in self-attention to maintain a constant computation cost for each decoding step. We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on the LibriSpeech dataset (3.6% WER on test-clean) without external language models. The performance is comparable to previously published streamable Transformer Transducers and strong hybrid streaming ASR systems, and is achieved with a smaller look-ahead window (140 ms), fewer parameters, and a lower frame rate.
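The abstract's constant-cost decoding trick — limiting how far back self-attention can look — can be illustrated with a small attention-mask sketch. This is a minimal numpy illustration, not the paper's implementation; the function name `limited_context_mask` and the history length of 2 frames are illustrative assumptions.

```python
import numpy as np

def limited_context_mask(num_frames: int, history: int) -> np.ndarray:
    """Boolean mask where entry (q, k) is True iff query frame q may attend
    to key frame k: causal (no future frames) and at most `history` past
    frames, so per-step attention cost stays constant as the audio grows."""
    idx = np.arange(num_frames)
    offset = idx[:, None] - idx[None, :]  # query index minus key index
    return (offset >= 0) & (offset <= history)

# Each of the 6 frames may attend to itself and up to 2 previous frames.
mask = limited_context_mask(6, history=2)
```

In a real streaming encoder this mask (or an equivalent sliding key/value cache) bounds the attention computation per decoding step; unidirectionality comes from disallowing `offset < 0`, while the limited future context the abstract mentions is supplied by the interleaved convolution layers rather than by attention.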
