Paper Title

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Paper Authors

Sehoon Kim, Amir Gholami, Albert Shaw, Nicholas Lee, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer

Paper Abstract

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer, which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure, which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules each followed by a feed-forward module, instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to efficiently sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online.
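To make the macro-architecture concrete, below is a minimal PyTorch sketch of the Temporal U-Net idea from the abstract: the middle blocks run at half the temporal rate, so multi-head attention, whose cost grows quadratically with sequence length, becomes roughly 4x cheaper, and the output is then upsampled and merged with a skip connection. The class names, the use of `nn.TransformerEncoderLayer` as a stand-in for the actual Squeezeformer block, and the `repeat_interleave` upsampling are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class DepthwiseDownsample(nn.Module):
    """2x temporal subsampling with a depthwise-separable strided conv
    (an illustrative stand-in for the paper's depthwise down-sampling layer)."""

    def __init__(self, dim: int):
        super().__init__()
        self.depthwise = nn.Conv1d(dim, dim, kernel_size=3, stride=2,
                                   padding=1, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, time, dim) -> (batch, time // 2, dim) for even time
        x = x.transpose(1, 2)
        x = self.pointwise(self.depthwise(x))
        return x.transpose(1, 2)


class TemporalUNet(nn.Module):
    """Skeleton of the Temporal U-Net macro-structure: full-rate blocks,
    a downsampled middle where attention is cheaper, then upsampling
    plus a skip connection from the pre-downsample activations."""

    def __init__(self, dim: int = 256, heads: int = 4,
                 n_pre: int = 2, n_mid: int = 2, n_post: int = 1):
        super().__init__()

        def block() -> nn.Module:
            # Stand-in for a Squeezeformer block (MHSA/conv + feed-forward).
            return nn.TransformerEncoderLayer(
                dim, heads, dim_feedforward=4 * dim, batch_first=True)

        self.pre = nn.ModuleList(block() for _ in range(n_pre))
        self.down = DepthwiseDownsample(dim)
        self.mid = nn.ModuleList(block() for _ in range(n_mid))
        self.post = nn.ModuleList(block() for _ in range(n_post))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.pre:
            x = blk(x)
        skip = x
        x = self.down(x)                   # half the frames: ~4x cheaper MHSA
        for blk in self.mid:
            x = blk(x)
        x = x.repeat_interleave(2, dim=1)  # naive 2x temporal upsampling
        x = x + skip                       # U-Net style skip connection
        for blk in self.post:
            x = blk(x)
        return x


if __name__ == "__main__":
    feats = torch.randn(2, 100, 256)       # (batch, frames, feature dim)
    print(TemporalUNet()(feats).shape)     # torch.Size([2, 100, 256])
```

The sketch only covers the macro-structure; the paper's micro-architecture changes (simplified activations in the convolutional block and removal of redundant Layer Normalization) happen inside each block and are omitted here.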
