Title
Multi-task RNN-T with Semantic Decoder for Streamable Spoken Language Understanding
Authors
Abstract
End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency when compared to traditional cascaded pipelines. Existing E2E SLU models usually follow a two-stage configuration where an Automatic Speech Recognition (ASR) network first predicts a transcript which is then passed to a Natural Language Understanding (NLU) module through an interface to infer semantic labels, such as intent and slot tags. This design, however, does not consider the NLU posterior while making transcript predictions, nor does it correct NLU prediction errors immediately by considering the previously predicted word-pieces. In addition, the NLU model in the two-stage system is not streamable, as it must wait for the audio segments to complete processing, which ultimately impacts the latency of the SLU system. In this work, we propose a streamable multi-task semantic transducer model to address these considerations. Our proposed architecture predicts ASR and NLU labels auto-regressively and uses a semantic decoder to ingest both previously predicted word-pieces and slot tags while aggregating them through a fusion network. Using an industry-scale SLU dataset and the public FSC dataset, we show that the proposed model outperforms the two-stage E2E SLU model on both ASR and NLU metrics.
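The semantic-decoder idea in the abstract can be sketched in a few lines: at each auto-regressive step, the previously predicted word-piece and slot tag are each embedded and then aggregated by a fusion step before feeding the joint network. This is a minimal, pure-Python illustration, not the authors' implementation; the vocabulary sizes, dimensions, and the element-wise-sum fusion are hypothetical stand-ins for the paper's learned fusion network.

```python
import random

random.seed(0)

# Hypothetical sizes; the paper does not specify these here.
WP_VOCAB, SLOT_VOCAB, DIM = 100, 20, 8

def make_embedding(vocab_size, dim):
    # Random embedding table: one dim-sized vector per token id.
    return [[random.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(vocab_size)]

wp_emb = make_embedding(WP_VOCAB, DIM)    # word-piece embeddings
slot_emb = make_embedding(SLOT_VOCAB, DIM)  # slot-tag embeddings

def semantic_decoder_step(prev_wp_id, prev_slot_id):
    """One semantic-decoder step: embed the previously predicted
    word-piece and slot tag, then aggregate them. Element-wise sum
    is used here as a simple placeholder for the fusion network."""
    wp_vec = wp_emb[prev_wp_id]
    slot_vec = slot_emb[prev_slot_id]
    return [w + s for w, s in zip(wp_vec, slot_vec)]

# The fused vector would condition the next joint ASR/NLU prediction.
h = semantic_decoder_step(prev_wp_id=5, prev_slot_id=3)
print(len(h))  # fused representation keeps the decoder dimension
```

In the full transducer, this fused state would be combined with the acoustic encoder output in the joint network, so transcript and slot-tag predictions can inform each other at every step rather than only at a post-hoc NLU stage.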