Paper Title
Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems
Paper Authors
Paper Abstract
End-2-end (E2E) models have become increasingly popular in some ASR tasks because of their performance and advantages. These E2E models directly approximate the posterior distribution of tokens given the acoustic inputs. Consequently, the E2E systems implicitly define a language model (LM) over the output tokens, which makes the exploitation of independently trained language models less straightforward than in conventional ASR systems. This makes it difficult to dynamically adapt an E2E ASR system to contextual profiles for better recognizing special words such as named entities. In this work, we propose a contextual density ratio approach for both training a context-aware E2E model and adapting the language model to named entities. We apply the aforementioned technique to an E2E ASR system, which transcribes doctor and patient conversations, for better adapting the E2E system to the names in the conversations. Our proposed technique achieves a relative improvement of up to 46.5% on the names over an E2E baseline without degrading the overall recognition accuracy of the whole test set. Moreover, it also surpasses a contextual shallow fusion baseline by 22.1% relative.
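To illustrate the general density-ratio idea behind the abstract (not the authors' exact formulation), the sketch below rescores a hypothesis by subtracting the score of a source-domain LM, which approximately cancels the internal LM implicitly learned by the E2E model, and adding the score of a contextual (e.g. names) LM. The function name, the interpolation weights `lam` and `mu`, and the toy log-probabilities are all illustrative assumptions.

```python
def density_ratio_score(log_p_e2e, log_p_source_lm, log_p_context_lm,
                        lam=0.3, mu=0.3):
    """Illustrative density-ratio rescoring of one ASR hypothesis y.

    score(y | x) = log P_e2e(y | x) - lam * log P_source(y) + mu * log P_context(y)

    Subtracting the source-LM term approximately removes the internal LM
    of the E2E model; adding the contextual-LM term biases decoding
    toward in-context words such as patient names.
    """
    return log_p_e2e - lam * log_p_source_lm + mu * log_p_context_lm


# Toy rescoring of two competing hypotheses (made-up log-probabilities):
# (log_p_e2e, log_p_source_lm, log_p_context_lm)
hyps = {
    "hypothesis containing the patient name": (-5.0, -8.0, -2.0),
    "generic competing hypothesis":           (-4.5, -3.0, -9.0),
}
best = max(hyps, key=lambda h: density_ratio_score(*hyps[h]))
print(best)  # the name-bearing hypothesis wins after contextual rescoring
```

In shallow fusion the contextual LM score is simply added to the E2E score; the density-ratio view differs by also subtracting the source-LM term, which is one way to understand the reported gain over the contextual shallow fusion baseline.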