Paper Title

Context-aware Fine-tuning of Self-supervised Speech Models

Paper Authors

Suwon Shon, Felix Wu, Kwangyoun Kim, Prashant Sridhar, Karen Livescu, Shinji Watanabe

Paper Abstract

Self-supervised pre-trained transformers have improved the state of the art on a variety of speech tasks. Due to the quadratic time and space complexity of self-attention, they usually operate at the level of relatively short (e.g., utterance) segments. In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tuning. We attach a context module on top of the last layer of a pre-trained model to encode the whole segment into a context embedding vector which is then used as an additional feature for the final prediction. During the fine-tuning stage, we introduce an auxiliary loss that encourages this context embedding vector to be similar to context vectors of surrounding segments. This allows the model to make predictions without access to these surrounding segments at inference time and requires only a tiny overhead compared to standard fine-tuned models. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: Automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). The results show that context-aware fine-tuning not only outperforms a standard fine-tuning baseline but also rivals a strong context injection baseline that uses neighboring speech segments during inference.
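To make the described setup concrete, below is a minimal PyTorch sketch of one way the context module and auxiliary loss could be wired up. The class name ContextModule, the mean-pooling, the cosine-similarity objective, the dimensions, and the treatment of neighboring segments' context vectors as fixed targets are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of context-aware fine-tuning as described in the abstract.
# All names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextModule(nn.Module):
    """Encodes a segment's frame-level features into one context vector."""

    def __init__(self, hidden_dim: int = 768, context_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, context_dim)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, time, hidden_dim) from the last layer of a
        # pre-trained speech encoder (e.g., a wav2vec 2.0-style transformer).
        pooled = frame_features.mean(dim=1)   # (batch, hidden_dim)
        return self.proj(pooled)              # (batch, context_dim)


def auxiliary_context_loss(current_ctx: torch.Tensor,
                           neighbor_ctx: torch.Tensor) -> torch.Tensor:
    # Encourage the current segment's context embedding to be similar to the
    # context vectors of its surrounding segments. Cosine similarity is an
    # assumption; the abstract only states a similarity-based auxiliary loss.
    return 1.0 - F.cosine_similarity(current_ctx, neighbor_ctx, dim=-1).mean()


if __name__ == "__main__":
    batch, time, hidden = 2, 50, 768
    frames = torch.randn(batch, time, hidden)           # current segment
    neighbor_frames = torch.randn(batch, time, hidden)  # surrounding segment

    ctx_module = ContextModule(hidden_dim=hidden)
    current_ctx = ctx_module(frames)
    with torch.no_grad():  # neighbors act as targets during fine-tuning only
        neighbor_ctx = ctx_module(neighbor_frames)

    # The context embedding serves as an additional feature for the final
    # prediction head; at inference time no neighboring segments are needed.
    features_for_head = torch.cat(
        [frames, current_ctx.unsqueeze(1).expand(-1, time, -1)], dim=-1)

    loss_aux = auxiliary_context_loss(current_ctx, neighbor_ctx)
    print(features_for_head.shape, loss_aux.item())
```

In training, this auxiliary loss would be added to the task loss (e.g., CTC for ASR); at inference only the current segment is processed, which is why the overhead over standard fine-tuning stays small.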
