Paper Title

Keyword localisation in untranscribed speech using visually grounded speech models

Paper Authors

Kayode Olaleye, Dan Oneata, Herman Kamper

Paper Abstract

Keyword localisation is the task of finding where in a speech utterance a given query keyword occurs. We investigate to what extent keyword localisation is possible using a visually grounded speech (VGS) model. VGS models are trained on unlabelled images paired with spoken captions. These models are therefore self-supervised -- trained without any explicit textual label or location information. To obtain training targets, we first tag training images with soft text labels using a pretrained visual classifier with a fixed vocabulary. This enables a VGS model to predict the presence of a written keyword in an utterance, but not its location. We consider four ways to equip VGS models with localisation capabilities. Two of these -- a saliency approach and input masking -- can be applied to an arbitrary prediction model after training, while the other two -- attention and a score aggregation approach -- are incorporated directly into the structure of the model. Masking-based localisation gives some of the best reported localisation scores from a VGS model, with an accuracy of 57% when the system knows that a keyword occurs in an utterance and needs to predict its location. In a setting where localisation is performed after detection, an $F_1$ of 25% is achieved, and in a setting where a keyword spotting ranking pass is first performed, we get a localisation P@10 of 32%. While these scores are modest compared to the idealised setting with unordered bag-of-words supervision (from transcriptions), these models do not receive any textual or location supervision. Further analyses show that these models are limited by the first detection or ranking pass. Moreover, individual keyword localisation performance is correlated with the tagging performance of the visual classifier. We also show qualitatively how and where semantic mistakes occur, e.g., the model locates "surfer" when queried with "ocean".
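Method Sketches

The two best-scoring mechanisms from the abstract can be illustrated with short sketches. First, masking-based localisation: a minimal sketch, assuming a trained VGS model exposed through a hypothetical `keyword_score(features, keyword)` callable that returns the keyword's detection probability. The span whose masking causes the largest drop in that score is taken as the predicted location; the window size and stride here are illustrative, not the paper's settings.

```python
import numpy as np

def localise_by_masking(features, keyword, keyword_score, win=20, stride=5):
    """Slide a mask over the utterance and return the frame span whose
    removal causes the largest drop in the keyword's detection score.

    features: (T, D) array of speech frames.
    keyword_score: hypothetical callable (features, keyword) -> float,
        the detection probability from a trained VGS model.
    """
    base = keyword_score(features, keyword)
    best_span, best_drop = (0, min(win, len(features))), -np.inf
    for start in range(0, max(1, len(features) - win + 1), stride):
        masked = features.copy()
        masked[start:start + win] = 0.0  # zero out (mask) this span
        drop = base - keyword_score(masked, keyword)
        if drop > best_drop:
            best_drop, best_span = drop, (start, start + win)
    return best_span
```

Second, the score aggregation approach builds localisation into the model itself: per-frame keyword logits are max-pooled over time into an utterance-level prediction, so training only needs the utterance-level soft tags from the visual classifier, while the per-frame maximum yields a location at test time. A sketch in PyTorch with illustrative dimensions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ScoreAggregationHead(nn.Module):
    """Per-frame keyword logits max-pooled into utterance-level logits."""

    def __init__(self, feat_dim, vocab_size):
        super().__init__()
        self.frame_scores = nn.Linear(feat_dim, vocab_size)

    def forward(self, frames):                  # frames: (B, T, D)
        logits = self.frame_scores(frames)      # (B, T, V) per-frame logits
        utt_logits, frame_idx = logits.max(1)   # max-pool over time
        return utt_logits, frame_idx            # frame_idx localises each keyword

# Usage sketch: train against the soft tags produced by the pretrained
# visual classifier (random placeholders here); no location labels needed.
head = ScoreAggregationHead(feat_dim=512, vocab_size=1000)
utt_logits, frame_idx = head(torch.randn(4, 200, 512))
soft_tags = torch.rand(4, 1000)  # placeholder for visual tagger outputs
loss = nn.functional.binary_cross_entropy_with_logits(utt_logits, soft_tags)
```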
