Paper Title

Word Discovery in Visually Grounded, Self-Supervised Speech Models

Paper Authors

Puyuan Peng, David Harwath

Paper Abstract

We present a method for visually-grounded spoken term discovery. After training either a HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show that powerful word segmentation and clustering capability emerges within the model's self-attention heads. Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 models, suggesting that the visual grounding task is a crucial component of the word discovery capability we observe. We also evaluate our method on the Buckeye word segmentation and ZeroSpeech spoken term discovery tasks, where we perform on par with or better than currently published methods on several metrics. Code and model weights are available at https://github.com/jasonppy/word-discovery.
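
The abstract notes that word segmentation emerges within the model's self-attention heads. As a rough illustration of how one might inspect those heads, below is a minimal sketch using the HuggingFace transformers API and a plain (non-grounded) wav2vec 2.0 checkpoint. The column-wise attention-mass thresholding here is an illustrative heuristic of ours, not the authors' published procedure; their actual code and the visually grounded model weights are at the URL above.

```python
# Hypothetical sketch (not the authors' released code): inspect self-attention
# heads of a pretrained wav2vec 2.0 model to propose word-like segments.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# 1 second of dummy 16 kHz audio; replace with a real waveform.
waveform = torch.randn(1, 16000)

with torch.no_grad():
    out = model(waveform, output_attentions=True)

# out.attentions: one tensor per layer, shape (batch, heads, frames, frames).
attn = out.attentions[-1][0]  # last layer, first batch item
head = attn[0]                # a single head: (frames, frames)

# Column-wise attention mass: how much total attention each frame receives.
# In the grounded models, heavily attended frames tend to align with words;
# thresholding at the mean is an arbitrary choice for this illustration.
saliency = head.sum(dim=0)
active = (saliency > saliency.mean()).float()

# Transitions between "active" runs become candidate segment boundaries.
boundaries = torch.nonzero(active[1:] != active[:-1]).squeeze(-1) + 1
print("candidate boundary frames:", boundaries.tolist())
```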
