Paper Title

Visual Grounding in Video for Unsupervised Word Translation

Paper Authors

Sigurdsson, Gunnar A., Alayrac, Jean-Baptiste, Nematzadeh, Aida, Smaira, Lucas, Malinowski, Mateusz, Carreira, João, Blunsom, Phil, Zisserman, Andrew

Paper Abstract

There are thousands of actively spoken languages on Earth, but a single visual world. Grounding in this visual world has the potential to bridge the gap between all these languages. Our goal is to use visual grounding to improve unsupervised word mapping between languages. The key idea is to establish a common visual representation between two languages by learning embeddings from unpaired instructional videos narrated in the native language. Given this shared embedding we demonstrate that (i) we can map words between the languages, particularly the 'visual' words; (ii) that the shared embedding provides a good initialization for existing unsupervised text-based word translation techniques, forming the basis for our proposed hybrid visual-text mapping algorithm, MUVE; and (iii) our approach achieves superior performance by addressing the shortcomings of text-based methods -- it is more robust, handles datasets with less commonality, and is applicable to low-resource languages. We apply these methods to translate words from English to French, Korean, and Japanese -- all without any parallel corpora and simply by watching many videos of people speaking while doing things.

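The abstract's claim (i), mapping words between languages via a shared visually grounded embedding, amounts to nearest-neighbour retrieval in that common space. Below is a minimal sketch of that retrieval step, assuming the two vocabularies have already been embedded into the shared space by a video-grounded model; the function name, data layout, and the unit-normalization assumption are illustrative, not the paper's actual MUVE implementation.

```python
import numpy as np

def translate_by_shared_embedding(src_word, src_emb, tgt_emb, tgt_vocab):
    """Translate a source-language word by nearest-neighbour search in a
    shared (visually grounded) embedding space.

    src_emb:   dict mapping source-language words to unit-normalized vectors
    tgt_emb:   (V, D) matrix of unit-normalized target-language word vectors
    tgt_vocab: list of V target-language words, aligned with the rows of tgt_emb
    """
    query = src_emb[src_word]                 # (D,) vector for the query word
    scores = tgt_emb @ query                  # cosine similarities (vectors are unit norm)
    return tgt_vocab[int(np.argmax(scores))]  # highest-scoring target word

# Hypothetical usage, with embeddings produced by a video-grounded model:
# src_emb = {"dog": np.array([...]), ...}
# print(translate_by_shared_embedding("dog", src_emb, tgt_emb, tgt_vocab))
```

In the paper's setting, this retrieval is what the shared embedding enables directly for "visual" words; MUVE then uses it as an initialization for a text-based mapping method rather than as the final translator.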