Paper Title

Seeing the advantage: visually grounding word embeddings to better capture human semantic knowledge

Paper Authors

Danny Merkx, Stefan L. Frank, Mirjam Ernestus

Paper Abstract

Distributional semantic models capture word-level meaning that is useful in many natural language processing tasks and have even been shown to capture cognitive aspects of word meaning. The majority of these models are purely text based, even though the human sensory experience is much richer. In this paper we create visually grounded word embeddings by combining English text and images and compare them to popular text-based methods, to see if visual information allows our model to better capture cognitive aspects of word meaning. Our analysis shows that visually grounded embedding similarities are more predictive of the human reaction times in a large priming experiment than the purely text-based embeddings. The visually grounded embeddings also correlate well with human word similarity ratings. Importantly, in both experiments we show that the grounded embeddings account for a unique portion of explained variance, even when we include text-based embeddings trained on huge corpora. This shows that visual grounding allows our model to capture information that cannot be extracted using text as the only source of information.
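
To make the evaluation concrete, the sketch below shows one standard way to score embeddings against human word similarity ratings, as described in the abstract: compute the cosine similarity between the embeddings of each rated word pair, then correlate those model similarities with the human ratings. This is a minimal illustration, not the authors' code; the embeddings and ratings are made-up placeholders, and SciPy's spearmanr stands in for whatever correlation statistic the paper actually reports.

```python
# Minimal sketch (not the authors' code) of evaluating word embeddings
# against human word similarity ratings. All embeddings and ratings
# below are hypothetical placeholders.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings (word -> vector), e.g. from a grounded model.
emb = {
    "cat": np.array([0.2, 0.8, 0.1]),
    "dog": np.array([0.3, 0.7, 0.2]),
    "car": np.array([0.9, 0.1, 0.4]),
}

# Hypothetical human ratings for word pairs (SimLex-style, 0-10 scale).
pairs = [("cat", "dog", 8.5), ("cat", "car", 1.2), ("dog", "car", 1.5)]

model_sims = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_sims = [rating for _, _, rating in pairs]

# Rank correlation: how well embedding similarities track human judgements.
rho, p = spearmanr(model_sims, human_sims)
print(f"Spearman rho = {rho:.3f}")
```

The abstract's unique-variance claim then roughly corresponds to a hierarchical regression over such predictors: fit reaction times (or ratings) on text-based similarities alone, add the grounded similarities as an extra predictor, and test whether the explained variance increases.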
