Paper Title
Cross-modal Map Learning for Vision and Language Navigation
Paper Authors
Paper Abstract
We consider the problem of Vision-and-Language Navigation (VLN). The majority of current methods for VLN are trained end-to-end, using either unstructured memory such as an LSTM or cross-modal attention over the agent's egocentric observations. In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations. In this work, we propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints. In both cases, the prediction is informed by the language through cross-modal attention mechanisms. We experimentally test the basic hypothesis that language-driven navigation can be solved given a map, and then show competitive results on the full VLN-CE benchmark.
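A minimal sketch of the kind of cross-modal attention the abstract describes, in which features from an egocentric top-down map attend over language token embeddings so that the instruction can inform semantic-map and waypoint prediction. This is an illustrative assumption, not the authors' implementation; the module name, dimensions, and single-head design are hypothetical.

```python
import torch
import torch.nn as nn


class MapLanguageAttention(nn.Module):
    """Cross-modal attention: map cells (queries) attend over language tokens (keys/values)."""

    def __init__(self, map_dim: int = 128, lang_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        self.query = nn.Linear(map_dim, hidden_dim)
        self.key = nn.Linear(lang_dim, hidden_dim)
        self.value = nn.Linear(lang_dim, hidden_dim)
        self.scale = hidden_dim ** -0.5

    def forward(self, map_feats: torch.Tensor, lang_feats: torch.Tensor) -> torch.Tensor:
        # map_feats: (B, C, H, W) egocentric map features
        # lang_feats: (B, T, D) instruction token embeddings
        b, c, h, w = map_feats.shape
        q = self.query(map_feats.flatten(2).transpose(1, 2))   # (B, H*W, hidden)
        k = self.key(lang_feats)                                # (B, T, hidden)
        v = self.value(lang_feats)                              # (B, T, hidden)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, H*W, T)
        fused = attn @ v                                        # (B, H*W, hidden)
        # Fold the fused features back into a spatial grid for downstream
        # semantic-map or waypoint decoding heads.
        return fused.transpose(1, 2).reshape(b, -1, h, w)       # (B, hidden, H, W)


if __name__ == "__main__":
    # Toy shapes: a 64x64 egocentric map and a 20-token instruction.
    map_feats = torch.randn(2, 128, 64, 64)
    lang_feats = torch.randn(2, 20, 256)
    fused = MapLanguageAttention()(map_feats, lang_feats)
    print(fused.shape)  # torch.Size([2, 128, 64, 64])
```

In this sketch, the same fused map-language features could feed two heads, one predicting top-down semantics for observed and unobserved regions and one regressing waypoints toward the goal, mirroring the two prediction stages described in the abstract.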