Paper Title
Visually Grounded Compound PCFGs
Paper Authors
Paper Abstract
Exploiting visual groundings for language understanding has recently been drawing much attention. In this work, we study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual groundings. Existing work on this task (Shi et al., 2019) optimizes a parser via Reinforce and derives the learning signal only from the alignment of images and sentences. While their model is relatively accurate overall, its error distribution is very uneven, with low performance on certain constituent types (e.g., 26.2% recall on verb phrases, VPs) and high on others (e.g., 79.6% recall on noun phrases, NPs). This is not surprising, as the learning signal is likely insufficient for deriving all aspects of phrase-structure syntax and the gradient estimates are noisy. We show that, using an extension of the probabilistic context-free grammar model, we can do fully differentiable end-to-end visually grounded learning. Additionally, this enables us to complement the image-text alignment loss with a language modeling objective. On the MSCOCO test captions, our model establishes a new state of the art, outperforming its non-grounded version and thus confirming the effectiveness of visual groundings in constituency grammar induction. It also substantially outperforms the previous grounded model, with the largest improvements on more 'abstract' categories (e.g., +55.1% recall on VPs).
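To make the contrast with Reinforce-based training concrete, the following is a minimal, self-contained PyTorch sketch (not the authors' implementation) of the kind of differentiable objective the abstract describes: span marginals are obtained by differentiating the log inside score of a toy bracketing model, and the image-text alignment term is an expected hinge loss weighted by those marginals, so gradients flow end to end without sampling trees. The tensor shapes, the 0.5 margin, and the placeholder language-modeling term are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def inside_logZ(span_pot):
    """Log partition over binary bracketings, where a tree's score is the sum of
    the potentials of its spans; span_pot[i, j] scores words i..j-1 (0 <= i < j <= n)."""
    n = span_pot.size(0) - 1
    chart = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(n):                               # width-1 spans (single words)
        chart[i][i + 1] = span_pot[i, i + 1]
    for width in range(2, n + 1):                    # larger spans, bottom-up (CKY-style)
        for i in range(0, n - width + 1):
            j = i + width
            splits = torch.stack([chart[i][k] + chart[k][j] for k in range(i + 1, j)])
            chart[i][j] = span_pot[i, j] + torch.logsumexp(splits, dim=0)
    return chart[0][n]

torch.manual_seed(0)
n, dim = 5, 64                                       # toy sentence length and feature size
span_pot = torch.randn(n + 1, n + 1, requires_grad=True)   # stand-in for grammar scores

# Span marginals as the gradient of the log partition w.r.t. the span potentials
# (standard identity for log-linear models); create_graph keeps them differentiable.
# Entries for invalid spans are simply zero because they never enter the chart.
logZ = inside_logZ(span_pot)
margs, = torch.autograd.grad(logZ, span_pot, create_graph=True)

# Expected hinge alignment loss: every candidate span contributes in proportion to
# its marginal, so the whole objective stays differentiable (no Reinforce needed).
span_feats = torch.randn(n + 1, n + 1, dim)          # stand-in span encodings
img = torch.randn(dim)                               # embedding of the matched image
neg_img = torch.randn(dim)                           # embedding of a mismatched image
pos = F.cosine_similarity(span_feats, img.expand(n + 1, n + 1, dim), dim=-1)
neg = F.cosine_similarity(span_feats, neg_img.expand(n + 1, n + 1, dim), dim=-1)
hinge = torch.clamp(0.5 - pos + neg, min=0.0)        # margin 0.5 is an arbitrary choice
align_loss = (margs * hinge).sum()

# Placeholder scalar: in the paper this slot is filled by the compound PCFG's
# language modeling objective, which is added to the alignment loss.
lm_loss = torch.tensor(0.0)

loss = lm_loss + align_loss                          # joint objective
loss.backward()                                      # gradients reach span_pot end to end
```

In the actual model the marginals come from the grammar's inside pass and the span representations from an image-text matching component; the sketch only illustrates why no score-function (Reinforce) gradient estimator is required once the objective is an expectation under differentiable marginals.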