Paper Title
Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing
Authors
Abstract
This paper presents a framework for jointly grounding objects that follow certain semantic relationship constraints given in a scene graph. A typical natural scene contains several objects, often exhibiting visual relationships of varied complexities between them. These inter-object relationships provide strong contextual cues toward improving grounding performance compared to a traditional object query-only-based localization task. A scene graph is an efficient and structured way to represent all the objects and their semantic relationships in the image. In an attempt towards bridging these two modalities representing scenes and utilizing contextual information for improving object localization, we rigorously study the problem of grounding scene graphs on natural images. To this end, we propose a novel graph neural network-based approach referred to as Visio-Lingual Message PAssing Graph Neural Network (VL-MPAG Net). In VL-MPAG Net, we first construct a directed graph with object proposals as nodes and an edge between a pair of nodes representing a plausible relation between them. Then a three-step inter-graph and intra-graph message passing is performed to learn the context-dependent representation of the proposals and query objects. These object representations are used to score the proposals to generate object localization. The proposed method significantly outperforms the baselines on four public datasets.
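The pipeline outlined in the abstract (proposal graph, three message-passing steps, proposal scoring) can be illustrated with a small toy sketch. This is not the VL-MPAG Net implementation: the abstract does not specify the update rules, so everything below is an assumption made for illustration. In particular, the "plausible relation" test is replaced by a hypothetical center-distance threshold, the messages are plain attention-weighted averages with no learned parameters, and the names build_proposal_graph, attend, aggregate_neighbors, and score_proposals are invented here.

```python
import math

def build_proposal_graph(boxes, max_center_dist):
    """Return directed edges (i, j) between distinct proposals.

    Assumption: a relation counts as "plausible" when the box centers are
    within max_center_dist of each other. Boxes are (x1, y1, x2, y2).
    """
    centers = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]
    edges = set()
    for i, (xi, yi) in enumerate(centers):
        for j, (xj, yj) in enumerate(centers):
            if i != j and math.hypot(xi - xj, yi - yj) <= max_center_dist:
                edges.add((i, j))
    return edges

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(targets, sources):
    """Add to each target an attention-weighted average of the sources."""
    out = []
    for t in targets:
        weights = softmax([dot(t, s) for s in sources])
        msg = [sum(w * s[k] for w, s in zip(weights, sources))
               for k in range(len(t))]
        out.append([a + b for a, b in zip(t, msg)])
    return out

def aggregate_neighbors(feats, edges):
    """Add to each node the mean of the features on its incoming edges."""
    out = []
    for j, f in enumerate(feats):
        incoming = [feats[i] for (i, dst) in edges if dst == j]
        if incoming:
            mean = [sum(col) / len(incoming) for col in zip(*incoming)]
            f = [a + b for a, b in zip(f, mean)]
        out.append(list(f))
    return out

def score_proposals(prop_feats, query_feats, edges):
    """Return a score matrix with one row per query object, one column per proposal."""
    # Step 1 (inter-graph): query nodes send messages to proposals.
    props = attend(prop_feats, query_feats)
    # Step 2 (intra-graph): proposals exchange messages along graph edges.
    props = aggregate_neighbors(props, edges)
    # Step 3 (inter-graph): proposals send messages back to query nodes.
    queries = attend(query_feats, props)
    return [[dot(q, p) for p in props] for q in queries]

# Toy inputs: three proposals, two query objects, 2-d features (all made up).
boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (100, 100, 110, 110)]
edges = build_proposal_graph(boxes, max_center_dist=20.0)
prop_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_feats = [[1.0, 0.0], [0.0, 1.0]]
scores = score_proposals(prop_feats, query_feats, edges)
# For each query object, pick the index of the highest-scoring proposal.
best = [max(range(len(row)), key=row.__getitem__) for row in scores]
```

A real system would use learned visual and language embeddings and trained message functions; the sketch only shows how the context from both graphs can flow into the final proposal scores.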