Paper Title
SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for Scene Graph Generation
Paper Authors
Paper Abstract
Objects in a scene are not always related. One-stage scene graph generation approaches are highly efficient: they infer effective relations between entity pairs using sparse proposal sets and a few queries. However, they focus only on the relation between subject and object in the triplet ⟨subject entity, predicate entity, object entity⟩, ignoring the relations between subject and predicate and between predicate and object, so the model lacks self-reasoning ability. In addition, the linguistic modality has been neglected in one-stage methods; mining linguistic knowledge is necessary to improve the model's reasoning ability. To address these shortcomings, a Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to equip the model with flexible self-reasoning ability. SrTR adopts an encoder-decoder architecture and develops a self-reasoning decoder that completes three inferences over the triplet: s+o→p, s+p→o, and p+o→s. Inspired by large-scale pre-trained image-text foundation models, visual-linguistic prior knowledge is introduced and a visual-linguistic alignment strategy is designed to project visual representations into a semantic space with prior knowledge to aid relational reasoning. Experiments on the Visual Genome dataset demonstrate the superiority and fast inference of the proposed method.
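
The abstract does not specify how the three inferences are realized; below is a minimal sketch of one plausible reading, in which each inference fuses the two known role embeddings to re-estimate the third. All module names, the fusion MLPs, and dimensions are illustrative assumptions, not the authors' design.

```python
# Minimal sketch (not the authors' code) of a self-reasoning head
# performing the three triplet inferences: s+o->p, s+p->o, p+o->s.
import torch
import torch.nn as nn

class SelfReasoningHead(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # One fusion MLP per inference direction: the two known roles
        # are concatenated and mapped back into the embedding space.
        def mlp():
            return nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.infer_p = mlp()
        self.infer_o = mlp()
        self.infer_s = mlp()

    def forward(self, s, p, o):
        # s, p, o: (num_queries, dim) subject / predicate / object
        # embeddings produced by the decoder queries.
        p_hat = self.infer_p(torch.cat([s, o], dim=-1))  # s + o -> p
        o_hat = self.infer_o(torch.cat([s, p], dim=-1))  # s + p -> o
        s_hat = self.infer_s(torch.cat([p, o], dim=-1))  # p + o -> s
        return s_hat, p_hat, o_hat
```

The re-estimated embeddings could then be supervised against, or fused with, the original role embeddings so that each role is consistent with the other two.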
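
Likewise, one common way to realize the described visual-linguistic alignment is to project visual features into the embedding space of a frozen text encoder (e.g., CLIP-style embeddings of category names) and classify by cosine similarity. The sketch below illustrates that pattern; the projection layer, dimensions, and loss are assumptions, not necessarily the paper's exact strategy.

```python
# Minimal sketch, assuming frozen text embeddings of entity/predicate
# category names act as the visual-linguistic prior knowledge.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualLinguisticAlignment(nn.Module):
    def __init__(self, visual_dim: int = 256, text_dim: int = 512):
        super().__init__()
        # Project visual representations into the semantic (text) space.
        self.proj = nn.Linear(visual_dim, text_dim)

    def forward(self, visual_feats, text_embeds):
        # visual_feats: (N, visual_dim) features of detected entities
        #               or predicates.
        # text_embeds:  (C, text_dim) frozen text embeddings of the
        #               C category names (the linguistic prior).
        v = F.normalize(self.proj(visual_feats), dim=-1)
        t = F.normalize(text_embeds, dim=-1)
        logits = v @ t.t()  # cosine similarity to each category prompt
        return logits       # e.g., trained with cross-entropy on labels
```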