Paper Title

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding

Paper Authors

Yanmin Wu, Xinhua Cheng, Renrui Zhang, Zesen Cheng, Jian Zhang

Paper Abstract

3D visual grounding aims to find the object within point clouds mentioned by free-form natural language descriptions with rich semantic cues. However, existing methods either extract the sentence-level features coupling all words or focus more on object names, which would lose the word-level information or neglect other attributes. To alleviate these issues, we present EDA that Explicitly Decouples the textual attributes in a sentence and conducts Dense Alignment between such fine-grained language and point cloud objects. Specifically, we first propose a text decoupling module to produce textual features for every semantic component. Then, we design two losses to supervise the dense matching between two modalities: position alignment loss and semantic alignment loss. On top of that, we further introduce a new visual grounding task, locating objects without object names, which can thoroughly evaluate the model's dense alignment capacity. Through experiments, we achieve state-of-the-art performance on two widely-adopted 3D visual grounding datasets, ScanRefer and SR3D/NR3D, and obtain absolute leadership on our newly-proposed task. The source code is available at https://github.com/yanmin-wu/EDA.
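
The abstract only names the two supervision signals, so the following PyTorch sketch illustrates one plausible form of dense text-object alignment. All names (`alignment_losses`, `text_feats`, `obj_feats`, `gt_idx`, `tau`), tensor shapes, and the exact loss formulations are assumptions for illustration, not the authors' implementation.

```python
# A minimal, hypothetical sketch of dense alignment between decoupled text
# components and candidate point-cloud objects (not the authors' code).
import torch
import torch.nn.functional as F

def alignment_losses(text_feats, obj_feats, gt_idx, tau=0.07):
    """text_feats: (C, D) features of C decoupled textual components
                   (e.g., object name, attributes, relations), all
                   describing the same target object.
       obj_feats:  (K, D) features of K candidate point-cloud objects.
       gt_idx:     index of the ground-truth object among the K candidates."""
    t = F.normalize(text_feats, dim=-1)
    o = F.normalize(obj_feats, dim=-1)
    logits = t @ o.t() / tau                       # (C, K) similarity logits

    # "Position alignment" (assumed form): each decoupled text component
    # should select the ground-truth object out of the K candidates.
    targets = torch.full((t.size(0),), gt_idx, dtype=torch.long)
    position_loss = F.cross_entropy(logits, targets)

    # "Semantic alignment" (assumed form): symmetrically, the ground-truth
    # object's feature should match all of its describing text components
    # rather than unrelated ones (multi-positive contrastive term).
    obj_logits = logits.t()                        # (K, C)
    semantic_loss = -F.log_softmax(obj_logits[gt_idx], dim=-1).mean()

    return position_loss, semantic_loss
```

In a full grounding pipeline, such terms would presumably be combined with standard detection losses over the candidate objects; they are shown here in isolation only to make the two-loss structure of the abstract concrete.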
