Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection

Paper title

Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection

Authors

Zeng, Chao; Kwong, Sam

Abstract

Salient object detection (SOD) is the task of predicting the regions of a scene that attract human attention. Fusing depth information has proven effective for this task, and the main challenge is how to aggregate the complementary information from the RGB and depth modalities. However, conventional deep models rely heavily on CNN feature extractors and usually ignore long-range contextual dependencies. In this work, we propose the Dual Swin-Transformer based Mutual Interactive Network (DTMINet). We adopt Swin-Transformer as the feature extractor for both the RGB and depth modalities to model the long-range dependencies in visual inputs. Before the two feature branches are fused into one, attention-based modules enhance the features from each modality. We design a self-attention-based cross-modality interaction module and a gated modality attention module to leverage the complementary information between the two modalities. For saliency decoding, we create different stages enhanced with dense connections and keep a decoding memory while considering the multi-level encoding features simultaneously. To address inaccurate depth maps, we collect the RGB features of early stages into a skip convolution module to give more guidance from the RGB modality to the final saliency prediction. In addition, we add edge supervision to regularize the feature learning process. Comprehensive experiments on five standard RGB-D SOD benchmark datasets over four evaluation metrics demonstrate the superiority of the proposed DTMINet method.
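The abstract names two fusion ideas, self-attention-based cross-modality interaction and gated modality attention, without giving their equations. The following is only a minimal sketch of the general pattern, not the paper's modules: it assumes plain scaled dot-product attention with no learned projections, a residual add, and a scalar sigmoid gate per token. All function names, the toy token values, and the gate parameterization are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    # Each query token (one modality) attends over all tokens of the
    # other modality using scaled dot-product attention.
    # Assumption: keys and values are the same tokens, no projections.
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, keys_values))
                    for i in range(dim)])
    return out

def add_residual(tokens, updates):
    # Residual connection: original token plus attended context.
    return [[a + b for a, b in zip(t, u)] for t, u in zip(tokens, updates)]

def gated_fusion(rgb, depth, gate_weights, gate_bias=0.0):
    # Hypothetical gate: a scalar g in (0, 1) per token, computed from the
    # concatenated RGB and depth features; fused = g*rgb + (1-g)*depth.
    fused = []
    for r, d in zip(rgb, depth):
        logit = sum(w * x for w, x in zip(gate_weights, r + d)) + gate_bias
        g = 1.0 / (1.0 + math.exp(-logit))
        fused.append([g * a + (1.0 - g) * b for a, b in zip(r, d)])
    return fused

# Toy 2-token, 2-dimensional features standing in for encoder outputs.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0]]
depth_tokens = [[0.5, 0.5], [1.0, -1.0]]

# Mutual interaction: each modality is enhanced with context from the other.
rgb_enhanced = add_residual(rgb_tokens, cross_attention(rgb_tokens, depth_tokens))
depth_enhanced = add_residual(depth_tokens, cross_attention(depth_tokens, rgb_tokens))

# Zero gate weights give g = 0.5, i.e. a plain average of the two branches.
fused = gated_fusion(rgb_enhanced, depth_enhanced, gate_weights=[0.0] * 4)
```

In a trained model the gate weights would be learned, which lets the network lean on the RGB branch wherever the depth map is unreliable; the zero weights here only make the behavior easy to check by hand.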
