Paper title
Framework-agnostic Semantically-aware Global Reasoning for Segmentation
Paper authors
Paper abstract
Recent advances in pixel-level tasks (e.g., segmentation) illustrate the benefit of long-range interactions between aggregated region-based representations that can enhance local features. However, such aggregated representations, often in the form of attention, fail to model the underlying semantics of the scene (e.g., individual objects and, by extension, their interactions). In this work, we address the issue by proposing a component that learns to project image features into latent representations and reason between them using a transformer encoder, generating contextualized and scene-consistent representations that are fused with the original image features. Our design encourages the latent regions to represent semantic concepts by ensuring that the activated regions are spatially disjoint and that the union of such regions corresponds to a connected object segment. The proposed semantic global reasoning (SGR) component is end-to-end trainable and can be easily added to a wide variety of backbones (CNN or transformer-based) and segmentation heads (per-pixel or mask classification) to consistently improve segmentation results on different datasets. In addition, our latent tokens are semantically interpretable and diverse, and they provide a rich set of features that can be transferred to downstream tasks such as object detection and segmentation, with improved performance. Furthermore, we propose metrics to quantify the semantics of latent tokens at both the class and instance levels.
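The project, reason, and fuse steps described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation, and everything specific in it is an assumption made for illustration: the `queries` vectors stand in for learned projection parameters, a single parameter-free self-attention step stands in for the transformer encoder, fusion is a simple addition, and `overlap` is a hypothetical way to check how spatially disjoint two tokens' regions are (it is not one of the paper's metrics).

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(features, queries):
    """Soft-assign each pixel to K latent tokens, then pool features per token.

    features: N pixel vectors of dimension D.
    queries:  K vectors of dimension D (hypothetical learned parameters).
    Returns (assign, tokens); each row assign[n] sums to 1 over the K tokens.
    """
    assign = [softmax([dot(f, q) for q in queries]) for f in features]
    dim = len(features[0])
    tokens = []
    for k in range(len(queries)):
        mass = sum(a[k] for a in assign)  # always > 0 because softmax is positive
        tokens.append([sum(a[k] * f[d] for a, f in zip(assign, features)) / mass
                       for d in range(dim)])
    return assign, tokens

def reason(tokens):
    """One parameter-free self-attention step (assumed stand-in for the encoder)."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for t in tokens:
        w = softmax([dot(t, u) / scale for u in tokens])
        mixed = [sum(wj * u[d] for wj, u in zip(w, tokens)) for d in range(len(t))]
        out.append([a + b for a, b in zip(t, mixed)])  # residual connection
    return out

def fuse(features, assign, tokens):
    """Map tokens back to pixels with the same assignment and add to the features."""
    fused = []
    for f, a in zip(features, assign):
        ctx = [sum(a[k] * tokens[k][d] for k in range(len(tokens)))
               for d in range(len(f))]
        fused.append([x + c for x, c in zip(f, ctx)])
    return fused

def overlap(assign, i, j):
    """Sum over pixels of the product of two tokens' weights; near 0 means disjoint."""
    return sum(a[i] * a[j] for a in assign)

# Four toy "pixels" forming two clearly separated groups, and two latent tokens.
features = [[4.0, 0.0], [3.5, 0.2], [0.1, 4.0], [0.0, 3.8]]
queries = [[2.0, 0.0], [0.0, 2.0]]

assign, tokens = project(features, queries)
fused = fuse(features, assign, reason(tokens))
print("overlap between token 0 and token 1:", overlap(assign, 0, 1))
```

In this toy setting each pixel is assigned almost entirely to one token, so the overlap is close to zero. A trained component would need explicit losses or constraints to encourage that behavior, which this sketch does not model.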