Paper Title
Improving Pixel-Level Contrastive Learning by Leveraging Exogenous Depth Information
Paper Authors
Paper Abstract
Self-supervised representation learning based on Contrastive Learning (CL) has been the subject of much attention in recent years. This is due to the excellent results obtained on a variety of downstream tasks (in particular classification) without requiring a large amount of labeled samples. However, most reference CL algorithms (such as SimCLR and MoCo, but also BYOL and Barlow Twins) are not adapted to pixel-level downstream tasks. One existing solution, known as PixPro, proposes a pixel-level approach that filters positive/negative pairs of crops from the same image based on the distance between the crops in the whole image. We argue that this idea can be further enhanced by incorporating semantic information provided by exogenous data as an additional selection filter, used at training time to improve the selection of pixel-level positive/negative samples. In this paper we focus on depth information, which can be obtained from a depth estimation network or measured from available data (stereovision, parallax motion, LiDAR, etc.). Scene depth provides meaningful cues for distinguishing pixels that belong to different objects. We show that using this exogenous information in the contrastive loss leads to improved results and that the learned representations better follow the shapes of objects. In addition, we introduce a multi-scale loss that alleviates the issue of finding training parameters adapted to different object sizes. We demonstrate the effectiveness of our ideas on Breakout Segmentation on Borehole Images, where we achieve an improvement of 1.9\% over PixPro and nearly 5\% over the supervised baseline. We further validate our technique on indoor scene segmentation with ScanNet and outdoor scenes with CityScapes (1.6\% and 1.1\% improvement over PixPro, respectively).
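The core idea described above, filtering pixel-level positive/negative pairs with both a spatial distance rule (as in PixPro) and a depth-similarity rule, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the InfoNCE-style loss shape, and the two thresholds (`dist_thresh`, `depth_thresh`) are illustrative assumptions; the paper's actual loss and parameters may differ.

```python
import numpy as np

def depth_filtered_pixel_contrast(feats, coords, depth,
                                  dist_thresh=1.0, depth_thresh=0.1,
                                  temperature=0.3):
    """Hypothetical sketch of a depth-aware pixel-level contrastive loss.

    feats:  (N, D) pixel embeddings
    coords: (N, 2) pixel coordinates in the original image
    depth:  (N,)   exogenous depth value per pixel

    A pair (i, j) is treated as positive when the pixels are spatially
    close (the PixPro-style distance filter) AND their depths are
    similar (the exogenous depth filter); all remaining off-diagonal
    pairs act as negatives in an InfoNCE-style loss.
    """
    # L2-normalize embeddings, then compute pairwise cosine similarities.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T / temperature                      # (N, N)

    # Spatial proximity filter (distance between pixels in the image).
    d_spatial = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    # Depth similarity filter (the additional exogenous cue).
    d_depth = np.abs(depth[:, None] - depth[None, :])
    pos = (d_spatial < dist_thresh) & (d_depth < depth_thresh)
    np.fill_diagonal(pos, False)

    # InfoNCE over rows that have at least one positive.
    logits = sim - sim.max(axis=1, keepdims=True)            # numerical stability
    exp = np.exp(logits)
    np.fill_diagonal(exp, 0.0)
    denom = exp.sum(axis=1)
    losses = []
    for i in range(len(feats)):
        if pos[i].any():
            p = exp[i, pos[i]].sum()
            losses.append(-np.log(p / denom[i]))
    return float(np.mean(losses)) if losses else 0.0
```

The multi-scale loss mentioned in the abstract would correspond to evaluating such a term at several feature-map resolutions (with thresholds rescaled per level) and summing the results, so that no single `dist_thresh` has to fit all object sizes.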