Discriminative Sampling of Proposals in Self-Supervised Transformers for Weakly Supervised Object Localization

Paper Title

Discriminative Sampling of Proposals in Self-Supervised Transformers for Weakly Supervised Object Localization

Authors

Shakeeb Murtaza, Soufiane Belharbi, Marco Pedersoli, Aydin Sarraf, Eric Granger

Abstract

Drones are employed in a growing number of visual recognition applications. A recent development in cell tower inspection is drone-based asset surveillance, where the autonomous flight of a drone is guided by localizing objects of interest in successive aerial images. In this paper, we propose a method to train deep weakly supervised object localization (WSOL) models based only on image-class labels to locate objects with high confidence. To train our localizer, pseudo-labels are efficiently harvested from self-supervised vision transformers (SSTs). However, since SSTs decompose the scene into multiple maps containing various object parts, and do not rely on any explicit supervisory signal, they cannot distinguish between the object of interest and other objects, as required for WSOL. To address this issue, we propose leveraging the multiple maps generated by the different transformer heads to acquire pseudo-labels for training a deep WSOL model. In particular, a new Discriminative Proposals Sampling (DiPS) method is introduced that relies on a CNN classifier to identify discriminative regions. Then, foreground and background pixels are sampled from these regions in order to train a WSOL model for generating activation maps that can accurately localize objects belonging to a specific class. Empirical results on the challenging TelDrone dataset indicate that our proposed approach can outperform state-of-the-art methods across a wide range of threshold values applied to the produced maps. We also computed results on the CUB dataset, showing that our method can be adapted for other tasks.
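
The abstract describes two steps: choosing discriminative regions among the maps produced by the transformer heads, then sampling foreground and background pixels from them as pseudo-labels. The toy sketch below illustrates that general idea only and is not the authors' implementation. The function names, the `score_fn` stand-in for the CNN classifier, and the rule of splitting locations into a top half (foreground pool) and a bottom half (background pool) are all assumptions made for illustration.

```python
import random


def pick_discriminative_map(maps, score_fn):
    """Return the index of the candidate map that score_fn rates highest.

    score_fn is a hypothetical stand-in for a classifier's confidence in the
    target class when the image is masked by a candidate map.
    """
    scores = [score_fn(m) for m in maps]
    return max(range(len(maps)), key=scores.__getitem__)


def sample_pixels(attn_map, n_fg, n_bg, seed=0):
    """Sample foreground and background pixel coordinates from one map.

    Assumption for this sketch: the more activated half of the locations forms
    the foreground pool and the less activated half forms the background pool.
    Returns two lists of (row, col) tuples.
    """
    rng = random.Random(seed)
    coords = [(r, c) for r, row in enumerate(attn_map) for c in range(len(row))]
    # Order locations by activation, strongest first.
    coords.sort(key=lambda rc: attn_map[rc[0]][rc[1]], reverse=True)
    half = len(coords) // 2
    fg_pool, bg_pool = coords[:half], coords[half:]
    fg = rng.sample(fg_pool, min(n_fg, len(fg_pool)))
    bg = rng.sample(bg_pool, min(n_bg, len(bg_pool)))
    return fg, bg


# Toy usage: two 2x2 "head" maps; the score is simply the total activation.
maps = [
    [[0.1, 0.2], [0.1, 0.0]],
    [[0.9, 0.8], [0.1, 0.0]],
]
best = pick_discriminative_map(maps, score_fn=lambda m: sum(map(sum, m)))
fg, bg = sample_pixels(maps[best], n_fg=1, n_bg=1)
```

In a real pipeline the sampled coordinates would serve as sparse pixel-level pseudo-labels for training the localizer. The scoring and sampling rules here are placeholders for whatever the paper actually specifies.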
