Paper Title
Robots Autonomously Detecting People: A Multimodal Deep Contrastive Learning Method Robust to Intraclass Variations
Authors
Abstract
Robotic detection of people in crowded and/or cluttered human-centered environments, including hospitals, long-term care, stores, and airports, is challenging because people can be occluded by other people or objects and can appear deformed due to variations in clothing or pose. Poor lighting can also cause a loss of discriminative visual features. In this paper, we present a novel multimodal person detection architecture to address the mobile robot problem of person detection under intraclass variations. We present a two-stage training approach using 1) a unique pretraining method we define as Temporal Invariant Multimodal Contrastive Learning (TimCLR), and 2) a Multimodal Faster R-CNN (MFRCNN) detector. TimCLR learns person representations that are invariant under intraclass variations through unsupervised learning. Our approach is unique in that it generates image pairs from natural variations within multimodal image sequences, in addition to synthetic data augmentation, and contrasts crossmodal features to transfer invariances between different modalities. The MFRCNN detector finetunes these pretrained features and uses them for person detection from RGB-D images. Extensive experiments validate the performance of our deep learning architecture in both crowded and cluttered human-centered environments. Results show that our method outperforms existing unimodal and multimodal person detection approaches in detection accuracy for people with body occlusions and pose deformations under different lighting conditions.
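The abstract does not specify the TimCLR loss, so the following is only a minimal sketch of the general idea, assuming a standard InfoNCE-style contrastive objective. The function names (`info_nce`, `crossmodal_temporal_loss`), the temperature value, and the equal weighting of the two terms are illustrative assumptions, not the authors' implementation. It shows how embeddings of the same person taken from two frames of a sequence (temporal pair) and from two modalities (RGB and depth) could each be treated as positive pairs.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(anchor, positive, temperature=0.1):
    """Generic InfoNCE loss (assumed here, not taken from the paper).

    anchor[i] and positive[i] form a positive pair; every other row of
    `positive` acts as a negative for anchor[i].
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    logits = a @ p.T / temperature                       # shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def crossmodal_temporal_loss(rgb_t, depth_t, rgb_t2, temperature=0.1):
    """Hypothetical combination of a temporal term and a crossmodal term.

    rgb_t   : RGB embeddings at time t
    depth_t : depth embeddings of the same crops at time t
    rgb_t2  : RGB embeddings of the same people at a later frame
    The equal weighting of the two terms is an assumption.
    """
    temporal = info_nce(rgb_t, rgb_t2, temperature)
    crossmodal = info_nce(rgb_t, depth_t, temperature)
    return temporal + crossmodal


# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
rgb_t = rng.normal(size=(8, 16))
depth_t = rgb_t + 0.05 * rng.normal(size=(8, 16))  # nearly aligned modality
rgb_t2 = rgb_t + 0.05 * rng.normal(size=(8, 16))   # same people, later frame
loss = crossmodal_temporal_loss(rgb_t, depth_t, rgb_t2)
```

In this sketch, the loss is small when matching rows agree across time and modality and grows when they are misaligned, which is the behavior a contrastive pretraining stage relies on before a detector is finetuned on the learned features.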