Paper Title
A Convolutional Baseline for Person Re-Identification Using Vision and Language Descriptions
Paper Authors
Paper Abstract
Classical person re-identification approaches assume that a person of interest has appeared across different cameras and can be queried with one of the existing images. However, in real-world surveillance scenarios, visual information about the queried person is frequently unavailable. In such cases, a natural language description of the person given by a witness provides the only source of information for retrieval. In this work, person re-identification using both vision and language information is addressed under all possible gallery and query scenarios. A two-stream deep convolutional neural network framework supervised by a cross-entropy loss is presented. The weights connecting the second-to-last layer to the final layer of class probabilities, i.e., the logits of the softmax layer, are shared between the two networks. Canonical Correlation Analysis is performed to enhance the correlation between the two modalities in a joint latent embedding space. To investigate the benefits of the proposed approach, a new testing protocol under a multimodal ReID setting is proposed for the test splits of the CUHK-PEDES and CUHK-SYSU benchmarks. The experimental results verify the merits of the proposed system. The learnt visual representations are more robust and perform 22% better during retrieval compared to a single-modality system. Retrieval with a multimodal query greatly enhances the re-identification capability of the system, both quantitatively and qualitatively.
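The core architectural idea in the abstract is a two-stream network (one visual stream, one language stream) whose final softmax-logit weights are tied across the two streams, with both streams trained under a cross-entropy identity loss. Below is a minimal sketch of that weight-sharing idea, assuming a PyTorch-style implementation; the module choices, dimensions, and toy encoders are illustrative assumptions and not the authors' code.

```python
import torch
import torch.nn as nn

class TwoStreamReID(nn.Module):
    def __init__(self, num_identities, embed_dim=512, vocab_size=5000):
        super().__init__()
        # Visual stream: placeholder for a deep CNN backbone (e.g. a ResNet).
        self.visual_stream = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Language stream: placeholder encoder for the witness description.
        self.text_embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        self.text_stream = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU())
        # A single classifier used by both streams: its weights are the
        # softmax-logit weights shared across the two modalities.
        self.shared_classifier = nn.Linear(embed_dim, num_identities)

    def forward(self, images, token_ids):
        v = self.visual_stream(images)                          # visual embedding
        t = self.text_stream(self.text_embedding(token_ids))    # text embedding
        return self.shared_classifier(v), self.shared_classifier(t), v, t

# One training step: cross-entropy on both streams against the same identity label.
model = TwoStreamReID(num_identities=100)
criterion = nn.CrossEntropyLoss()
images = torch.randn(8, 3, 128, 64)               # batch of person crops
tokens = torch.randint(0, 5000, (8, 20))          # batch of tokenised descriptions
labels = torch.randint(0, 100, (8,))
logits_v, logits_t, emb_v, emb_t = model(images, tokens)
loss = criterion(logits_v, labels) + criterion(logits_t, labels)
loss.backward()
```

The CCA step described in the abstract would operate on the paired embeddings (here `emb_v`, `emb_t`) collected over the training set, for example with `sklearn.cross_decomposition.CCA`, to project both modalities into a joint latent space used for cross-modal retrieval; the exact dimensionality and fitting procedure are not specified in the abstract.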