Paper Title
Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval
Paper Authors
Paper Abstract
With the recent boom of video-based social platforms (e.g., YouTube and TikTok), video retrieval with sentence queries has become an important demand and has attracted increasing research attention. Despite their decent performance, existing text-video retrieval models in the vision and language communities are impractical for large-scale Web search because they rely on brute-force search over high-dimensional embeddings. To improve efficiency, Web search engines widely apply vector compression libraries (e.g., FAISS) to post-process the learned embeddings. Unfortunately, compressing the embeddings separately from feature encoding degrades the robustness of the representations and incurs performance decay. To pursue a better balance between performance and efficiency, we propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ). Specifically, HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings of texts and videos and preserve comprehensive semantic information. By performing Asymmetric-Quantized Contrastive Learning (AQ-CL) across views, HCQ aligns texts and videos at the coarse-grained level and at multiple fine-grained levels. This hybrid-grained learning strategy serves as strong supervision for the cross-view video quantization model, where contrastive learning at different levels can mutually promote one another. Extensive experiments on three Web video benchmark datasets demonstrate that HCQ achieves performance competitive with state-of-the-art non-compressed retrieval methods while showing high storage and computational efficiency. Code and configurations are available at https://github.com/gimpong/WWW22-HCQ.
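To make the quantized-retrieval idea concrete, below is a minimal, illustrative PyTorch sketch of contrasting continuous embeddings from one view against quantized embeddings from the other view. The embedding dimension, codebook sizes, soft-assignment scheme, and shared quantizer are assumptions chosen for illustration; this is not the HCQ architecture or its hybrid coarse/fine-grained design, for which the official repository above should be consulted.

```python
# Illustrative sketch only: a generic asymmetric-quantized contrastive loss
# between text and video embeddings. Dimensions, codebook sizes, and the
# soft-assignment scheme are assumptions, not the HCQ implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProductQuantizer(nn.Module):
    """Soft product quantization: split an embedding into sub-vectors and
    reconstruct each one from a learnable codebook via softmax attention."""

    def __init__(self, dim=256, num_books=4, num_codes=256, temperature=0.1):
        super().__init__()
        assert dim % num_books == 0
        self.num_books = num_books
        self.sub_dim = dim // num_books
        self.temperature = temperature
        # One codebook with `num_codes` codewords per sub-space.
        self.codebooks = nn.Parameter(
            torch.randn(num_books, num_codes, self.sub_dim) * 0.05
        )

    def forward(self, x):
        # x: (batch, dim) -> (batch, num_books, sub_dim)
        sub = x.view(x.size(0), self.num_books, self.sub_dim)
        # Similarity of each sub-vector to every codeword in its codebook.
        logits = torch.einsum("bmd,mkd->bmk", sub, self.codebooks)
        weights = F.softmax(logits / self.temperature, dim=-1)
        # Soft reconstruction from the codebooks: a differentiable surrogate
        # for the hard codeword assignment used at indexing time.
        quantized = torch.einsum("bmk,mkd->bmd", weights, self.codebooks)
        return quantized.reshape(x.size(0), -1)


def asymmetric_contrastive_loss(text_emb, video_emb, quantizer, tau=0.07):
    """InfoNCE between the continuous embeddings of one view and the
    quantized embeddings of the other view, in both directions."""
    text_emb = F.normalize(text_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)
    q_text = F.normalize(quantizer(text_emb), dim=-1)
    q_video = F.normalize(quantizer(video_emb), dim=-1)

    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Text query vs. quantized video database, and vice versa.
    logits_t2v = text_emb @ q_video.t() / tau
    logits_v2t = video_emb @ q_text.t() / tau
    return 0.5 * (
        F.cross_entropy(logits_t2v, targets)
        + F.cross_entropy(logits_v2t, targets)
    )


if __name__ == "__main__":
    quantizer = ProductQuantizer(dim=256)
    text_emb = torch.randn(8, 256)   # stand-ins for transformer outputs
    video_emb = torch.randn(8, 256)
    loss = asymmetric_contrastive_loss(text_emb, video_emb, quantizer)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

In such quantization-based retrieval schemes, only the codeword indices of database items need to be stored, and query-to-item similarity reduces to look-ups in a precomputed query-to-codeword table, which is where the storage and computational savings over brute-force search on high-dimensional embeddings come from.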