Paper Title
CenterCLIP: Token Clustering for Efficient Text-Video Retrieval
Paper Authors
Paper Abstract
Recently, large-scale pre-training methods like CLIP have made great progress in multi-modal research such as text-video retrieval. In CLIP, transformers are vital for modeling complex multi-modal relations. However, in the vision transformer of CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive and similar frames in videos. This significantly increases computation costs and hinders the deployment of video retrieval models in web applications. In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones. As frame redundancy occurs mostly in consecutive frames, we divide videos into multiple segments and conduct segment-level clustering. Center tokens from each segment are later concatenated into a new sequence, while their original spatial-temporal relations are well maintained. We instantiate two clustering algorithms to efficiently find deterministic medoids and iteratively partition groups in high-dimensional space. Through this token clustering and center selection procedure, we successfully reduce computation costs by removing redundant visual tokens. This method further enhances segment-level semantic alignment between video and text representations, enforcing the spatio-temporal interactions of tokens from within-segment frames. Our method, coined CenterCLIP, surpasses the existing state of the art by a large margin on typical text-video benchmarks, while reducing the training memory cost by 35\% and accelerating the inference speed by 14\% in the best case. The code is available at \href{https://github.com/mzhaoshuai/CenterCLIP}{https://github.com/mzhaoshuai/CenterCLIP}.
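To make the segment-level clustering idea concrete, below is a minimal sketch (not the authors' implementation) of how visual tokens from a video could be grouped by segment and reduced to representative center tokens. It assumes token tensors of shape [num_frames, tokens_per_frame, dim] and hypothetical parameters `num_segments` and `centers_per_segment`; the medoid step is simplified to a one-shot selection rather than the paper's full deterministic or iterative clustering.

```python
# A hedged sketch of multi-segment token clustering with medoid-style
# center selection. All names and parameters here are illustrative
# assumptions, not the CenterCLIP codebase API.
import torch


def cluster_segment_tokens(tokens, num_segments=4, centers_per_segment=49):
    """Split frames into segments, pick center tokens per segment, concatenate."""
    num_frames, tokens_per_frame, dim = tokens.shape
    frames_per_segment = num_frames // num_segments
    centers = []
    for s in range(num_segments):
        # Flatten all tokens of the frames belonging to this segment.
        seg = tokens[s * frames_per_segment:(s + 1) * frames_per_segment]
        seg = seg.reshape(-1, dim)                       # [N, dim]
        # Pairwise distances between tokens within the segment.
        dist = torch.cdist(seg, seg)                     # [N, N]
        # Simplified medoid selection: keep the tokens with the smallest
        # total distance to all other tokens in the segment.
        scores = dist.sum(dim=1)
        idx = scores.topk(centers_per_segment, largest=False).indices
        centers.append(seg[idx])
    # Concatenate segment centers into a shorter token sequence,
    # preserving segment (temporal) order.
    return torch.cat(centers, dim=0)  # [num_segments * centers_per_segment, dim]


if __name__ == "__main__":
    toks = torch.randn(12, 50, 512)   # e.g., 12 frames with 50 patch tokens each
    reduced = cluster_segment_tokens(toks)
    print(reduced.shape)              # torch.Size([196, 512])
```

In this toy example, 600 visual tokens are reduced to 196 while keeping per-segment representatives in temporal order, which illustrates how dropping redundant tokens shortens the transformer input sequence and lowers compute.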