Paper Title

Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames

Paper Authors

Ning Han, Xun Yang, Ee-Peng Lim, Hao Chen, Qianru Sun

Paper Abstract

Cross-modal video retrieval aims to retrieve the semantically relevant videos given a text as a query, and is one of the fundamental tasks in Multimedia. Most top-performing methods primarily leverage the Visual Transformer (ViT) to extract video features [1, 2, 3], suffering from the high computational complexity of ViT, especially when encoding long videos. A common and simple solution is to uniformly sample a small number (say, 4 or 8) of frames from the video (instead of using the whole video) as input to ViT. The number of frames has a strong influence on the performance of ViT, e.g., using 8 frames performs better than using 4 frames yet needs more computational resources, resulting in a trade-off. To get free from this trade-off, this paper introduces an automatic video compression method based on a bilevel optimization program (BOP) consisting of both model-level (i.e., base-level) and frame-level (i.e., meta-level) optimizations. The model level learns a cross-modal video retrieval model whose input is the "compressed frames" learned by frame-level optimization. In turn, the frame-level optimization proceeds by gradient descent on the meta loss of the video retrieval model computed on the whole video. We call this BOP method, as well as the "compressed frames", Meta-Optimized Frames (MOF). By incorporating MOF, the video retrieval model is able to utilize the information of whole videos (for training) while taking only a small number of input frames in actual implementation. The convergence of MOF is guaranteed by meta gradient descent algorithms. For evaluation, we conduct extensive experiments on cross-modal video retrieval on three large-scale benchmarks: MSR-VTT, MSVD, and DiDeMo. Our results show that MOF is a generic and efficient method that boosts multiple baseline methods, and achieves new state-of-the-art performance.
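The alternating bilevel scheme described above (base level: train the retrieval model on the compressed frames; meta level: update the compressed frames so the model behaves as it would on the whole video) can be sketched in a toy form. Everything below is illustrative and hypothetical: the linear "retrieval model", the squared-error losses, and the direct frame-level gradient (a first-order stand-in for the true meta-gradient through the base update) are not the paper's actual architecture or objective.

```python
import numpy as np

# Toy sketch of the bilevel optimization program (BOP), assuming made-up
# shapes and losses; it only illustrates the alternation of the two levels.
rng = np.random.default_rng(0)
T, k, d = 16, 4, 8                       # whole video: T frames; compressed: k
video = rng.normal(size=(T, d))          # per-frame features of the whole video
text = rng.normal(size=d)                # pretend text-query embedding
Z = video[:: T // k].copy()              # init "compressed frames" by uniform sampling
w = rng.normal(size=d)                   # toy retrieval model: score = mean(Z) @ w

lr_w, lr_z = 0.005, 0.05
for _ in range(500):
    # Base (model) level: fit the retrieval model on the compressed frames.
    s = (Z.mean(0) - text) @ w           # residual of the toy retrieval loss
    w -= lr_w * 2 * s * (Z.mean(0) - text)

    # Meta (frame) level: move the compressed frames so the model's output
    # on them matches its output on the whole video (first-order stand-in
    # for differentiating the meta loss through the base update).
    e = (Z.mean(0) - video.mean(0)) @ w
    Z -= lr_z * 2 * e * w / k            # same gradient applies to every row of Z

# After training, the k learned frames stand in for the whole video.
print(abs((Z.mean(0) - video.mean(0)) @ w))   # whole-video alignment error
```

The key point of the alternation is that the frame-level step sees a signal computed on the whole video, so at inference the model only ingests the k compressed frames while having been trained against whole-video information.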
