Title
Scene Consistency Representation Learning for Video Scene Segmentation
Authors
Abstract
A long-term video, such as a movie or TV show, is composed of various scenes, each of which represents a series of shots sharing the same semantic story. Spotting the correct scene boundaries in a long-term video is a challenging task, since a model must understand the storyline of the video to figure out where a scene starts and ends. To this end, we propose an effective Self-Supervised Learning (SSL) framework to learn better shot representations from unlabeled long-term videos. More specifically, we present an SSL scheme to achieve scene consistency, while exploring considerable data augmentation and shuffling methods to boost the model's generalizability. Instead of explicitly learning scene boundary features as in previous methods, we introduce a vanilla temporal model with less inductive bias to verify the quality of the shot features. Our method achieves state-of-the-art performance on the task of Video Scene Segmentation. Additionally, we suggest a fairer and more reasonable benchmark to evaluate the performance of Video Scene Segmentation methods. The code is made available.
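The abstract does not spell out the scene-consistency objective, but SSL schemes of this kind are commonly built on a contrastive (InfoNCE-style) loss that pulls a shot's embedding toward that of another shot from the same scene neighborhood (e.g. an augmented or shuffled nearby shot) and pushes it away from shots of other scenes in the batch. The sketch below is a hypothetical illustration of that idea, not the paper's exact loss; the function name and the assumption that positives sit on the diagonal of the similarity matrix are ours.

```python
import numpy as np

def scene_consistency_loss(shot_emb, pos_emb, temperature=0.1):
    """Hypothetical InfoNCE-style scene-consistency loss (a sketch,
    not the paper's exact objective).

    shot_emb : (batch, dim) shot embeddings.
    pos_emb  : (batch, dim) embeddings of positive shots, i.e. shots
               assumed to come from the same scene (row i of pos_emb
               is the positive for row i of shot_emb).
    """
    # L2-normalize so the dot product is a cosine similarity.
    z1 = shot_emb / np.linalg.norm(shot_emb, axis=1, keepdims=True)
    z2 = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)

    # Pairwise similarities; positives lie on the diagonal.
    logits = (z1 @ z2.T) / temperature

    # Softmax cross-entropy against the diagonal targets.
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss makes each shot's nearest neighbor in embedding space a shot from its own scene, which is the property the downstream temporal model can then exploit to locate scene boundaries.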