Paper Title
Making a Case for 3D Convolutions for Object Segmentation in Videos
Paper Authors
Paper Abstract
The task of object segmentation in videos is usually accomplished by processing appearance and motion information separately using standard 2D convolutional networks, followed by a learned fusion of the two sources of information. On the other hand, 3D convolutional networks have been successfully applied to video classification tasks, but have not been leveraged as effectively for problems involving dense, per-pixel interpretation of videos, where they lag behind their 2D convolutional counterparts in performance. In this work, we show that 3D CNNs can be effectively applied to dense video prediction tasks such as salient object segmentation. We propose a simple yet effective encoder-decoder network architecture consisting entirely of 3D convolutions that can be trained end-to-end using a standard cross-entropy loss. To this end, we leverage an efficient 3D encoder and propose a 3D decoder architecture that comprises novel 3D Global Convolution layers and 3D Refinement modules. Our approach outperforms the existing state of the art by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal dataset benchmarks, in addition to being faster, thus showing that our architecture can efficiently learn expressive spatio-temporal features and produce high-quality video segmentation masks. We have made our code and trained models publicly available at https://github.com/sabarim/3DC-Seg.
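As a rough illustration of the core operation the abstract refers to (and not the paper's actual implementation), a naive 3D convolution slides a spatio-temporal kernel over a video volume, aggregating information across time as well as space. The sketch below, in plain NumPy with a single-channel clip and no padding or stride, is an assumption-laden toy version of what efficient 3D CNN layers do:

```python
import numpy as np

def conv3d(video, kernel):
    """Valid (no padding, stride 1) 3D convolution over a (T, H, W) volume.

    video:  (T, H, W) array, e.g. grayscale frames of a short clip
    kernel: (kt, kh, kw) spatio-temporal filter
    Returns a (T-kt+1, H-kh+1, W-kw+1) feature volume.
    """
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # Each output voxel pools a kt x kh x kw spatio-temporal window.
                out[t, y, x] = np.sum(video[t:t+kt, y:y+kh, x:x+kw] * kernel)
    return out

clip = np.random.rand(8, 32, 32)   # 8 frames of 32x32 pixels (hypothetical input)
k = np.random.rand(3, 3, 3)        # a 3x3x3 spatio-temporal kernel
feat = conv3d(clip, k)
print(feat.shape)                  # (6, 30, 30)
```

In a real network this operation runs over many channels with learned kernels (e.g. `torch.nn.Conv3d`); the point here is only that a single filter already mixes temporal and spatial context, which is what lets a purely 3D-convolutional encoder-decoder learn spatio-temporal features without a separate motion stream.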