Title
RayMVSNet: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo
Authors
Abstract
Learning-based multi-view stereo (MVS) has so far centered on 3D convolution over cost volumes. Due to the high computation and memory consumption of 3D CNNs, the resolution of the output depth is often considerably limited. Unlike most existing works dedicated to adaptive refinement of cost volumes, we opt to directly optimize the depth value along each camera ray, mimicking the range (depth) finding of a laser scanner. This reduces MVS to a ray-based depth optimization that is much more lightweight than full cost volume optimization. In particular, we propose RayMVSNet, which learns sequential prediction of a 1D implicit field along each camera ray, with the zero-crossing point indicating scene depth. This sequential modeling, conducted on transformer features, essentially learns the epipolar line search of traditional multi-view stereo. We also devise a multi-task learning scheme for better optimization convergence and depth accuracy. Our method ranks first on both the DTU and Tanks & Temples datasets among all previous learning-based methods, achieving an overall reconstruction score of 0.33mm on DTU and an F-score of 59.48% on Tanks & Temples.
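The core decoding step described above, recovering depth as the zero-crossing of a predicted 1D implicit field sampled along a camera ray, can be sketched as follows. This is an illustrative assumption of how such decoding might look, not the paper's actual implementation; the function name, the sign convention, and the linear-interpolation detail are hypothetical.

```python
import numpy as np

def depth_from_zero_crossing(t_samples, field_values):
    """Recover ray depth as the zero-crossing of a 1D implicit field.

    t_samples: monotonically increasing depths sampled along the ray.
    field_values: predicted signed field values at those depths
                  (assumed positive in front of the surface, negative behind).
    Returns the linearly interpolated depth of the first sign change,
    or None if the field never crosses zero.
    """
    signs = np.sign(field_values)
    # indices i where the field changes sign between sample i and i + 1
    crossings = np.where(signs[:-1] * signs[1:] < 0)[0]
    if crossings.size == 0:
        return None
    i = crossings[0]
    # linear interpolation between the two bracketing samples
    t0, t1 = t_samples[i], t_samples[i + 1]
    v0, v1 = field_values[i], field_values[i + 1]
    return t0 + (t1 - t0) * v0 / (v0 - v1)

# toy field f(t) = t - 2.3 crosses zero at depth 2.3
t = np.linspace(0.0, 5.0, 11)
print(depth_from_zero_crossing(t, t - 2.3))
```

In the actual method, the field values at each sample would come from the learned sequential (transformer-based) predictor rather than an analytic function as in this toy example.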