Paper Title
Is Attention All That NeRF Needs?
Paper Authors
Paper Abstract
We present Generalizable NeRF Transformer (GNT), a transformer-based architecture that reconstructs Neural Radiance Fields (NeRFs) and learns to render novel views on the fly from source views. While prior works on NeRFs optimize a scene representation by inverting a handcrafted rendering equation, GNT achieves neural representation and rendering that generalize across scenes using transformers at two stages. (1) The view transformer leverages multi-view geometry as an inductive bias for attention-based scene representation, and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. (2) The ray transformer renders novel views by using attention to decode the features produced by the view transformer along the points sampled during ray marching. Our experiments demonstrate that when optimized on a single scene, GNT can successfully reconstruct NeRF without an explicit rendering formula, thanks to the learned ray renderer. When trained on multiple scenes, GNT consistently achieves state-of-the-art performance when transferring to unseen scenes, outperforming all other methods by ~10% on average. Our analysis of the learned attention maps to infer depth and occlusion indicates that attention enables learning a physically grounded rendering. Our results show the promise of transformers as a universal modeling tool for graphics. Please refer to our project page for video results: https://vita-group.github.io/GNT/.
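To make the two-stage design in the abstract concrete, here is a minimal PyTorch sketch of the data flow. This is not the authors' implementation: the class names (ViewTransformer, RayTransformer), the layer counts, the mean-pooling steps, and all tensor shapes are illustrative assumptions; the actual model is described in the paper and released at the project page above.

```python
# Minimal sketch of GNT's two attention stages, under assumed shapes.
# All module names, dimensions, and pooling choices are hypothetical.
import torch
import torch.nn as nn


class ViewTransformer(nn.Module):
    """Stage 1: aggregate features sampled along epipolar lines of the
    neighboring source views into one coordinate-aligned feature per
    3D point, using attention across views."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, epipolar_feats):
        # epipolar_feats: (n_points, n_views, d_model) -- image features
        # fetched where each sampled 3D point projects into each source view.
        out, _ = self.attn(epipolar_feats, epipolar_feats, epipolar_feats)
        out = self.norm(out + epipolar_feats)
        # Pool across views to get one feature per 3D point.
        return out.mean(dim=1)  # (n_points, d_model)


class RayTransformer(nn.Module):
    """Stage 2: decode a pixel color by attending over the per-point
    features along a ray, in place of the handcrafted volume-rendering
    sum; this stands in for the paper's learned ray renderer."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_rgb = nn.Linear(d_model, 3)

    def forward(self, point_feats):
        # point_feats: (n_rays, n_points, d_model), ordered near to far.
        tokens = self.encoder(point_feats)
        # Mean-pool along the ray, then decode RGB; only the data flow
        # is faithful here, not the exact decoding head.
        return torch.sigmoid(self.to_rgb(tokens.mean(dim=1)))  # (n_rays, 3)


if __name__ == "__main__":
    n_rays, n_points, n_views, d = 2, 64, 8, 64
    epipolar = torch.randn(n_rays * n_points, n_views, d)
    per_point = ViewTransformer(d)(epipolar).view(n_rays, n_points, d)
    rgb = RayTransformer(d)(per_point)
    print(rgb.shape)  # torch.Size([2, 3])
```

The point the abstract emphasizes is visible in stage 2: the per-ray aggregation is learned attention rather than the fixed alpha-compositing sum of the classic volume rendering equation, which is what lets GNT render without an explicit rendering formula.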