Paper Title
High-Fidelity Visual Structural Inspections through Transformers and Learnable Resizers
Paper Authors
Paper Abstract
Visual inspection is the predominant technique for evaluating the condition of civil infrastructure. Recent advances in unmanned aerial vehicles (UAVs) and artificial intelligence have made visual inspections faster, safer, and more reliable. Camera-equipped UAVs are becoming the new standard in the industry by collecting massive amounts of visual data for human inspectors. Meanwhile, there has been significant research on autonomous visual inspections using deep learning algorithms, including semantic segmentation. While UAVs can capture high-resolution images of buildings' façades, high-resolution segmentation is extremely challenging due to its high computational memory demands. Typically, images are uniformly downsized at the price of losing fine local details. Conversely, breaking the images into multiple smaller patches can cause a loss of global contextual information. We propose a hybrid strategy that can adapt to different inspection tasks by managing the trade-off between global and local semantics. The framework comprises a compound, high-resolution deep learning architecture equipped with an attention-based segmentation model and learnable downsampler-upsampler modules designed for optimal efficiency and information retention. The framework also utilizes vision transformers on a grid of image crops, aiming for high-precision learning without downsizing. An augmented inference technique is used to boost performance and reduce the possible loss of context due to grid cropping. Comprehensive experiments have been performed on synthetic environments built from 3D physics-based graphics models in the Quake City dataset. The proposed framework is evaluated using several metrics on three segmentation tasks: component type, component damage state, and global damage (crack, rebar, spalling).
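The grid-cropping idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `grid_boxes` and `coverage`, the fixed square `tile` size, and the `overlap` parameter are all assumptions introduced here for illustration, since the abstract does not say how the crop grid is laid out. The sketch only shows how a full-resolution image can be tiled so that every pixel falls inside at least one crop.

```python
def grid_boxes(height, width, tile, overlap=0):
    """Return (top, left, bottom, right) boxes of size tile x tile covering the image.

    Hypothetical helper: the tile layout and the overlap are illustrative
    assumptions, not details taken from the paper.
    """
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    if tile > height or tile > width:
        raise ValueError("tile must fit inside the image")
    stride = tile - overlap

    def starts(length):
        # Regular steps, plus one final start flush with the far edge
        # so the last row/column of pixels is never left out.
        s = list(range(0, length - tile + 1, stride))
        if s[-1] != length - tile:
            s.append(length - tile)
        return s

    return [(t, l, t + tile, l + tile)
            for t in starts(height)
            for l in starts(width)]


def coverage(height, width, boxes):
    """Count how many boxes cover each pixel (useful for averaging overlaps)."""
    counts = [[0] * width for _ in range(height)]
    for top, left, bottom, right in boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                counts[y][x] += 1
    return counts


if __name__ == "__main__":
    boxes = grid_boxes(6, 8, tile=4, overlap=1)
    counts = coverage(6, 8, boxes)
    print(len(boxes), "crops; min coverage per pixel:",
          min(min(row) for row in counts))
```

Each crop keeps its native resolution, so fine details such as cracks are preserved, but a model that sees one crop at a time has no view of the rest of the façade. That is the loss of global context the abstract contrasts with uniform downsizing.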