Paper Title
Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing
Paper Authors
Paper Abstract
Modern deep neural networks tend to be evaluated on static test sets. One shortcoming of this is the fact that these deep neural networks cannot be easily evaluated for robustness issues with respect to specific scene variations. For example, it is hard to study the robustness of these networks to variations of object scale, object pose, scene lighting and 3D occlusions. The main reason is that collecting real datasets with fine-grained naturalistic variations of sufficient scale can be extremely time-consuming and expensive. In this work, we present Counterfactual Simulation Testing, a counterfactual framework that allows us to study the robustness of neural networks with respect to some of these naturalistic variations by building realistic synthetic scenes that allow us to ask counterfactual questions to the models, ultimately providing answers to questions such as "Would your classification still be correct if the object were viewed from the top?" or "Would your classification still be correct if the object were partially occluded by another object?". Our method allows for a fair comparison of the robustness of recently released, state-of-the-art Convolutional Neural Networks and Vision Transformers, with respect to these naturalistic variations. We find evidence that ConvNext is more robust to pose and scale variations than Swin, that ConvNext generalizes better to our simulated domain and that Swin handles partial occlusion better than ConvNext. We also find that robustness for all networks improves with network scale and with data scale and variety. We release the Naturalistic Variation Object Dataset (NVD), a large simulated dataset of 272k images of everyday objects with naturalistic variations such as object pose, scale, viewpoint, lighting and occlusions. Project page: https://counterfactualsimulation.github.io
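The abstract describes evaluating a classifier on rendered variants of the same scene while one naturalistic factor (pose, scale, lighting, occlusion) is swept. Below is a minimal sketch of that counterfactual evaluation loop, not the authors' released code: it assumes a hypothetical helper `load_variant_images` that returns PIL images of one object rendered at increasing values of a chosen factor, and it uses an off-the-shelf torchvision ConvNeXt purely as an example classifier.

```python
# Sketch (hypothetical helper names, not the paper's code) of counterfactual
# simulation testing: sweep one naturalistic factor in rendered variants of a
# scene and measure how often the prediction stays correct across the sweep.

import torch
from torchvision.models import convnext_base, ConvNeXt_Base_Weights

weights = ConvNeXt_Base_Weights.IMAGENET1K_V1
model = convnext_base(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def robustness_to_factor(variant_images, true_label):
    """Fraction of variants along one factor sweep still classified correctly."""
    correct = 0
    for img in variant_images:  # PIL images rendered at increasing factor values
        logits = model(preprocess(img).unsqueeze(0))
        correct += int(logits.argmax(dim=1).item() == true_label)
    return correct / len(variant_images)

# Hypothetical usage: images of the same chair rendered while an occluder
# slides in front of it; the score answers "would the classification still
# be correct if the object were partially occluded?"
# score = robustness_to_factor(
#     load_variant_images("chair_017", factor="occlusion"), true_label=423)
```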