Paper Title
Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods
Paper Authors
Paper Abstract
In order to gain insights into the decision-making of different visual recognition backbones, we propose two methodologies, sub-explanation counting and cross-testing, that systematically apply deep explanation algorithms on a dataset-wide basis and compare the statistics generated from the amount and nature of the explanations. These methodologies reveal differences among networks in terms of two properties, which we call compositionality and disjunctivism. Transformers and ConvNeXt are found to be more compositional, in the sense that they jointly consider multiple parts of the image in building their decisions, whereas traditional CNNs and distilled transformers are less compositional and more disjunctive, which means that they use multiple diverse but smaller sets of parts to achieve a confident prediction. Through further experiments, we pinpoint the choice of normalization as especially important for the compositionality of a model, in that batch normalization leads to less compositionality while group and layer normalization lead to more. Finally, we also analyze the features shared by different backbones and plot a landscape of different models based on their feature-use similarity.
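As a rough illustration of the sub-explanation counting idea described in the abstract, the sketch below enumerates subsets of a minimal explanation (a set of image patches) and counts how many still keep the model confident. The `model_confidence` function is a hypothetical stand-in, not the paper's actual networks or explanation algorithm; with a real model one would mask out the non-selected patches and read off the class score.

```python
from itertools import combinations

def model_confidence(patches):
    # Hypothetical stand-in for a network's confidence given a set of
    # visible image patches. Here confidence simply grows with the
    # number of patches kept, out of 4 total.
    return len(patches) / 4.0

def count_sub_explanations(explanation, threshold=0.5):
    # Count subsets of a minimal explanation that still keep the model
    # confident. Intuitively, few high-confidence subsets suggest a
    # compositional model (parts are needed jointly), while many
    # suggest a disjunctive one (several small part-sets suffice).
    count = 0
    for k in range(1, len(explanation) + 1):
        for subset in combinations(explanation, k):
            if model_confidence(subset) >= threshold:
                count += 1
    return count

# Four patches indexed 0..3; subsets of size >= 2 clear the threshold:
# 6 pairs + 4 triples + 1 full set = 11 confident sub-explanations.
print(count_sub_explanations([0, 1, 2, 3]))  # -> 11
```

Aggregating such counts over a whole dataset, rather than a single image, is what lets the paper compare backbones statistically.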