Paper Title

Cross-Architecture Knowledge Distillation

Paper Authors

Yufan Liu, Jiajiong Cao, Bing Li, Weiming Hu, Jingting Ding, Liang Li

Paper Abstract

The Transformer has attracted much attention because of its ability to learn global relations and its superior performance. To achieve higher performance, it is natural to distill complementary knowledge from a Transformer to a convolutional neural network (CNN). However, most existing knowledge distillation methods only consider homologous-architecture distillation, such as distilling knowledge from CNN to CNN, and may not be suitable when applied to cross-architecture scenarios, such as from Transformer to CNN. To deal with this problem, a novel cross-architecture knowledge distillation method is proposed. Specifically, instead of directly mimicking the output/intermediate features of the teacher, a partially cross attention projector and a group-wise linear projector are introduced to align the student features with the teacher's in two projected feature spaces. A multi-view robust training scheme is further presented to improve the robustness and stability of the framework. Extensive experiments show that the proposed method outperforms 14 state-of-the-art methods on both small-scale and large-scale datasets.
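The core idea in the abstract, aligning projected student CNN features with the Transformer teacher's features, can be illustrated with a minimal PyTorch sketch. The sketch below covers only an assumed reading of the group-wise linear projector: the class name GroupwiseLinearProjector, the number of groups, the L2 alignment loss, and the toy dimensions are illustrative assumptions rather than the authors' implementation, and the partially cross attention projector and multi-view robust training scheme are omitted.

```python
# Minimal sketch (not the authors' code): align a CNN student's feature map
# with a Transformer teacher's token features via a group-wise linear projector.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupwiseLinearProjector(nn.Module):
    """Project student CNN feature maps into the teacher's token space,
    using one linear map per channel group (an assumed reading of
    'group-wise linear projector')."""

    def __init__(self, student_channels: int, teacher_dim: int, num_groups: int = 4):
        super().__init__()
        assert student_channels % num_groups == 0 and teacher_dim % num_groups == 0
        self.num_groups = num_groups
        self.projs = nn.ModuleList(
            nn.Linear(student_channels // num_groups, teacher_dim // num_groups)
            for _ in range(num_groups)
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) -> token sequence: (B, H*W, C)
        tokens = feat.flatten(2).transpose(1, 2)
        # Split channels into groups, project each group, and concatenate.
        chunks = tokens.chunk(self.num_groups, dim=-1)
        return torch.cat([proj(x) for proj, x in zip(self.projs, chunks)], dim=-1)


def alignment_loss(student_feat, teacher_tokens, projector):
    """L2 distance between projected student features and teacher tokens."""
    return F.mse_loss(projector(student_feat), teacher_tokens)


# Toy usage with hypothetical sizes: a 14x14 student feature map (256 channels)
# aligned with 196 teacher tokens of width 384.
projector = GroupwiseLinearProjector(student_channels=256, teacher_dim=384)
student_feat = torch.randn(2, 256, 14, 14)
teacher_tokens = torch.randn(2, 196, 384)
loss = alignment_loss(student_feat, teacher_tokens, projector)
print(loss.item())
```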
