Paper Title
TransKD: Transformer Knowledge Distillation for Efficient Semantic Segmentation
Paper Authors
Paper Abstract
Semantic segmentation benchmarks in the realm of autonomous driving are dominated by large pre-trained transformers, yet their widespread adoption is impeded by substantial computational costs and prolonged training durations. To relax this constraint, we look at efficient semantic segmentation from the perspective of comprehensive knowledge distillation and aim to bridge the gap between multi-source knowledge extraction and transformer-specific patch embeddings. We put forward the Transformer-based Knowledge Distillation (TransKD) framework, which learns compact student transformers by distilling both feature maps and patch embeddings of large teacher transformers, bypassing the long pre-training process and reducing the FLOPs by >85.0%. Specifically, we propose two fundamental modules to realize feature map distillation and patch embedding distillation, respectively: (1) Cross Selective Fusion (CSF) enables knowledge transfer between cross-stage features via channel attention and feature map distillation within hierarchical transformers; (2) Patch Embedding Alignment (PEA) performs dimensional transformation within the patchifying process to facilitate the patch embedding distillation. Furthermore, we introduce two optimization modules to enhance the patch embedding distillation from different perspectives: (1) Global-Local Context Mixer (GL-Mixer) extracts both global and local information of a representative embedding; (2) Embedding Assistant (EA) acts as an embedding method to seamlessly bridge teacher and student models with the teacher's number of channels. Experiments on the Cityscapes, ACDC, NYUv2, and Pascal VOC2012 datasets show that TransKD outperforms state-of-the-art distillation frameworks and rivals time-consuming pre-training methods. The source code is publicly available at https://github.com/RuipingL/TransKD.
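To make the patch embedding distillation idea concrete, below is a minimal PyTorch sketch of an alignment-style loss in the spirit of PEA, assuming a linear projection from the student's channel width to the teacher's and an MSE penalty; the class name, tensor shapes, and projection choice are illustrative assumptions, not the authors' exact implementation (see the official repository for that).

```python
# Illustrative sketch of patch-embedding distillation (PEA-style), not the official TransKD code.
import torch
import torch.nn as nn

class PatchEmbeddingAlignment(nn.Module):
    """Projects student patch embeddings to the teacher's channel width
    and penalizes the mismatch with an MSE loss (hypothetical example)."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Dimensional transformation: map student channels to teacher channels.
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.criterion = nn.MSELoss()

    def forward(self, student_embed: torch.Tensor, teacher_embed: torch.Tensor) -> torch.Tensor:
        # student_embed: (B, N, C_s), teacher_embed: (B, N, C_t) for one transformer stage.
        aligned = self.proj(student_embed)
        # Teacher features are treated as fixed targets during distillation.
        return self.criterion(aligned, teacher_embed.detach())

# Usage sketch: distill patch embeddings of one hierarchical-transformer stage.
B, N = 2, 1024                            # batch size and patch count (hypothetical)
student_embed = torch.randn(B, N, 32)     # e.g. a small student stage width
teacher_embed = torch.randn(B, N, 64)     # e.g. the matching teacher stage width
pea = PatchEmbeddingAlignment(student_dim=32, teacher_dim=64)
loss = pea(student_embed, teacher_embed)  # added to the segmentation loss with a weight
```

In practice such a term would be computed per stage and summed with the feature-map distillation and task losses; the weighting scheme here is left unspecified, since the abstract does not state it.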