Paper Title

Knowledge Distillation Beyond Model Compression

Paper Authors

Fahad Sarfraz, Elahe Arani, Bahram Zonooz

Paper Abstract

Knowledge distillation (KD) is commonly regarded as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various techniques have been proposed since the original formulation that mimic different aspects of the teacher, such as the representation space, decision boundary, or intra-data relationships. Some methods replace the one-way knowledge distillation from a static teacher with collaborative learning among a cohort of students. Despite the recent advances, a clear understanding of where knowledge resides in a deep neural network and an optimal method for capturing knowledge from the teacher and transferring it to the student remain open questions. In this study, we provide an extensive study of nine different KD methods covering a broad spectrum of approaches to capturing and transferring knowledge. We demonstrate the versatility of the KD framework on different datasets and network architectures under varying capacity gaps between the teacher and student. The study provides intuition for the effects of mimicking different aspects of the teacher and derives insights from the performance of the different distillation approaches to guide the design of more effective KD methods. Furthermore, our study shows the effectiveness of the KD framework in learning efficiently under varying severity levels of label noise and class imbalance, consistently providing generalization gains over standard training. We emphasize that the efficacy of KD extends well beyond model compression: it should be considered a general-purpose training paradigm that offers greater robustness to common challenges in real-world datasets than the standard training procedure.
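
For context, the "original formulation" referenced in the abstract is the logit-matching objective in which the student is trained on a weighted combination of the ground-truth labels and the teacher's temperature-softened output distribution. The sketch below illustrates only this baseline objective, not the nine distillation methods evaluated in the paper; the hyperparameter names `T` (temperature) and `alpha` (loss weight) are illustrative choices, not values taken from the study.

```python
# Minimal sketch of the baseline KD objective: cross-entropy on hard labels
# plus KL divergence between temperature-softened teacher and student logits.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Hard-label supervision: standard cross-entropy with the true labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-label supervision: match the teacher's softened output distribution.
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
    return alpha * kl + (1.0 - alpha) * ce
```

In practice, the teacher's logits are computed in a `torch.no_grad()` forward pass so that only the student receives gradients; later KD variants replace or augment this logit-matching term with losses on intermediate representations, decision boundaries, or inter-sample relations, as surveyed in the paper.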
