Paper Title
A Novel Architecture Slimming Method for Network Pruning and Knowledge Distillation
Paper Authors
Paper Abstract
Network pruning and knowledge distillation are two widely known model compression methods that effectively reduce computation cost and model size. A common problem in both pruning and distillation is determining the compressed architecture, i.e., the exact number of filters per layer and the layer configuration, so as to preserve most of the original model's capacity. Despite great advances in existing works, determining an excellent architecture still requires human intervention or extensive experimentation. In this paper, we propose an architecture slimming method that automates the layer configuration process. We start from the perspective that the capacity of an over-parameterized model can be largely preserved by finding the minimum number of filters that retains the maximum parameter variance in each layer, resulting in a thin architecture. We formulate the determination of the compressed architecture as a one-step orthogonal linear transformation and apply principal component analysis (PCA), in which the variance of the filters is maximized in the first few projections. We demonstrate the soundness of our analysis and the effectiveness of the proposed method through extensive experiments. In particular, we show that under the same overall compression rate, the compressed architecture determined by our method yields significant performance gains over baselines after pruning and distillation. Surprisingly, we find that the resulting layer-wise compression rates correspond to the layer sensitivities found by existing works through extensive experimentation.
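The core idea of the abstract — choosing the smallest per-layer filter count that preserves most of the parameter variance via a PCA-style projection — can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the 95% variance threshold, the layer shape, and the function name `slim_filter_count` are illustrative assumptions.

```python
# Hedged sketch: pick a per-layer filter count by keeping the smallest
# number of principal directions that explain a target fraction of the
# parameter variance. Threshold and layer dimensions are assumptions,
# not values taken from the paper.
import numpy as np

def slim_filter_count(weights, var_ratio=0.95):
    """weights: (num_filters, fan_in) matrix of flattened conv filters.

    Returns the minimum number of principal components whose cumulative
    explained variance reaches `var_ratio`.
    """
    centered = weights - weights.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the variance captured
    # along each principal axis (singular values come sorted descending).
    s = np.linalg.svd(centered, compute_uv=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Index of the first component where cumulative variance >= var_ratio.
    return int(np.searchsorted(explained, var_ratio) + 1)

rng = np.random.default_rng(0)
# Toy layer: 64 filters of flattened size 144 (e.g. 3x3x16), built from a
# rank-8 basis plus small noise, so the layer is heavily redundant.
basis = rng.normal(size=(8, 144))
weights = rng.normal(size=(64, 8)) @ basis + 0.01 * rng.normal(size=(64, 144))
kept = slim_filter_count(weights, var_ratio=0.95)
print(kept)  # far fewer than the original 64 filters
```

Because the toy layer has low effective rank, the cumulative variance saturates after only a handful of components, so the slimmed filter count is much smaller than 64 — mirroring the abstract's claim that over-parameterized layers can be thinned while retaining most of their capacity.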