Paper Title

Quantune: Post-training Quantization of Convolutional Neural Networks using Extreme Gradient Boosting for Fast Deployment

Authors

Jemin Lee, Misun Yu, Yongin Kwon, Taeho Kim

Abstract

To adopt convolutional neural networks (CNN) for a range of resource-constrained targets, it is necessary to compress the CNN models by performing quantization, whereby precision representation is converted to a lower bit representation. To overcome problems such as sensitivity of the training dataset, high computational requirements, and large time consumption, post-training quantization methods that do not require retraining have been proposed. In addition, to compensate for the accuracy drop without retraining, previous studies on post-training quantization have proposed several complementary methods: calibration, schemes, clipping, granularity, and mixed-precision. To generate a quantized model with minimal error, it is necessary to study all possible combinations of the methods because each of them is complementary and the CNN models have different characteristics. However, an exhaustive or a heuristic search is either too time-consuming or suboptimal. To overcome this challenge, we propose an auto-tuner known as Quantune, which builds a gradient tree boosting model to accelerate the search for the configurations of quantization and reduce the quantization error. We evaluate and compare Quantune with the random, grid, and genetic algorithms. The experimental results show that Quantune reduces the search time for quantization by approximately 36.5x with an accuracy loss of 0.07 ~ 0.65% across six CNN models, including the fragile ones (MobileNet, SqueezeNet, and ShuffleNet). To support multiple targets and adopt continuously evolving quantization works, Quantune is implemented on a full-fledged compiler for deep learning as an open-sourced project.
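The abstract describes Quantune as a cost-model-guided search: a gradient tree boosting (XGBoost) model is fitted to previously measured quantization configurations and then used to choose the next configuration to evaluate. The sketch below illustrates only that general idea; the configuration space, the encode() helper, and evaluate_quantized_model() are hypothetical placeholders and do not reflect Quantune's actual implementation or API.

```python
# A minimal sketch of gradient-boosting-guided search over post-training
# quantization configurations, in the spirit of the abstract above.
# The search space, encode(), and evaluate_quantized_model() are hypothetical
# placeholders, not Quantune's actual implementation or API.
import itertools
import random

import numpy as np
import xgboost as xgb

# Hypothetical space of complementary post-training quantization options.
SPACE = list(itertools.product(
    ["minmax", "entropy", "percentile"],   # calibration method
    ["symmetric", "asymmetric"],           # quantization scheme
    [False, True],                         # clipping
    ["per_tensor", "per_channel"],         # granularity
    ["int8", "mixed"],                     # precision (uniform vs. mixed)
))
DIMS = [sorted(set(d)) for d in zip(*SPACE)]

def encode(cfg):
    # Ordinal encoding of a configuration so the booster can consume it.
    return [DIMS[i].index(v) for i, v in enumerate(cfg)]

def evaluate_quantized_model(cfg):
    # Placeholder: quantize the CNN with `cfg`, run validation data, and
    # return the resulting accuracy drop (random here, for illustration).
    return random.random()

# Seed the cost model with a few randomly measured configurations.
history = [(cfg, evaluate_quantized_model(cfg)) for cfg in random.sample(SPACE, 8)]

for _ in range(10):  # tuning iterations
    X = np.array([encode(c) for c, _ in history])
    y = np.array([err for _, err in history])
    booster = xgb.XGBRegressor(n_estimators=50, max_depth=4)
    booster.fit(X, y)  # gradient tree boosting cost model

    # Rank unexplored configurations by predicted error; measure the best one.
    tried = {c for c, _ in history}
    candidates = [c for c in SPACE if c not in tried]
    preds = booster.predict(np.array([encode(c) for c in candidates]))
    best = candidates[int(np.argmin(preds))]
    history.append((best, evaluate_quantized_model(best)))

best_cfg, best_err = min(history, key=lambda t: t[1])
print("best configuration:", best_cfg, "accuracy drop:", best_err)
```

In the actual system, evaluating a configuration means quantizing, compiling, and running the model, so the value of the boosted cost model lies in reducing the number of such expensive measurements, which is where the reported 36.5x reduction in search time comes from.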
