Paper Title
Go Wide, Then Narrow: Efficient Training of Deep Thin Networks
Paper Authors
Paper Abstract
To deploy a deep learning model into production, it must be both accurate and compact to meet latency and memory constraints. This usually calls for a network that is deep (to ensure performance) yet thin (to improve computational efficiency). In this paper, we propose an efficient method to train a deep thin network with a theoretical guarantee. Our method is motivated by model compression. It consists of three stages. First, we sufficiently widen the deep thin network and train it until convergence. Then, we use this well-trained deep wide network to warm up (or initialize) the original deep thin network. This is achieved by layerwise imitation, that is, forcing the thin network to mimic the intermediate outputs of the wide network from layer to layer. Finally, we further fine-tune this already well-initialized deep thin network. The theoretical guarantee is established using neural mean field analysis; it demonstrates the advantage of our layerwise imitation approach over backpropagation. We also conduct large-scale empirical experiments to validate the proposed method. Trained with our method, ResNet50 outperforms ResNet101, and BERT Base is comparable with BERT Large, when ResNet101 and BERT Large are trained under the standard training procedures reported in the literature.
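As a rough illustration of the layerwise-imitation warm-up stage described in the abstract, the sketch below (in PyTorch) trains a thin stack of layers to match the intermediate outputs of a frozen, already-trained wide stack, layer by layer. The toy fully connected networks, the widths, and the linear projections used to bridge the width mismatch are all hypothetical choices made for this example; this is a minimal sketch of the idea, not the authors' implementation.

```python
# Minimal sketch of layerwise imitation: warm up a deep thin network by forcing
# each of its layers to mimic the corresponding layer of a trained wide network.
# All names, widths, and the projection layers are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
WIDE, THIN, DEPTH, DIM_IN = 256, 64, 4, 32

def make_net(width, depth, dim_in):
    """Stack of simple fully connected blocks; a ModuleList lets us read
    intermediate (per-layer) outputs during imitation."""
    layers, prev = nn.ModuleList(), dim_in
    for _ in range(depth):
        layers.append(nn.Sequential(nn.Linear(prev, width), nn.ReLU()))
        prev = width
    return layers

wide_net = make_net(WIDE, DEPTH, DIM_IN)   # assumed already trained to convergence
thin_net = make_net(THIN, DEPTH, DIM_IN)   # the deep thin network to warm up

# The wide and thin layers have different widths, so intermediate outputs are
# compared through small trainable linear projections (one per layer).
projections = nn.ModuleList(nn.Linear(THIN, WIDE) for _ in range(DEPTH))

optimizer = torch.optim.Adam(
    list(thin_net.parameters()) + list(projections.parameters()), lr=1e-3)
mse = nn.MSELoss()

for step in range(100):                        # warm-up steps (toy number)
    x = torch.randn(128, DIM_IN)               # stand-in for a minibatch of inputs
    loss, h_wide, h_thin = 0.0, x, x
    for wide_layer, thin_layer, proj in zip(wide_net, thin_net, projections):
        with torch.no_grad():                  # the wide teacher stays frozen
            h_wide = wide_layer(h_wide)
        h_thin = thin_layer(h_thin)
        # Layerwise imitation: the thin layer's output should match the wide
        # layer's output at the same depth, after projecting widths.
        loss = loss + mse(proj(h_thin), h_wide)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After this warm-up, the thin network would be fine-tuned on the task labels
# (the third stage described in the abstract).
```

In this sketch the imitation losses from all layers are summed and optimized jointly; training the layers sequentially, one depth at a time, would be an equally plausible reading of "from layer to layer."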