Paper Title
Dynamically Adjusting Transformer Batch Size by Monitoring Gradient Direction Change
Paper Authors
Paper Abstract
The choice of hyper-parameters affects the performance of neural models. While much previous research (Sutskever et al., 2013; Duchi et al., 2011; Kingma and Ba, 2015) focuses on accelerating convergence and reducing the effects of the learning rate, comparatively few papers concentrate on the effect of batch size. In this paper, we analyze how increasing the batch size affects the gradient direction, and propose to evaluate the stability of gradients by their angle change. Based on our observations, the angle change of the gradient direction first tends to stabilize (i.e., gradually decrease) while accumulating mini-batches, and then starts to fluctuate. We propose to determine batch sizes automatically and dynamically by accumulating gradients of mini-batches and performing an optimization step just when the gradient direction starts to fluctuate. To improve the efficiency of our approach for large models, we propose a sampling approach that selects the gradients of parameters sensitive to the batch size. Our approach dynamically determines proper and efficient batch sizes during training. In our experiments on the WMT 14 English-to-German and English-to-French tasks, our approach improves the Transformer with a fixed 25k batch size by +0.73 and +0.82 BLEU respectively.
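To make the abstract's core idea concrete, below is a minimal PyTorch sketch of a training loop that accumulates mini-batch gradients and takes an optimizer step once the angle change of the accumulated gradient direction stops shrinking. Everything here is an illustrative assumption rather than the paper's implementation: the function names (`train_with_dynamic_batch`, `angle_between`), the `max_accum` safety cap, and the use of the full flattened gradient vector (the paper instead samples gradients of parameters sensitive to the batch size for efficiency).

```python
import torch

def angle_between(g_prev, g_curr, eps=1e-8):
    """Angle (radians) between two flattened gradient vectors."""
    cos = torch.dot(g_prev, g_curr) / (g_prev.norm() * g_curr.norm() + eps)
    return torch.acos(cos.clamp(-1.0, 1.0))

def train_with_dynamic_batch(model, optimizer, data_iter, loss_fn, max_accum=32):
    """Accumulate mini-batch gradients until their direction starts to
    fluctuate, then perform one optimization step (hypothetical sketch)."""
    params = [p for p in model.parameters() if p.requires_grad]
    prev_direction = None
    prev_angle = float("inf")
    steps_accumulated = 0

    for src, tgt in data_iter:
        loss = loss_fn(model(src), tgt)
        loss.backward()  # gradients accumulate in p.grad across mini-batches

        # Direction of the accumulated gradient, flattened into one vector.
        direction = torch.cat([p.grad.detach().reshape(-1) for p in params])

        fluctuating = False
        if prev_direction is not None:
            angle = angle_between(prev_direction, direction)
            # While accumulating, the angle change should gradually decrease;
            # once it grows again, the direction has started to fluctuate.
            fluctuating = angle.item() > prev_angle
            prev_angle = angle.item()
        prev_direction = direction
        steps_accumulated += 1

        if fluctuating or steps_accumulated >= max_accum:
            optimizer.step()
            optimizer.zero_grad()
            prev_direction, prev_angle = None, float("inf")
            steps_accumulated = 0
```

The effective batch size is thus not fixed in advance: it is however many mini-batches were accumulated when the fluctuation test fired, which is the dynamic behavior the abstract describes.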