Paper Title

Revisiting BFloat16 Training

Paper Authors

Pedram Zamirai, Jian Zhang, Christopher R. Aberger, Christopher De Sa

Paper Abstract

State-of-the-art generic low-precision training algorithms use a mix of 16-bit and 32-bit precision, creating the folklore that 16-bit hardware compute units alone are not enough to maximize model accuracy. As a result, deep learning accelerators are forced to support both 16-bit and 32-bit floating-point units (FPUs), which is more costly than only using 16-bit FPUs for hardware design. We ask: can we train deep learning models only with 16-bit floating-point units, while still matching the model accuracy attained by 32-bit training? Towards this end, we study 16-bit-FPU training on the widely adopted BFloat16 unit. While these units conventionally use nearest rounding to cast output to 16-bit precision, we show that nearest rounding for model weight updates often cancels small updates, which degrades the convergence and model accuracy. Motivated by this, we study two simple techniques well-established in numerical analysis, stochastic rounding and Kahan summation, to remedy the model accuracy degradation in 16-bit-FPU training. We demonstrate that these two techniques can enable up to 7% absolute validation accuracy gain in 16-bit-FPU training. This leads to 0.1% lower to 0.2% higher validation accuracy compared to 32-bit training across seven deep learning applications.
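
For intuition, the sketch below illustrates the two remedies the abstract names, stochastic rounding and Kahan summation, in plain NumPy. This is not the authors' implementation: the function names are illustrative, a float32 bit trick emulates rounding to the bfloat16 grid, and the toy demo uses float16 as a stand-in because standard NumPy has no bfloat16 dtype.

```python
import numpy as np


def stochastic_round_to_bfloat16(x, rng=None):
    """Stochastically round float32 values onto the bfloat16 grid.

    bfloat16 keeps the top 16 bits of a float32 bit pattern. Adding a
    uniform random 16-bit offset before truncating the low 16 bits rounds
    up with probability proportional to the discarded fraction, so tiny
    weight updates survive in expectation instead of being cancelled.
    (Inf/NaN handling omitted for brevity.)
    """
    rng = np.random.default_rng() if rng is None else rng
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    rounded = (bits + noise) & np.uint32(0xFFFF0000)  # drop the low 16 bits
    return rounded.view(np.float32)  # value lies exactly on the bfloat16 grid


def kahan_step(weight, update, compensation):
    """One Kahan-compensated weight update.

    `compensation` carries the low-order bits that `weight + update`
    discards, so a long run of tiny updates is not silently lost.
    """
    corrected = update - compensation                  # re-inject previously lost bits
    new_weight = weight + corrected                    # low bits may be rounded away here
    compensation = (new_weight - weight) - corrected   # recover what was rounded away
    return new_weight, compensation


if __name__ == "__main__":
    # Toy demo in float16 (a stand-in, since plain NumPy has no bfloat16):
    # adding a tiny update 1000 times with nearest rounding stalls at 1.0,
    # while the compensated update accumulates roughly the true sum.
    w_plain = np.float16(1.0)
    w_kahan, comp = np.float16(1.0), np.float16(0.0)
    tiny = np.float16(1e-4)
    for _ in range(1000):
        w_plain = w_plain + tiny                       # rounds back to 1.0 every step
        w_kahan, comp = kahan_step(w_kahan, tiny, comp)
    print(w_plain, w_kahan)                            # ~1.0 vs. ~1.1
```

The demo reproduces the failure mode the abstract describes: with nearest rounding, each small update is cancelled outright, while the compensated update preserves their accumulated effect.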
