Paper Title
Scaling Laws for Neural Language Models
Paper Authors
Paper Abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
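To make the scaling claim concrete, the power-law dependence described in the abstract can be written schematically as below. The notation follows the paper (N for model size in non-embedding parameters, D for dataset size in tokens, C for training compute, with scale constants N_c, D_c, C_c and exponents alpha_N, alpha_D, alpha_C); the numerical values of the constants and exponents are empirical fits reported in the paper and are not reproduced here.

\[
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
\]

Each form applies in the regime where the other two factors are not the bottleneck; the paper also fits a combined expression in N and D that captures the dependence of overfitting on model and dataset size mentioned in the abstract.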