Paper Title
A Closer Look at Codistillation for Distributed Training
Paper Authors
Paper Abstract
Codistillation has been proposed as a mechanism to share knowledge among concurrently trained models by encouraging them to represent the same function through an auxiliary loss. This contrasts with the more commonly used fully-synchronous data-parallel stochastic gradient descent methods, where different model replicas average their gradients (or parameters) at every iteration and thus maintain identical parameters. We investigate codistillation in a distributed training setup, complementing previous work which focused on extremely large batch sizes. Surprisingly, we find that even at moderate batch sizes, models trained with codistillation can perform as well as models trained with synchronous data-parallel methods, despite using a much weaker synchronization mechanism. These findings hold across a range of batch sizes and learning rate schedules, as well as different kinds of models and datasets. Obtaining this level of accuracy, however, requires properly accounting for the regularization effect of codistillation, which we highlight through several empirical observations. Overall, this work contributes to a better understanding of codistillation and how to best take advantage of it in a distributed computing environment.
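To make the distinction between codistillation and fully-synchronous data parallelism concrete, here is a minimal PyTorch-style sketch of the auxiliary loss described above. The exact weighting, temperature, and function names (`codistillation_loss`, `alpha`) are illustrative assumptions, not the paper's precise formulation; in practice the peer's predictions may come from a stale copy of the other replica rather than a fresh forward pass.

```python
import torch
import torch.nn.functional as F

def codistillation_loss(logits, targets, peer_logits, alpha=0.5, temperature=1.0):
    """Supervised task loss plus an auxiliary distillation term that encourages
    this replica to match a peer replica's (detached) output distribution.

    This is a sketch of the general codistillation idea; alpha and temperature
    are hypothetical knobs, not values taken from the paper.
    """
    # Standard supervised objective on the local replica's predictions.
    task_loss = F.cross_entropy(logits, targets)

    # Auxiliary term: match the peer's softened output distribution.
    # The peer's logits are treated as a fixed target (no gradient flows back),
    # so no per-step gradient averaging (all-reduce) between replicas is needed.
    peer_probs = F.softmax(peer_logits.detach() / temperature, dim=-1)
    log_probs = F.log_softmax(logits / temperature, dim=-1)
    distill_loss = F.kl_div(log_probs, peer_probs, reduction="batchmean")

    return task_loss + alpha * distill_loss


# Toy usage: two replicas' logits on the same batch of 8 examples, 10 classes.
logits_a = torch.randn(8, 10, requires_grad=True)
logits_b = torch.randn(8, 10)  # peer replica's predictions (possibly stale)
targets = torch.randint(0, 10, (8,))
loss = codistillation_loss(logits_a, targets, logits_b)
loss.backward()
```

By contrast, fully-synchronous data-parallel SGD would all-reduce the gradients (or parameters) of the replicas at every iteration, keeping them identical; codistillation only couples the replicas through this softer auxiliary loss.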