Paper Title

Scaling ResNets in the Large-depth Regime

Paper Authors

Pierre Marion, Adeline Fermanian, Gérard Biau, Jean-Philippe Vert

Paper Abstract

Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics is for $\alpha_L = \frac{1}{\sqrt{L}}$; other choices lead either to explosion or to identity mapping. This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrary to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = \frac{1}{L}$. Our analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
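
For illustration, below is a minimal PyTorch sketch of the depth-dependent scaling discussed in the abstract: a toy residual update $h_{k+1} = h_k + \alpha_L \tanh(W_{k+1} h_k)$ with $\alpha_L = L^{-\beta}$, where $\beta = 1/2$ corresponds to the $\frac{1}{\sqrt{L}}$ scaling and $\beta = 1$ to $\frac{1}{L}$. The class name, layer parameterization, and the standard i.i.d. initialization of `nn.Linear` are illustrative assumptions, not the paper's exact architecture; in particular, the $\frac{1}{L}$ regime in the paper relies on correlated initializations across layers, which this sketch does not implement.

```python
import torch
import torch.nn as nn


class ScaledResNet(nn.Module):
    """Toy residual network h_{k+1} = h_k + alpha_L * tanh(W_{k+1} h_k).

    Hypothetical parameterization used only to illustrate the scaling
    alpha_L = L^{-beta}: beta = 0.5 gives 1/sqrt(L), beta = 1.0 gives 1/L.
    """

    def __init__(self, dim: int, depth: int, scaling_exponent: float = 0.5):
        super().__init__()
        self.alpha = depth ** (-scaling_exponent)  # alpha_L = L^{-beta}
        # Standard (i.i.d.) default initialization of nn.Linear is kept here.
        self.layers = nn.ModuleList(
            [nn.Linear(dim, dim, bias=False) for _ in range(depth)]
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            h = h + self.alpha * torch.tanh(layer(h))  # scaled residual update
        return h


# Compare hidden-state magnitudes at initialization for different scalings.
for beta in (0.0, 0.5, 1.0):
    net = ScaledResNet(dim=64, depth=1000, scaling_exponent=beta)
    with torch.no_grad():
        out = net(torch.randn(8, 64))
    print(f"beta={beta}: mean output norm {out.norm(dim=1).mean().item():.2f}")
```

Running this shows the qualitative behavior described above at initialization: with no scaling ($\beta = 0$) the hidden states grow large with depth, $\beta = 1/2$ yields non-degenerate outputs, and $\beta = 1$ with i.i.d. weights leaves the input nearly unchanged (close to the identity mapping).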
