Paper Title

Flatness is a False Friend

Paper Author

Granziol, Diego

Paper Abstract

Hessian-based measures of flatness, such as the trace, Frobenius and spectral norms, have been argued, used and shown to relate to generalisation. In this paper we demonstrate that for feed-forward neural networks under the cross-entropy loss, we would expect low-loss solutions with large weights to have small Hessian-based measures of flatness. This implies that solutions obtained using $L2$ regularisation should in principle be sharper than those without, despite generalising better. We show this to be true for logistic regression, multi-layer perceptrons, simple convolutional, pre-activated and wide residual networks on the MNIST and CIFAR-$100$ datasets. Furthermore, we show that adaptive optimisation algorithms using iterate averaging, on the VGG-$16$ network and CIFAR-$100$ dataset, achieve superior generalisation to SGD but are $30 \times$ sharper. This theoretical finding, along with experimental results, raises serious questions about the validity of Hessian-based sharpness measures in the discussion of generalisation. We further show that the Hessian rank can be bounded by a constant times the number of neurons multiplied by the number of classes, which in practice is often a small fraction of the network parameters. This explains the curious observation, reported in the literature, that many Hessian eigenvalues are either zero or very near zero.
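
As a concrete illustration of the Hessian-based flatness measures named in the abstract, the sketch below estimates the Hessian trace (via Hutchinson's estimator) and the spectral norm (via power iteration) of the training loss for a toy multi-layer perceptron, using Hessian-vector products in PyTorch. This is not the paper's code; the tiny model, random data, and iteration counts are illustrative assumptions.

```python
# Minimal sketch (assumptions: toy MLP, random data, small probe counts)
# of two Hessian-based flatness measures: trace and spectral norm.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data and model: 100 samples, 20 features, 5 classes.
X = torch.randn(100, 20)
y = torch.randint(0, 5, (100,))
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 5))
loss_fn = nn.CrossEntropyLoss()

params = [p for p in model.parameters() if p.requires_grad]
n_params = sum(p.numel() for p in params)

def hvp(vec):
    """Hessian-vector product of the training loss with a flat vector."""
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    grad_v = (flat_grad * vec).sum()
    hv = torch.autograd.grad(grad_v, params)
    return torch.cat([h.reshape(-1) for h in hv]).detach()

def hutchinson_trace(n_samples=50):
    """Estimate tr(H) with Rademacher probes: E[v^T H v] = tr(H)."""
    est = 0.0
    for _ in range(n_samples):
        v = torch.randint(0, 2, (n_params,)).float() * 2 - 1
        est += torch.dot(v, hvp(v)).item()
    return est / n_samples

def spectral_norm(n_iters=30):
    """Estimate the largest-magnitude Hessian eigenvalue by power iteration."""
    v = torch.randn(n_params)
    v /= v.norm()
    eig = 0.0
    for _ in range(n_iters):
        hv = hvp(v)
        eig = torch.dot(v, hv).item()
        v = hv / (hv.norm() + 1e-12)
    return abs(eig)

print(f"estimated Hessian trace:         {hutchinson_trace():.4f}")
print(f"estimated Hessian spectral norm: {spectral_norm():.4f}")
```

Hessian-vector products avoid forming the full Hessian, whose size grows quadratically with the number of parameters; estimators of this kind are how trace and spectral-norm sharpness measures are commonly computed in practice for networks of realistic size.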
