Paper Title

The smooth output assumption, and why deep networks are better than wide ones

Paper Authors

Luis Sa-Couto, Jose Miguel Ramos, Andreas Wichert

Paper Abstract

When several models have similar training scores, classical model selection heuristics follow Occam's razor and advise choosing the one with the least capacity. Yet, modern practice with large neural networks has often led to situations where two networks with exactly the same number of parameters score similarly on the training set, but the deeper one generalizes better to unseen examples. With this in mind, it is well accepted that deep networks are superior to shallow wide ones. Theoretically, however, there is no difference between the two: both are universal approximators. In this work we propose a new unsupervised measure that predicts how well a model will generalize. We call it the output sharpness, and it is based on the fact that, in reality, boundaries between concepts are generally not sharp. We test this new measure on several neural network settings and architectures, and show that the correlation between our metric and test set performance is generally strong. Having established this measure, we give a mathematical probabilistic argument that predicts network depth to be correlated with our proposed measure. After verifying this on real data, we are able to formulate the key argument of the work: output sharpness hampers generalization; deep networks have a built-in bias against it; therefore, deep networks beat wide ones. All in all, the work not only provides a helpful predictor of overfitting that can be used in practice for model selection (or even regularization), but also provides a much-needed theoretical grounding for the success of modern deep neural networks.
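The abstract does not spell out how output sharpness is computed, so the sketch below is only an illustration under an assumption: it uses the mean entropy of a network's softmax outputs over unlabeled data as a hypothetical proxy for such an unsupervised measure (low entropy means saturated, "sharp" outputs, which the abstract's argument links to worse generalization). The helper mean_output_entropy and the two toy networks, a deep narrow one and a shallow wide one with roughly the same parameter count, are illustrative, not the paper's definitions.

```python
# Hedged sketch: mean softmax-output entropy as a stand-in for an
# unsupervised "output sharpness" proxy. NOT the paper's definition.
import torch
import torch.nn as nn

def mean_output_entropy(model: nn.Module, unlabeled_x: torch.Tensor) -> float:
    """Average entropy of softmax outputs over unlabeled data.
    Low entropy = near one-hot ("sharp") outputs; high entropy = soft
    concept boundaries, which the abstract ties to better generalization."""
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(unlabeled_x), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return entropy.mean().item()

# Two toy networks with a comparable parameter count (~11k each):
# a deep narrow stack vs. a shallow wide one.
deep = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
wide = nn.Sequential(
    nn.Linear(32, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Stand-in for real unlabeled examples; in practice the comparison is only
# meaningful on trained models (at random init both are near-uniform).
x = torch.randn(512, 32)
print("deep:", mean_output_entropy(deep, x))
print("wide:", mean_output_entropy(wide, x))
```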
