Iterative Averaging in the Quest for Best Test Error

Paper title

Iterative Averaging in the Quest for Best Test Error

Authors

Diego Granziol, Xingchen Wan, Samuel Albanie, Stephen Roberts

Abstract

We analyse and explain the increased generalisation performance of iterate averaging using a Gaussian process perturbation model between the true and batch risk surfaces on the high-dimensional quadratic. We derive three phenomena from our theoretical results: (1) the importance of combining iterate averaging (IA) with large learning rates and regularisation for improved regularisation; (2) justification for less frequent averaging; (3) that we expect adaptive gradient methods to work equally well, or better, with iterate averaging than their non-adaptive counterparts. Inspired by these results, together with empirical investigations of the importance of appropriate regularisation for the solution diversity of the iterates, we propose two adaptive algorithms with iterate averaging. These give significantly better results compared to stochastic gradient descent (SGD), require less tuning and do not require early stopping or validation set monitoring. We showcase the efficacy of our approach on the CIFAR-10/100, ImageNet and Penn Treebank datasets on a variety of modern and classical network architectures.
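To make the idea of iterate averaging concrete, the following is a minimal sketch on a toy noisy quadratic, not the authors' proposed adaptive algorithms. Plain SGD is run with a fixed learning rate, and a running mean of the iterates is updated only every few steps, in the spirit of point (2). All names and values here (`sgd_with_iterate_averaging`, `avg_start`, `avg_every`, the curvatures, the noise level) are hypothetical choices for illustration, and the Gaussian gradient noise is only a simple stand-in for minibatch noise.

```python
import random


def noisy_grad(w, curvatures, noise_std, rng):
    # Gradient of 0.5 * sum(h_i * w_i^2), plus Gaussian noise that stands in
    # for the gap between the batch and true risk (an illustrative assumption).
    return [h * wi + rng.gauss(0.0, noise_std) for h, wi in zip(curvatures, w)]


def true_loss(w, curvatures):
    # Noise-free quadratic loss with its minimum at the origin.
    return 0.5 * sum(h * wi * wi for h, wi in zip(curvatures, w))


def sgd_with_iterate_averaging(curvatures, lr=0.1, steps=5000,
                               avg_start=1000, avg_every=10,
                               noise_std=1.0, seed=0):
    rng = random.Random(seed)
    w = [1.0] * len(curvatures)
    w_avg = [0.0] * len(curvatures)
    n_avg = 0
    for t in range(steps):
        g = noisy_grad(w, curvatures, noise_std, rng)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        # Average only after a burn-in period, and only every `avg_every` steps.
        if t >= avg_start and (t - avg_start) % avg_every == 0:
            n_avg += 1
            # Incremental update of the running mean of the sampled iterates.
            w_avg = [a + (wi - a) / n_avg for a, wi in zip(w_avg, w)]
    return w, w_avg, n_avg


curvatures = [1.0, 0.5, 0.1]
w_last, w_avg, n_avg = sgd_with_iterate_averaging(curvatures)
loss_last = true_loss(w_last, curvatures)
loss_avg = true_loss(w_avg, curvatures)
print(f"averaged {n_avg} iterates; last-iterate loss {loss_last:.4f}, "
      f"averaged-iterate loss {loss_avg:.4f}")
```

In this toy setting the final iterate keeps bouncing around the minimum at a scale set by the learning rate and the noise, while the averaged iterate typically sits much closer to it. That is the basic reason averaging can tolerate a larger learning rate.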
