Paper Title
Ridge Rider: Finding Diverse Solutions by Following Eigenvectors of the Hessian
Paper Authors
Paper Abstract
Over the last decade, a single algorithm has changed many facets of our lives - Stochastic Gradient Descent (SGD). In the era of ever-decreasing loss functions, SGD and its various offspring have become the go-to optimization tool in machine learning and are a key component of the success of deep neural networks (DNNs). While SGD is guaranteed to converge to a local optimum (under loose assumptions), in some cases it may matter which local optimum is found, and this is often context-dependent. Examples frequently arise in machine learning, from shape-versus-texture features to ensemble methods and zero-shot coordination. In these settings, there are desired solutions which SGD on 'standard' loss functions will not find, since it instead converges to the 'easy' solutions. In this paper, we present a different approach. Rather than following the gradient, which corresponds to a locally greedy direction, we instead follow the eigenvectors of the Hessian, which we call "ridges". By iteratively following and branching amongst the ridges, we effectively span the loss surface to find qualitatively different solutions. We show both theoretically and experimentally that our method, called Ridge Rider (RR), offers a promising direction for a variety of challenging problems.
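The abstract describes the core mechanic only at a high level. As a rough, self-contained illustration of the idea (not the authors' implementation; the toy loss, step size, stopping rule, and branching logic below are simplifying assumptions), the following JAX sketch starts from a saddle point, follows each Hessian eigenvector with negative eigenvalue (a "ridge") while the curvature stays negative, and branches along both signs of each such direction to reach qualitatively different minima:

```python
import jax
import jax.numpy as jnp

def loss(theta):
    # Toy loss with two distinct minima at (+1, 0) and (-1, 0) and a saddle at the origin.
    x, y = theta
    return (x ** 2 - 1.0) ** 2 + 0.5 * y ** 2

grad_fn = jax.grad(loss)
hess_fn = jax.hessian(loss)

def follow_ridge(theta, direction, step=0.05, n_steps=500):
    # Descend along the Hessian eigenvector most aligned with the previous
    # direction while its eigenvalue stays negative (the "ridge"); once the
    # negative curvature disappears, finish with ordinary gradient descent.
    for _ in range(n_steps):
        eigvals, eigvecs = jnp.linalg.eigh(hess_fn(theta))
        overlaps = eigvecs.T @ direction
        i = int(jnp.argmax(jnp.abs(overlaps)))
        if eigvals[i] >= 0:
            break
        direction = jnp.sign(overlaps[i]) * eigvecs[:, i]
        theta = theta - step * direction
    for _ in range(n_steps):
        theta = theta - step * grad_fn(theta)
    return theta

# Branch at the saddle: each eigenvector with negative eigenvalue (and its
# negation) seeds a separate ridge and ends at a different solution.
saddle = jnp.zeros(2)
eigvals, eigvecs = jnp.linalg.eigh(hess_fn(saddle))
for i in range(2):
    if eigvals[i] < 0:
        for sign in (1.0, -1.0):
            print(follow_ridge(saddle, sign * eigvecs[:, i]))  # approx (-1, 0) and (+1, 0)
```

On this toy surface the two branches recover both minima, whereas plain gradient descent started at the saddle would stall or fall into only one of them; the full method in the paper adds further machinery (e.g. choosing the starting saddle and re-branching along the way) not reproduced here.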