Paper Title

Finding mixed-strategy equilibria of continuous-action games without gradients using randomized policy networks

Authors

Carlos Martin, Tuomas Sandholm

Abstract

We study the problem of computing an approximate Nash equilibrium of a continuous-action game without access to gradients. Such game access is common in reinforcement learning settings, where the environment is typically treated as a black box. To tackle this problem, we apply zeroth-order optimization techniques that combine smoothed gradient estimators with equilibrium-finding dynamics. We model players' strategies using artificial neural networks. In particular, we use randomized policy networks to model mixed strategies. These take noise in addition to an observation as input and can flexibly represent arbitrary observation-dependent, continuous-action distributions. Being able to model such mixed strategies is crucial for tackling continuous-action games that lack pure-strategy equilibria. We evaluate the performance of our method using an approximation of the Nash convergence metric from game theory, which measures how much players can benefit from unilaterally changing their strategy. We apply our method to continuous Colonel Blotto games, single-item and multi-item auctions, and a visibility game. The experiments show that our method can quickly find high-quality approximate equilibria. Furthermore, they show that the dimensionality of the input noise is crucial for performance. To our knowledge, this paper is the first to solve general continuous-action games with unrestricted mixed strategies and without any gradient information.
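To make the abstract's two main ingredients concrete, here is a minimal sketch, not the authors' actual implementation: a randomized policy network that maps an observation plus input noise to an action (so the pushforward of the noise distribution defines an observation-dependent mixed strategy), and a two-sided Gaussian-smoothing estimator that recovers an approximate payoff gradient from black-box game evaluations. All names, layer sizes, and hyperparameters (`RandomizedPolicy`, `smoothed_grad`, `sigma`, `num_samples`) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RandomizedPolicy(nn.Module):
    """Hypothetical randomized policy network (sizes are illustrative).

    Maps an observation plus fresh Gaussian noise to a continuous action;
    the induced distribution over actions is the player's mixed strategy.
    """
    def __init__(self, obs_dim, noise_dim, action_dim, hidden=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, obs):
        # Fresh noise on every call: sampling z and pushing it through the
        # network samples an action from the mixed strategy given obs.
        z = torch.randn(obs.shape[0], self.noise_dim)
        return self.net(torch.cat([obs, z], dim=-1))


def smoothed_grad(payoff, theta, sigma=0.1, num_samples=64):
    """Two-sided Gaussian-smoothing (zeroth-order) gradient estimate:

        g ~= E_u[ (payoff(theta + sigma*u) - payoff(theta - sigma*u))
                  / (2*sigma) * u ]

    where `payoff` is a black-box expected-utility estimate (e.g., Monte
    Carlo rollouts of the game) and `theta` is a flat parameter vector.
    No gradients of `payoff` are required.
    """
    g = torch.zeros_like(theta)
    for _ in range(num_samples):
        u = torch.randn_like(theta)
        g += (payoff(theta + sigma * u) - payoff(theta - sigma * u)) / (2 * sigma) * u
    return g / num_samples
```

Under these assumptions, one simple instance of the equilibrium-finding dynamics the abstract mentions is simultaneous gradient ascent: each player updates its own parameters with `smoothed_grad` applied to its own payoff estimate while the other players' strategies are held fixed for that step. The Nash convergence metric can then be approximated by training a best responder against the resulting strategy profile and measuring how much it gains.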
