Paper Title
Sample-Efficient Learning of Correlated Equilibria in Extensive-Form Games
Paper Authors
Paper Abstract
Imperfect-Information Extensive-Form Games (IIEFGs) are a prevalent model for real-world games involving imperfect information and sequential play. The Extensive-Form Correlated Equilibrium (EFCE) has been proposed as a natural solution concept for multi-player general-sum IIEFGs. However, existing algorithms for finding an EFCE require full feedback from the game, and it remains open how to efficiently learn the EFCE in the more challenging bandit feedback setting, where the game can only be learned through observations from repeated play. This paper presents the first sample-efficient algorithm for learning the EFCE from bandit feedback. We begin by proposing $K$-EFCE -- a more general definition that allows players to observe and deviate from the recommended actions $K$ times. The $K$-EFCE includes the EFCE as a special case at $K=1$, and becomes an increasingly strict notion of equilibrium as $K$ increases. We then design an uncoupled no-regret algorithm that finds an $\varepsilon$-approximate $K$-EFCE within $\widetilde{\mathcal{O}}(\max_{i}X_iA_i^{K}/\varepsilon^2)$ iterations in the full feedback setting, where $X_i$ and $A_i$ are the number of information sets and actions for the $i$-th player. Our algorithm works by minimizing a wide-range regret at each information set that takes into account all possible recommendation histories. Finally, we design a sample-based variant of our algorithm that learns an $\varepsilon$-approximate $K$-EFCE within $\widetilde{\mathcal{O}}(\max_{i}X_iA_i^{K+1}/\varepsilon^2)$ episodes of play in the bandit feedback setting. When specialized to $K=1$, this gives the first sample-efficient algorithm for learning the EFCE from bandit feedback.
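As background for the uncoupled no-regret approach the abstract describes, the sketch below shows a generic uncoupled regret-matching loop in a small two-player normal-form game. It is an illustrative, assumption-laden example rather than the paper's $K$-EFCE algorithm: every quantity (payoff matrices, action count, horizon) is a hypothetical placeholder, each player observes full payoff feedback, and only external regret is minimized, so the empirical distribution of joint play approaches a coarse correlated equilibrium. The paper instead works with extensive-form games, bandit feedback, and a wide-range regret at each information set to reach $K$-EFCE.

# Minimal sketch of uncoupled no-regret dynamics in a two-player normal-form
# game (illustrative only; NOT the paper's K-EFCE algorithm). Each player
# independently runs regret matching on its own cumulative external regrets;
# the time-averaged joint play then approximates a coarse correlated equilibrium.
import numpy as np

rng = np.random.default_rng(0)
A = 3                                   # actions per player (assumed)
payoff = [rng.uniform(0, 1, (A, A)),    # player 0's payoffs, indexed [a0, a1]
          rng.uniform(0, 1, (A, A))]    # player 1's payoffs, indexed [a0, a1]

regret = [np.zeros(A), np.zeros(A)]     # cumulative external regret per action
empirical = np.zeros((A, A))            # counts of joint action profiles played

def regret_matching(r):
    """Play proportionally to positive cumulative regret (uniform if none)."""
    pos = np.maximum(r, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(len(r), 1.0 / len(r))

T = 20000
for _ in range(T):
    # Uncoupled play: each player samples from its own regret-matching policy.
    a = [rng.choice(A, p=regret_matching(regret[i])) for i in range(2)]
    empirical[a[0], a[1]] += 1.0
    # Full-feedback regret update: compare every fixed action to the realized payoff.
    u0 = payoff[0][:, a[1]]             # player 0's payoff for each of its actions
    u1 = payoff[1][a[0], :]             # player 1's payoff for each of its actions
    regret[0] += u0 - u0[a[0]]
    regret[1] += u1 - u1[a[1]]

empirical /= T
print("Empirical joint distribution of play:\n", np.round(empirical, 3))

Minimizing a stronger notion of regret (swap regret in normal-form games, or the per-information-set wide-range regret over recommendation histories used in the paper) is what upgrades such a guarantee from coarse correlated equilibrium to (extensive-form) correlated equilibrium.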