Joint Policy Search for Multi-agent Collaboration with Imperfect Information

Paper Title

Joint Policy Search for Multi-agent Collaboration with Imperfect Information

Authors

Yuandong Tian, Qucheng Gong, Tina Jiang

Abstract

Learning good joint policies for multi-agent collaboration with imperfect information remains a fundamental challenge. For two-player zero-sum games, coordinate-ascent approaches (optimizing one agent's policy at a time, e.g., self-play) work with guarantees, but in multi-agent cooperative settings they often converge to a sub-optimal Nash equilibrium. On the other hand, directly modeling joint policy changes in imperfect-information games is nontrivial due to the complicated interplay of policies (e.g., upstream updates affect downstream state reachability). In this paper, we show that global changes of game values can be decomposed into policy changes localized at each information set, with a novel term named policy-change density. Based on this, we propose Joint Policy Search (JPS), which iteratively improves the joint policies of collaborative agents in imperfect-information games without re-evaluating the entire game. On multi-agent collaborative tabular games, JPS is proven to never worsen performance and can improve solutions provided by unilateral approaches (e.g., CFR), outperforming algorithms designed for collaborative policy learning (e.g., BAD). Furthermore, for real-world games, JPS has an online form that naturally links with gradient updates. We test it on Contract Bridge, a 4-player imperfect-information game in which a team of $2$ collaborates to compete against the other team. In its bidding phase, players bid in turn to find a good contract through a limited information channel. Built on a strong baseline agent that bids competitive bridge purely through domain-agnostic self-play, JPS improves the collaboration of team players and outperforms WBridge5, a championship-winning software, by $+0.63$ IMPs (International Matching Points) per board over 1k games, substantially better than the previous SoTA ($+0.41$ IMPs/b) under Double-Dummy evaluation.
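The claim that one-agent-at-a-time optimization can stall at a sub-optimal equilibrium in cooperative games can be illustrated with a toy example. The sketch below is not the JPS algorithm and is not taken from the paper: it assumes a hypothetical two-agent, perfect-information matrix game with a shared payoff (the `PAYOFF` table and both function names are made up for illustration) and compares coordinate ascent with an exhaustive search over joint actions.

```python
import itertools

# Hypothetical shared payoff for a two-agent cooperative matrix game.
# Rows are agent A's actions, columns are agent B's actions.
# The numbers are illustrative assumptions, not data from the paper.
PAYOFF = [
    [5, 0, 0],
    [0, 1, 0],
    [0, 0, 10],
]


def coordinate_ascent(payoff, a, b, max_rounds=100):
    """Alternately best-respond with one agent while the other is held fixed."""
    for _ in range(max_rounds):
        # Agent A picks its best action against B's current action.
        new_a = max(range(len(payoff)), key=lambda i: payoff[i][b])
        # Agent B then picks its best action against A's updated action.
        new_b = max(range(len(payoff[0])), key=lambda j: payoff[new_a][j])
        if (new_a, new_b) == (a, b):
            break  # Neither agent can improve unilaterally: an equilibrium.
        a, b = new_a, new_b
    return (a, b), payoff[a][b]


def joint_search(payoff):
    """Exhaustively search over joint actions (feasible only for tiny games)."""
    pairs = itertools.product(range(len(payoff)), range(len(payoff[0])))
    best = max(pairs, key=lambda ab: payoff[ab[0]][ab[1]])
    return best, payoff[best[0]][best[1]]


if __name__ == "__main__":
    print(coordinate_ascent(PAYOFF, 0, 0))  # stays at the starting equilibrium
    print(coordinate_ascent(PAYOFF, 1, 1))  # stays at a worse equilibrium
    print(joint_search(PAYOFF))             # finds the joint optimum
```

In this toy game every diagonal cell is an equilibrium, so unilateral updates never leave the starting cell, while changing both agents' actions together reaches the highest payoff. Exhaustive joint search does not scale to imperfect-information games, which is the gap the abstract says JPS addresses by localizing policy changes to information sets.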
