Paper title
Approximate exploitability: Learning a best response in large games
Paper authors
Paper abstract
Researchers have demonstrated that neural networks are vulnerable to adversarial examples and subtle environment changes, both of which can be viewed as forms of distribution shift. To humans, the resulting errors can look like blunders, eroding trust in these agents. In prior games research, agent evaluation has often focused on in-practice game outcomes. While valuable, such evaluation typically fails to assess robustness to worst-case outcomes. Prior research in computer poker has examined how to assess such worst-case performance, both exactly and approximately. Unfortunately, exact computation is infeasible in larger domains, and existing approximations rely on poker-specific knowledge. We introduce ISMCTS-BR, a scalable search-based deep reinforcement learning algorithm for learning a best response to an agent, thereby approximating worst-case performance. We demonstrate the technique in several two-player zero-sum games against a variety of agents, including several AlphaZero-based agents.
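To make the quantity being approximated concrete, the following is a minimal illustrative sketch (not from the paper): in a game small enough to enumerate, such as rock-paper-scissors, the worst-case performance of a fixed agent policy can be computed exactly by checking every opponent best response, and the gap to the game value is the agent's exploitability. The payoff matrix, the specific `agent_policy`, and the use of NumPy are all assumptions chosen for illustration; ISMCTS-BR targets the same quantity in games far too large for this kind of enumeration.

```python
# Illustrative sketch only: exact best-response / worst-case evaluation in a
# tiny matrix game. The agent policy and game are hypothetical examples.
import numpy as np

# Payoff matrix for player 1 in rock-paper-scissors
# (rows: player 1 plays R, P, S; columns: player 2 plays R, P, S).
# The game is zero-sum, so player 2's payoff is the negative of this.
payoffs = np.array([
    [ 0.0, -1.0,  1.0],
    [ 1.0,  0.0, -1.0],
    [-1.0,  1.0,  0.0],
])

# A fixed, slightly unbalanced policy for the agent being evaluated (player 1).
agent_policy = np.array([0.5, 0.3, 0.2])

# Player 1's expected payoff against each of player 2's pure strategies.
value_vs_pure = agent_policy @ payoffs  # shape (3,)

# The opponent's best response minimizes player 1's payoff; the resulting value
# is the agent's exact worst-case performance. Its gap to the game value
# (0 for rock-paper-scissors) is the agent's exploitability.
best_response = int(value_vs_pure.argmin())
worst_case_value = float(value_vs_pure.min())
print(f"best response action: {best_response}, worst-case value: {worst_case_value:.2f}")
# For this policy the worst-case value is -0.30; a Nash policy (1/3, 1/3, 1/3)
# would achieve 0, i.e. zero exploitability.
```

In large imperfect-information games this enumeration is intractable, which is why the paper instead learns an approximate best response with search-based deep reinforcement learning and reports the resulting value as a lower bound on exploitability.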