Paper Title

Score vs. Winrate in Score-Based Games: which Reward for Reinforcement Learning?

Authors

Luca Pasqualini, Gianluca Amato, Marco Fantozzi, Rosa Gini, Alessandro Marchetti, Carlo Metta, Francesco Morandin, Maurizio Parton

Abstract

In the last years, the DeepMind algorithm AlphaZero has become the state of the art to efficiently tackle perfect information two-player zero-sum games with a win/lose outcome. However, when the win/lose outcome is decided by a final score difference, AlphaZero may play score-suboptimal moves because all winning final positions are equivalent from the win/lose outcome perspective. This can be an issue, for instance when used for teaching, or when trying to understand whether there is a better move. Moreover, there is the theoretical quest for the perfect game. A naive approach would be training an AlphaZero-like agent to predict score differences instead of win/lose outcomes. Since the game of Go is deterministic, this should as well produce an outcome-optimal play. However, it is a folklore belief that "this does not work". In this paper, we first provide empirical evidence for this belief. We then give a theoretical interpretation of this suboptimality in a general perfect information two-player zero-sum game, where the complexity of a game like Go is replaced by the randomness of the environment. We show that an outcome-optimal policy has a different preference for uncertainty when it is winning or losing. In particular, when in a losing state, an outcome-optimal agent chooses actions leading to a higher score variance. We then posit that when approximation is involved, a deterministic game behaves like a nondeterministic game, where the score variance is modeled by how uncertain the position is. We validate this hypothesis in AlphaZero-like software with a human expert.
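To make the abstract's variance-preference claim concrete, here is a minimal sketch (not from the paper) that models an action's final score difference as a Gaussian random variable; the function name `win_probability` and all numbers are illustrative assumptions only.

```python
import math

def win_probability(mean, std):
    """P(final score difference > 0) under a toy Gaussian score model."""
    # Survival function of a normal distribution evaluated at 0.
    return 0.5 * (1.0 - math.erf((0.0 - mean) / (std * math.sqrt(2.0))))

# Two hypothetical actions from a *losing* position: both have the same
# expected score difference (-3 points), but different score variance.
safe_action  = win_probability(mean=-3.0, std=1.0)   # low-variance play
risky_action = win_probability(mean=-3.0, std=6.0)   # high-variance play

print(f"P(win | safe action, losing)  = {safe_action:.4f}")   # ~0.0013
print(f"P(win | risky action, losing) = {risky_action:.4f}")  # ~0.3085

# From a winning position (+3 points expected) the preference flips:
print(f"P(win | safe action, winning)  = {win_probability(3.0, 1.0):.4f}")  # ~0.9987
print(f"P(win | risky action, winning) = {win_probability(3.0, 6.0):.4f}")  # ~0.6915
```

With a negative expected score, the higher-variance action wins far more often, while with a positive expected score the low-variance action is preferable. This mirrors the paper's point that an outcome-optimal agent gambles when behind and plays safe when ahead, which is exactly where win/lose and score-difference objectives pull in different directions.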
