Paper Title

Score vs. Winrate in Score-Based Games: which Reward for Reinforcement Learning?

Authors

Luca Pasqualini, Gianluca Amato, Marco Fantozzi, Rosa Gini, Alessandro Marchetti, Carlo Metta, Francesco Morandin, Maurizio Parton

Abstract

In the last years, the DeepMind algorithm AlphaZero has become the state of the art to efficiently tackle perfect information two-player zero-sum games with a win/lose outcome. However, when the win/lose outcome is decided by a final score difference, AlphaZero may play score-suboptimal moves because all winning final positions are equivalent from the win/lose outcome perspective. This can be an issue, for instance when used for teaching, or when trying to understand whether there is a better move. Moreover, there is the theoretical quest for the perfect game. A naive approach would be training an AlphaZero-like agent to predict score differences instead of win/lose outcomes. Since the game of Go is deterministic, this should as well produce an outcome-optimal play. However, it is a folklore belief that "this does not work". In this paper, we first provide empirical evidence for this belief. We then give a theoretical interpretation of this suboptimality in a general perfect information two-player zero-sum game, where the complexity of a game like Go is replaced by the randomness of the environment. We show that an outcome-optimal policy has a different preference for uncertainty when it is winning or losing. In particular, when in a losing state, an outcome-optimal agent chooses actions leading to a higher score variance. We then posit that when approximation is involved, a deterministic game behaves like a nondeterministic game, where the score variance is modeled by how uncertain the position is. We validate this hypothesis in AlphaZero-like software with a human expert.
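To make the abstract's variance-preference claim concrete, here is a minimal sketch (not from the paper) that models an action's final score difference as a Gaussian random variable; the function name `win_probability` and all numbers are illustrative assumptions only.

```python
import math

def win_probability(mean, std):
    """P(final score difference > 0) under a toy Gaussian score model."""
    # Survival function of a normal distribution evaluated at 0.
    return 0.5 * (1.0 - math.erf((0.0 - mean) / (std * math.sqrt(2.0))))

# Two hypothetical actions from a *losing* position: both have the same
# expected score difference (-3 points), but different score variance.
safe_action  = win_probability(mean=-3.0, std=1.0)   # low-variance play
risky_action = win_probability(mean=-3.0, std=6.0)   # high-variance play

print(f"P(win | safe action, losing)  = {safe_action:.4f}")   # ~0.0013
print(f"P(win | risky action, losing) = {risky_action:.4f}")  # ~0.3085

# From a winning position (+3 points expected) the preference flips:
print(f"P(win | safe action, winning)  = {win_probability(3.0, 1.0):.4f}")  # ~0.9987
print(f"P(win | risky action, winning) = {win_probability(3.0, 6.0):.4f}")  # ~0.6915
```

With a negative expected score, the higher-variance action wins far more often, while with a positive expected score the low-variance action is preferable. This mirrors the paper's point that an outcome-optimal agent gambles when behind and plays safe when ahead, which is exactly where win/lose and score-difference objectives pull in different directions.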
