Paper Title

Restless Multi-Armed Bandits under Exogenous Global Markov Process

Authors

Tomer Gafni, Michal Yemini, Kobi Cohen

Abstract

We consider an extension to the restless multi-armed bandit (RMAB) problem with unknown arm dynamics, where an unknown exogenous global Markov process governs the rewards distribution of each arm. Under each global state, the rewards process of each arm evolves according to an unknown Markovian rule, which is non-identical among different arms. At each time, a player chooses an arm out of N arms to play, and receives a random reward from a finite set of reward states. The arms are restless, that is, their local state evolves regardless of the player's actions. Motivated by recent studies on related RMAB settings, the regret is defined as the reward loss with respect to a player that knows the dynamics of the problem, and plays at each time t the arm that maximizes the expected immediate value. The objective is to develop an arm-selection policy that minimizes the regret. To that end, we develop the Learning under Exogenous Markov Process (LEMP) algorithm. We analyze LEMP theoretically and establish a finite-sample bound on the regret. We show that LEMP achieves a logarithmic regret order with time. We further analyze LEMP numerically and present simulation results that support the theoretical findings and demonstrate that LEMP significantly outperforms alternative algorithms.
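To make the setting above concrete, the short Python sketch below simulates a toy instance of it: an exogenous global Markov chain modulates the local (restless) reward chain of every arm, and the genie baseline behind the regret definition picks, at each step, the arm with the highest expected immediate reward. Everything in the sketch (the number of arms and states, the transition matrices, and the uniform-random comparison policy) is an illustrative assumption, not taken from the paper, and it does not implement the LEMP algorithm itself.

```python
# Illustrative simulation of the abstract's setting: an exogenous global
# Markov chain modulates the restless local reward chains of N arms.
# All sizes and transition matrices below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

N = 3  # number of arms (assumed)
G = 2  # number of global states (assumed)
S = 2  # number of local reward states per arm (assumed)

# Exogenous global-state transition matrix (independent of the player's actions).
P_global = np.array([[0.9, 0.1],
                     [0.2, 0.8]])

# For each arm n and global state g, a local transition matrix P_local[n, g].
# Arms are restless: every local chain advances at every time step.
P_local = rng.dirichlet(np.ones(S), size=(N, G, S))  # shape (N, G, S, S)

# Reward attached to each local state (assumed: the state index itself).
reward_of_state = np.arange(S, dtype=float)

def step(global_state, local_states):
    """Advance the global chain and every arm's local chain by one step."""
    g_next = rng.choice(G, p=P_global[global_state])
    locals_next = np.array([
        rng.choice(S, p=P_local[n, global_state, local_states[n]])
        for n in range(N)
    ])
    return g_next, locals_next

def genie_arm(global_state, local_states):
    """Arm maximizing the expected immediate reward, given full knowledge
    of the dynamics and the current global and local states."""
    expected = [
        P_local[n, global_state, local_states[n]] @ reward_of_state
        for n in range(N)
    ]
    return int(np.argmax(expected))

# Accumulate the reward loss of a naive uniform-random policy versus the genie.
g, s = 0, np.zeros(N, dtype=int)
loss = 0.0
for t in range(10_000):
    best = genie_arm(g, s)
    played = rng.integers(N)
    g, s = step(g, s)  # restless: all arms move, regardless of the action
    loss += reward_of_state[s[best]] - reward_of_state[s[played]]
print(f"cumulative reward loss vs. genie after 10k steps: {loss:.1f}")
```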
