Paper Title
Finite-Time Analysis of Round-Robin Kullback-Leibler Upper Confidence Bounds for Optimal Adaptive Allocation with Multiple Plays and Markovian Rewards
Paper Authors
Paper Abstract
We study an extension of the classic stochastic multi-armed bandit problem that involves multiple plays and Markovian rewards in the rested bandits setting. To tackle this problem, we consider an adaptive allocation rule which, at each stage, combines the information from the sample means of all the arms with the Kullback-Leibler upper confidence bound of a single arm selected in a round-robin way. For rewards generated from a one-parameter exponential family of Markov chains, we provide a finite-time upper bound for the regret incurred by this adaptive allocation rule, which reveals the logarithmic dependence of the regret on the time horizon and which is asymptotically optimal. For our analysis we devise several concentration results for Markov chains, including a maximal inequality for Markov chains, that may be of interest in their own right. As a byproduct of our analysis we also establish asymptotically optimal, finite-time guarantees for the case of multiple plays and i.i.d. rewards drawn from a one-parameter exponential family of probability densities. Additionally, we provide simulation results illustrating that calculating Kullback-Leibler upper confidence bounds in a round-robin way is significantly more efficient than calculating them for every arm at each round, while the expected regrets of the two approaches behave similarly.
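To make the allocation rule concrete, the following is a minimal Python sketch of the round-robin KL-UCB idea described in the abstract. It is an illustrative assumption, not the paper's exact specification: i.i.d. Bernoulli arms stand in for the paper's exponential family of Markov chains, and log(t) is used as a generic exploration threshold. All function names (kl_bernoulli, kl_ucb_index, round_robin_kl_ucb) are hypothetical.

import math
import random


def kl_bernoulli(p, q, eps=1e-12):
    # Bernoulli KL divergence d(p, q): the simplest one-parameter
    # exponential family. (The paper treats exponential families of
    # Markov chains, which this sketch does not model.)
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_ucb_index(mean, pulls, t, tol=1e-6):
    # Largest q in [mean, 1] with pulls * d(mean, q) <= log(t),
    # found by bisection; log(t) is an assumed, generic threshold.
    target = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(mean, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo


def round_robin_kl_ucb(true_means, horizon, m=1, seed=0):
    # At each round, only one arm (cycled round-robin) is scored with
    # its KL-UCB index; all other arms are scored by their sample
    # means, and the m highest-scoring arms are played.
    rng = random.Random(seed)
    k = len(true_means)
    pulls, means = [0] * k, [0.0] * k

    def play(a):
        r = 1.0 if rng.random() < true_means[a] else 0.0
        pulls[a] += 1
        means[a] += (r - means[a]) / pulls[a]  # running sample mean

    for a in range(k):  # initialization: one pull per arm
        play(a)
    for t in range(k + 1, horizon + 1):
        rr = t % k  # the single arm whose KL-UCB is refreshed this round
        scores = list(means)
        scores[rr] = kl_ucb_index(means[rr], pulls[rr], t)
        top_m = sorted(range(k), key=scores.__getitem__, reverse=True)[:m]
        for a in top_m:
            play(a)
    return pulls  # pull counts should concentrate on the m best arms


if __name__ == "__main__":
    print(round_robin_kl_ucb([0.2, 0.5, 0.7, 0.9], horizon=5000, m=2))

Scoring a single arm per round replaces K bisection searches with one, which is the source of the computational savings that the abstract's simulations compare against computing every arm's upper confidence bound at each round.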