Paper Title

Near-optimality for infinite-horizon restless bandits with many arms

Authors

Xiangyu Zhang, Peter I. Frazier

Abstract

Restless bandits are an important class of problems with applications in recommender systems, active learning, revenue management, and other areas. We consider infinite-horizon discounted restless bandits with many arms, where a fixed proportion of arms may be pulled in each period and where arms share a finite state space. Although an average-case-optimal policy can be computed via stochastic dynamic programming, the computation required grows exponentially with the number of arms $N$. Thus, it is important to find scalable policies that can be computed efficiently for large $N$ and that are near-optimal in this regime, in the sense that the optimality gap (i.e., the loss of expected performance against an optimal policy) per arm vanishes for large $N$. However, the most popular approach, the Whittle index, requires a hard-to-verify indexability condition to be well-defined and another hard-to-verify condition to guarantee an $o(N)$ optimality gap. We present a method resolving these difficulties. By replacing the global Lagrange multiplier used by the Whittle index with a sequence of Lagrange multipliers, one per time period up to a finite truncation point, we derive a class of policies, called fluid-balance policies, that have an $O(\sqrt{N})$ optimality gap. Unlike the Whittle index, fluid-balance policies do not require indexability to be well-defined, and their $O(\sqrt{N})$ optimality gap bound holds universally, without any additional sufficient conditions. We also demonstrate empirically that fluid-balance policies provide state-of-the-art performance on specific problems.
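The two scaling claims in the abstract can be illustrated with a little arithmetic. The Python sketch below is not taken from the paper. It assumes that each arm has `num_states` states, so that a naive joint formulation has `num_states ** num_arms` states, and it treats the constant in the $O(\sqrt{N})$ bound as a hypothetical placeholder `constant`. The function names are likewise illustrative.

```python
import math


def joint_state_count(num_states: int, num_arms: int) -> int:
    """Joint states of a naive formulation tracking every arm separately."""
    return num_states ** num_arms


def per_arm_gap_bound(num_arms: int, constant: float = 1.0) -> float:
    """Per-arm bound implied by a total optimality gap of at most
    constant * sqrt(num_arms). `constant` is a hypothetical placeholder."""
    return constant * math.sqrt(num_arms) / num_arms


if __name__ == "__main__":
    for n in (10, 100, 1000):
        digits = len(str(joint_state_count(3, n)))
        print(f"N={n}: joint state count has {digits} digits, "
              f"per-arm gap bound = {per_arm_gap_bound(n):.4f}")
```

Under these assumptions the joint state count grows exponentially in $N$, while the per-arm bound behaves like `constant / sqrt(N)` and shrinks toward zero. This is the sense in which an $O(\sqrt{N})$ total gap gives near-optimality per arm.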
