Near-optimality for infinite-horizon restless bandits with many arms

Paper title

Near-optimality for infinite-horizon restless bandits with many arms

Authors

Zhang, Xiangyu; Frazier, Peter I.

Abstract

Restless bandits are an important class of problems with applications in recommender systems, active learning, revenue management, and other areas. We consider infinite-horizon discounted restless bandits with many arms, where a fixed proportion of arms may be pulled in each period and where arms share a finite state space. Although an average-case-optimal policy can be computed via stochastic dynamic programming, the computation required grows exponentially with the number of arms $N$. Thus, it is important to find scalable policies that can be computed efficiently for large $N$ and that are near optimal in this regime, in the sense that the optimality gap (i.e., the loss of expected performance against an optimal policy) per arm vanishes for large $N$. However, the most popular approach, the Whittle index, requires a hard-to-verify indexability condition to be well-defined and another hard-to-verify condition to guarantee an $o(N)$ optimality gap. We present a method resolving these difficulties. By replacing the global Lagrange multiplier used by the Whittle index with a sequence of Lagrange multipliers, one per time period up to a finite truncation point, we derive a class of policies, called fluid-balance policies, that have an $O(\sqrt{N})$ optimality gap. Unlike the Whittle index, fluid-balance policies do not require indexability to be well-defined, and their $O(\sqrt{N})$ optimality gap bound holds universally, without sufficient conditions. We also demonstrate empirically that fluid-balance policies provide state-of-the-art performance on specific problems.
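
The scaling claims in the abstract can be restated as a short worked inequality. This is only an illustrative sketch: the symbols $V^{*}_N$ (expected discounted reward of an optimal policy with $N$ arms), $V^{\pi}_N$ (the same quantity for a policy $\pi$), and the constant $C$ are notation assumed here and do not appear in the abstract itself.

```latex
\[
\mathrm{gap}_N(\pi) \;=\; V^{*}_N - V^{\pi}_N ,
\qquad
\mathrm{gap}_N(\pi) \le C\sqrt{N}
\;\;\Longrightarrow\;\;
\frac{\mathrm{gap}_N(\pi)}{N} \;\le\; \frac{C}{\sqrt{N}}
\;\xrightarrow[N \to \infty]{}\; 0 .
\]
```

Under this reading, an $O(\sqrt{N})$ bound on the total gap implies the $o(N)$ property, so the per-arm gap vanishes for large $N$ at a rate of at most $C/\sqrt{N}$.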
