Paper Title
Optimistic Whittle Index Policy: Online Learning for Restless Bandits
Paper Authors
Paper Abstract
Restless multi-armed bandits (RMABs) extend multi-armed bandits to allow for stateful arms, where the state of each arm evolves restlessly with different transitions depending on whether that arm is pulled. Solving RMABs requires information on transition dynamics, which are often unknown upfront. To plan in RMAB settings with unknown transitions, we propose the first online learning algorithm based on the Whittle index policy, using an upper confidence bound (UCB) approach to learn transition dynamics. Specifically, we estimate confidence bounds of the transition probabilities and formulate a bilinear program to compute optimistic Whittle indices using these estimates. Our algorithm, UCWhittle, achieves sublinear $O(H \sqrt{T \log T})$ frequentist regret to solve RMABs with unknown transitions in $T$ episodes with a constant horizon $H$. Empirically, we demonstrate that UCWhittle leverages the structure of RMABs and the Whittle index policy solution to achieve better performance than existing online learning baselines across three domains, including one constructed from a real-world maternal and childcare dataset.
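The abstract names two computational ingredients: UCB-style confidence bounds on each arm's transition probabilities, and Whittle indices computed from those estimates. The sketch below is a minimal illustration of both, not the paper's UCWhittle algorithm: the function names (`transition_confidence_set`, `whittle_index`), the Hoeffding-style confidence radius, and the discounted infinite-horizon value iteration (standing in for the paper's finite-horizon setting) are all assumptions, and the bilinear program that couples the confidence set with the index computation is not reproduced here.

```python
import numpy as np

def transition_confidence_set(counts, t):
    """UCB-style confidence set for one arm's transition probabilities.

    counts[a, s, s'] is the number of observed s -> s' transitions under
    action a (0 = passive, 1 = active).  The Hoeffding-style radius is
    illustrative; the paper's exact constants may differ.
    """
    n = counts.sum(axis=-1, keepdims=True)        # visits to each (a, s)
    p_hat = counts / np.maximum(n, 1)             # empirical P(s' | s, a)
    radius = np.sqrt(np.log(t + 1) / np.maximum(n, 1))
    return p_hat, radius

def whittle_index(P, R, state, gamma=0.95, lo=-1.0, hi=1.0, tol=1e-4):
    """Whittle index of `state` via binary search on the passive subsidy.

    P[a] is the |S| x |S| transition matrix under action a, R[s] the
    state reward.  Discounting is an assumed simplification of the
    paper's constant-horizon episodes.
    """
    def q_values(subsidy, iters=300):
        V = np.zeros(len(R))
        for _ in range(iters):
            q_passive = R + subsidy + gamma * P[0] @ V
            q_active = R + gamma * P[1] @ V
            V = np.maximum(q_passive, q_active)
        return q_passive, q_active

    # The index is the subsidy at which the arm is indifferent between
    # being pulled and staying passive in `state`.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        q_passive, q_active = q_values(mid)
        if q_active[state] > q_passive[state]:
            lo = mid    # pulling still preferred: true index is larger
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Per the abstract, UCWhittle's optimism would come from choosing, within these confidence sets, the transition probabilities that maximize the resulting index (via a bilinear program); the sketch above computes the index only for a fixed transition matrix.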