论文标题
计算有限亨土匪的经典索引
Computing a classic index for finite-horizon bandits
论文作者
论文摘要
本文考虑了有限的Horizon离散状态匪徒对Gittins指数的有效精确计算,该匪徒对每个初始状态进行了测量,这是平均生产率,由预期的总折扣总奖励与预期的总折现时间的最大比率给出,这可以通过给定的地平线连续停止。除了表征有限马武器单臂匪徒问题的最佳政策外,此类索引还为棘手的有限有限的多型匪徒匪徒问题提供了次优的启发式索引规则,代表了Gittins索引规则的自然扩展(在Infinite-Horizon案例中是最佳的)。尽管在1950年代经典作品中引入了这样的有限马指数,但对其有效的精确计算的调查引起了很少的关注。本文仅使用算术操作介绍了递归自适应元素算法,该操作在问题参数(项目状态的数量和时间范围的长度)中计算(伪 - )多项式时间中的索引。在每个状态过渡有限的项目的特殊情况下,复杂性要么降低或仅取决于时间范围的长度。在针对常规校准方法的计算研究中,对所提出的算法进行了基准测试。
This paper considers the efficient exact computation of the counterpart of the Gittins index for a finite-horizon discrete-state bandit, which measures for each initial state the average productivity, given by the maximum ratio of expected total discounted reward earned to expected total discounted time expended that can be achieved through a number of successive plays stopping by the given horizon. Besides characterizing optimal policies for the finite-horizon one-armed bandit problem, such an index provides a suboptimal heuristic index rule for the intractable finite-horizon multiarmed bandit problem, which represents the natural extension of the Gittins index rule (optimal in the infinite-horizon case). Although such a finite-horizon index was introduced in classic work in the 1950s, investigation of its efficient exact computation has received scant attention. This paper introduces a recursive adaptive-greedy algorithm using only arithmetic operations that computes the index in (pseudo-)polynomial time in the problem parameters (number of project states and time horizon length). In the special case of a project with limited transitions per state, the complexity is either reduced or depends only on the length of the time horizon. The proposed algorithm is benchmarked in a computational study against the conventional calibration method.