Paper Title
An Asymptotically Optimal Batched Algorithm for the Dueling Bandit Problem
Paper Authors
Paper Abstract
We study the $K$-armed dueling bandit problem, a variation of the traditional multi-armed bandit problem in which feedback is obtained in the form of pairwise comparisons. Previous learning algorithms have focused on the $\textit{fully adaptive}$ setting, where the algorithm can make updates after every comparison. The "batched" dueling bandit problem is motivated by large-scale applications like web search ranking and recommendation systems, where performing sequential updates may be infeasible. In this work, we ask: $\textit{is there a solution using only a few adaptive rounds that matches the asymptotic regret bounds of the best sequential algorithms for $K$-armed dueling bandits?}$ We answer this in the affirmative $\textit{under the Condorcet condition}$, a standard setting of the $K$-armed dueling bandit problem. We obtain asymptotic regret of $O(K^2\log^2(K)) + O(K\log(T))$ in $O(\log(T))$ rounds, where $T$ is the time horizon. Our regret bounds nearly match the best regret bounds known in the fully sequential setting under the Condorcet condition. Finally, in computational experiments over a variety of real-world datasets, we observe that our algorithm using $O(\log(T))$ rounds achieves almost the same performance as fully sequential algorithms (that use $T$ rounds).
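To make the batched setting concrete, here is a minimal, hypothetical sketch of a batched successive-elimination scheme for dueling bandits. It is not the paper's algorithm: the function name, elimination rule, and confidence bound are illustrative assumptions. All pairwise comparisons for a batch are scheduled up front, and updates (eliminations) happen only between batches; doubling the per-pair budget each batch keeps the number of adaptive rounds logarithmic in the horizon.

```python
import math
import random

def batched_dueling_bandits(prefs, horizon, delta=0.05, seed=0):
    """Hypothetical batched elimination sketch (not the paper's algorithm).

    prefs[i][j] = probability that arm i beats arm j in a duel
    (prefs[i][j] + prefs[j][i] = 1). Comparisons are scheduled in
    batches; the set of active arms is updated only between batches.
    Returns (active_arms, number_of_batches).
    """
    rng = random.Random(seed)
    K = len(prefs)
    active = list(range(K))
    wins = [[0] * K for _ in range(K)]
    plays = [[0] * K for _ in range(K)]
    used = 0
    batches = 0
    budget = 1  # comparisons per active pair; doubled each batch -> O(log T) rounds
    while used < horizon and len(active) > 1:
        # Non-adaptive phase: run all scheduled comparisons for this batch.
        for a in active:
            for b in active:
                if a < b:
                    for _ in range(budget):
                        if used >= horizon:
                            break
                        if rng.random() < prefs[a][b]:
                            wins[a][b] += 1
                        else:
                            wins[b][a] += 1
                        plays[a][b] += 1
                        plays[b][a] += 1
                        used += 1

        # Adaptive phase (between batches): drop any arm that some active
        # arm beats with confidence, via a Hoeffding-style bound.
        def beats(a, b):
            n = plays[a][b]
            if n == 0:
                return False
            p_hat = wins[a][b] / n
            conf = math.sqrt(math.log(2 * K * K / delta) / (2 * n))
            return p_hat - conf > 0.5

        active = [b for b in active
                  if not any(beats(a, b) for a in active if a != b)]
        budget *= 2
        batches += 1
    return active, batches
```

Under the Condorcet condition there is an arm that beats every other arm in expectation; with high probability that arm is never confidently beaten, so it survives all eliminations. For example, with three arms where arm 0 beats each rival with probability 0.9, the sketch retains arm 0 after a handful of batches.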