Paper Title
An Asymptotically Optimal Batched Algorithm for the Dueling Bandit Problem
Paper Authors
Paper Abstract
We study the $K$-armed dueling bandit problem, a variation of the traditional multi-armed bandit problem in which feedback is obtained in the form of pairwise comparisons. Previous learning algorithms have focused on the $\textit{fully adaptive}$ setting, where the algorithm can make updates after every comparison. The "batched" dueling bandit problem is motivated by large-scale applications like web search ranking and recommendation systems, where performing sequential updates may be infeasible. In this work, we ask: $\textit{is there a solution using only a few adaptive rounds that matches the asymptotic regret bounds of the best sequential algorithms for $K$-armed dueling bandits?}$ We answer this in the affirmative $\textit{under the Condorcet condition}$, a standard setting of the $K$-armed dueling bandit problem. We obtain asymptotic regret of $O(K^2\log^2(K)) + O(K\log(T))$ in $O(\log(T))$ rounds, where $T$ is the time horizon. Our regret bounds nearly match the best regret bounds known in the fully sequential setting under the Condorcet condition. Finally, in computational experiments over a variety of real-world datasets, we observe that our algorithm using $O(\log(T))$ rounds achieves almost the same performance as fully sequential algorithms (that use $T$ rounds).
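To make the batched setting concrete, here is a minimal, hypothetical sketch of a batched successive-elimination scheme for dueling bandits. It is not the paper's algorithm: the function name, elimination rule, and confidence bound are illustrative assumptions. All pairwise comparisons for a batch are scheduled up front, and updates (eliminations) happen only between batches; doubling the per-pair budget each batch keeps the number of adaptive rounds logarithmic in the horizon.

```python
import math
import random

def batched_dueling_bandits(prefs, horizon, delta=0.05, seed=0):
    """Hypothetical batched elimination sketch (not the paper's algorithm).

    prefs[i][j] = probability that arm i beats arm j in a duel
    (prefs[i][j] + prefs[j][i] = 1). Comparisons are scheduled in
    batches; the set of active arms is updated only between batches.
    Returns (active_arms, number_of_batches).
    """
    rng = random.Random(seed)
    K = len(prefs)
    active = list(range(K))
    wins = [[0] * K for _ in range(K)]
    plays = [[0] * K for _ in range(K)]
    used = 0
    batches = 0
    budget = 1  # comparisons per active pair; doubled each batch -> O(log T) rounds
    while used < horizon and len(active) > 1:
        # Non-adaptive phase: run all scheduled comparisons for this batch.
        for a in active:
            for b in active:
                if a < b:
                    for _ in range(budget):
                        if used >= horizon:
                            break
                        if rng.random() < prefs[a][b]:
                            wins[a][b] += 1
                        else:
                            wins[b][a] += 1
                        plays[a][b] += 1
                        plays[b][a] += 1
                        used += 1

        # Adaptive phase (between batches): drop any arm that some active
        # arm beats with confidence, via a Hoeffding-style bound.
        def beats(a, b):
            n = plays[a][b]
            if n == 0:
                return False
            p_hat = wins[a][b] / n
            conf = math.sqrt(math.log(2 * K * K / delta) / (2 * n))
            return p_hat - conf > 0.5

        active = [b for b in active
                  if not any(beats(a, b) for a in active if a != b)]
        budget *= 2
        batches += 1
    return active, batches
```

Under the Condorcet condition there is an arm that beats every other arm in expectation; with high probability that arm is never confidently beaten, so it survives all eliminations. For example, with three arms where arm 0 beats each rival with probability 0.9, the sketch retains arm 0 after a handful of batches.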