Paper Title
Improved Optimistic Algorithms for Logistic Bandits
Paper Authors
Paper Abstract
The generalized linear bandit framework has attracted a lot of attention in recent years by extending the well-understood linear setting and making it possible to model richer reward structures. It notably covers the logistic model, widely used when rewards are binary. For logistic bandits, the frequentist regret guarantees of existing algorithms are $\tilde{\mathcal{O}}(\kappa\sqrt{T})$, where $\kappa$ is a problem-dependent constant. Unfortunately, $\kappa$ can be arbitrarily large, as it scales exponentially with the size of the decision set. This may lead to significantly loose regret bounds and poor empirical performance. In this work, we study the logistic bandit with a focus on the prohibitive dependencies introduced by $\kappa$. We propose a new optimistic algorithm based on a finer examination of the non-linearities of the reward function. We show that it enjoys a $\tilde{\mathcal{O}}(\sqrt{T})$ regret with no dependency on $\kappa$ except in a second-order term. Our analysis is based on a new tail inequality for self-normalized martingales, of independent interest.
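For context, a minimal sketch of the standard logistic-bandit setup (these definitions are conventional and are not spelled out in the abstract itself): at each round $t$ the learner selects an arm $x_t \in \mathcal{X} \subset \mathbb{R}^d$ and observes a binary reward $r_t \sim \mathrm{Bernoulli}(\mu(x_t^\top \theta_*))$, where $\mu(z) = (1 + e^{-z})^{-1}$ is the logistic link and $\theta_*$ is the unknown parameter. The problem-dependent constant is typically defined as $\kappa = \sup_{x \in \mathcal{X}} 1 / \dot\mu(x^\top \theta_*)$ with $\dot\mu(z) = \mu(z)(1 - \mu(z))$. Since $\dot\mu(z)$ decays like $e^{-|z|}$, $\kappa$ can grow exponentially with $S = \sup_{x \in \mathcal{X}} |x^\top \theta_*|$, which is the exponential scaling with the size of the decision set mentioned above.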