Paper Title
Double Doubly Robust Thompson Sampling for Generalized Linear Contextual Bandits
Paper Authors
Paper Abstract
We propose a novel contextual bandit algorithm for generalized linear rewards with an $\tilde{O}(\sqrt{\kappa^{-1} \phi T})$ regret over $T$ rounds, where $\phi$ is the minimum eigenvalue of the covariance of contexts and $\kappa$ is a lower bound on the variance of rewards. In several practical cases where $\phi = O(d)$, our result is the first regret bound for generalized linear model (GLM) bandits of order $\sqrt{d}$ that does not rely on the approach of Auer [2002]. We achieve this bound using a novel estimator called the double doubly robust (DDR) estimator, a subclass of doubly robust (DR) estimators with a tighter error bound. The approach of Auer [2002] achieves independence by discarding the observed rewards, whereas our algorithm achieves independence while considering all contexts, using our DDR estimator. We also provide an $O(\kappa^{-1} \phi \log(NT) \log T)$ regret bound for $N$ arms under a probabilistic margin condition. Regret bounds under the margin condition are given by Bastani and Bayati [2020] and Bastani et al. [2021] in the setting where contexts are common to all arms but coefficients are arm-specific. When contexts differ across arms but coefficients are common, ours is the first regret bound under the margin condition for linear models or GLMs. We conduct empirical studies using synthetic data and real examples, demonstrating the effectiveness of our algorithm.
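For context, a generic doubly robust pseudo-reward commonly used in contextual bandit analyses can illustrate the idea behind DR-type estimators; this is a sketch only, not the paper's DDR construction (which the abstract does not spell out), and the symbols $\pi_t(a)$, $\mu$, and $\hat{\beta}_t$ are introduced here for illustration:
$$\tilde{r}_{t,a} = \left(1 - \frac{\mathbb{1}\{a_t = a\}}{\pi_t(a)}\right)\mu\!\left(x_{t,a}^{\top}\hat{\beta}_t\right) + \frac{\mathbb{1}\{a_t = a\}}{\pi_t(a)}\, r_t,$$
where $\pi_t(a)$ is the probability that the sampling rule selects arm $a$ at round $t$, $\mu$ is the GLM mean function, $x_{t,a}$ is the context of arm $a$, and $\hat{\beta}_t$ is the current coefficient estimate. Because a pseudo-reward of this form is defined for every arm, not only the chosen one, all observed contexts can contribute to estimation; the DDR estimator described in the abstract refines this DR idea to obtain the tighter error bound claimed above.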