大自然腐败的土匪：遗憾和强大的乐观算法的下限

论文标题

大自然腐败的土匪：遗憾和强大的乐观算法的下限

Bandits Corrupted by Nature: Lower Bounds on Regret and Robust Optimistic Algorithm

论文作者

Basu, Debabrota, Maillard, Odalric-Ambrym, Mathieu, Timothée

论文摘要

我们研究了损坏的强盗问题，即带有$ k $未知奖励分布的随机多武器匪徒问题，这些问题是由历史无关的对手或自然而重尾和腐败的。具体而言，通过弹奏手臂获得的奖励来自（0.5,1] $的概率$ 1- \ VAREPSILON \以及[0.5,1] $的概率$ 1- \ VAREPSILON \以及无限制支持的任意腐败分布，并具有概率$ \ varepsilon \ in [0,0.5）$。首先，我们提供$ \ textit {与任何损坏的强盗算法的遗憾有关的问题依赖的下限。下限表明，损坏的匪徒问题比次高斯或重尾奖励的经典随机匪徒问题要困难。随后，我们为损坏的土匪提出了一种新颖的UCB型算法，即Hubucb，这是基于Huber的估计器来实现强大的平均估计。利用新的Huber估计量不平等，我们证明了Hubucb实现了近乎最理想的遗憾上限。由于计算Huber的估计器具有二次复杂性，因此我们进一步介绍了Huber估算器的顺序版本，该版本表现出线性复杂性。我们利用这种顺序估计器来设计seqhubucb，在减轻计算负担的同时，它具有类似的遗憾。最后，我们在实验上说明了Hubucb和Seqhubucb在解决损坏的土匪方面的效率，以解决不同的奖励分布和不同水平的腐败。

We study the corrupted bandit problem, i.e. a stochastic multi-armed bandit problem with $k$ unknown reward distributions, which are heavy-tailed and corrupted by a history-independent adversary or Nature. To be specific, the reward obtained by playing an arm comes from corresponding heavy-tailed reward distribution with probability $1-\varepsilon \in (0.5,1]$ and an arbitrary corruption distribution of unbounded support with probability $\varepsilon \in [0,0.5)$. First, we provide $\textit{a problem-dependent lower bound on the regret}$ of any corrupted bandit algorithm. The lower bounds indicate that the corrupted bandit problem is harder than the classical stochastic bandit problem with sub-Gaussian or heavy-tail rewards. Following that, we propose a novel UCB-type algorithm for corrupted bandits, namely HubUCB, that builds on Huber's estimator for robust mean estimation. Leveraging a novel concentration inequality of Huber's estimator, we prove that HubUCB achieves a near-optimal regret upper bound. Since computing Huber's estimator has quadratic complexity, we further introduce a sequential version of Huber's estimator that exhibits linear complexity. We leverage this sequential estimator to design SeqHubUCB that enjoys similar regret guarantees while reducing the computational burden. Finally, we experimentally illustrate the efficiency of HubUCB and SeqHubUCB in solving corrupted bandits for different reward distributions and different levels of corruptions.

下载PDF全文

下载文献需遵守相关版权规定

论文标题