Paper Title

Norm-Agnostic Linear Bandits

Paper Authors

Spencer Gales, Sunder Sethuraman, Kwang-Sung Jun

Paper Abstract

Linear bandits have a wide variety of applications, including recommendation systems, yet they make one strong assumption: the algorithm must know an upper bound $S$ on the norm of the unknown parameter $θ^*$ that governs reward generation. Such an assumption forces the practitioner to guess the $S$ that appears in the confidence bound, leaving no choice but to hope that $\|θ^*\|\le S$ is true so that the regret is guaranteed to be low. In this paper, we propose novel algorithms that, for the first time, do not require such knowledge. Specifically, we propose two algorithms and analyze their regret bounds: one for the changing arm set setting and one for the fixed arm set setting. Our regret bound for the former shows that the price of not knowing $S$ does not affect the leading term in the regret bound and inflates only the lower-order term. For the latter, we pay no price in the regret for not knowing $S$. Our numerical experiments show that standard algorithms assuming knowledge of $S$ can fail catastrophically when $\|θ^*\|\le S$ does not hold, whereas our algorithms enjoy low regret.
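For context, the norm bound $S$ enters standard OFUL/LinUCB-style baselines through the radius of the confidence ellipsoid, which is where an underestimated $S$ causes the failures the abstract refers to. The sketch below is a minimal illustration of that baseline mechanism, not the paper's proposed norm-agnostic method; the names `lam` (ridge parameter), `sigma` (sub-Gaussian noise level), and `delta` (failure probability) are illustrative assumptions.

```python
import numpy as np

def oful_radius(V, lam, S, sigma, delta):
    # OFUL-style radius: sigma * sqrt(2 * log(det(V)^{1/2} / (lam^{d/2} * delta))) + sqrt(lam) * S.
    # The assumed norm bound S enters additively: if S < ||theta*||, the
    # confidence ellipsoid can exclude theta* and the regret guarantee breaks.
    d = V.shape[0]
    _, logdet = np.linalg.slogdet(V)
    log_term = 0.5 * logdet - 0.5 * d * np.log(lam) - np.log(delta)
    return sigma * np.sqrt(2.0 * max(log_term, 0.0)) + np.sqrt(lam) * S

def linucb_step(arms, V, b, lam, S, sigma=1.0, delta=0.05):
    # arms: (K, d) feature vectors; V: ridge Gram matrix lam*I + sum x x^T;
    # b: sum of x_t * r_t.  Returns the index of the optimistic arm.
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b  # ridge-regression estimate of theta*
    beta = oful_radius(V, lam, S, sigma, delta)
    bonus = np.sqrt(np.einsum("ki,ij,kj->k", arms, V_inv, arms))  # ||x||_{V^{-1}}
    return int(np.argmax(arms @ theta_hat + beta * bonus))

# Toy usage: d = 2, no data yet; a misspecified small S would shrink beta and the bonus.
arms = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
V, b = np.eye(2), np.zeros(2)
print(linucb_step(arms, V, b, lam=1.0, S=1.0))
```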
