Paper Title

Regret Balancing for Bandit and RL Model Selection

Authors

Yasin Abbasi-Yadkori, Aldo Pacchiano, My Phan

Abstract

We consider model selection in stochastic bandit and reinforcement learning problems. Given a set of base learning algorithms, an effective model selection strategy adapts to the best learning algorithm in an online fashion. We show that by estimating the regret of each algorithm and playing the algorithms such that all empirical regrets are ensured to be of the same order, the overall regret balancing strategy achieves a regret that is close to the regret of the optimal base algorithm. Our strategy requires an upper bound on the optimal base regret as input, and the performance of the strategy depends on the tightness of the upper bound. We show that having this prior knowledge is necessary in order to achieve a near-optimal regret. Further, we show that any near-optimal model selection strategy implicitly performs a form of regret balancing.
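To make the mechanism concrete, below is a minimal Python sketch of a simplified variant of regret balancing: each base learner comes with a putative regret bound as input, and the meta-strategy always plays the learner whose putative bound is currently smallest, which keeps all bounds of the same order. The class `EpsGreedyLearner`, the function `regret_balancing`, and the square-root-shaped candidate bounds are illustrative assumptions for this sketch, not the paper's exact construction, which balances estimated empirical regrets.

```python
import math
import random

class EpsGreedyLearner:
    """Toy base learner: epsilon-greedy over a fixed set of arms."""
    def __init__(self, eps, n_arms):
        self.eps = eps
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms

    def select_arm(self):
        # Explore with probability eps, or while some arm is unplayed.
        if random.random() < self.eps or 0 in self.counts:
            return random.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.means[a])

    def update(self, arm, reward):
        # Incremental update of the empirical mean reward of `arm`.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

def regret_balancing(learners, bounds, pull_arm, horizon):
    """Simplified regret balancing (an assumption-level sketch):
    at each round, play the base learner i minimizing its putative
    regret bound B_i(n_i + 1), where n_i counts its plays so far.
    This keeps all putative regrets of the same order."""
    plays = [0] * len(learners)
    for _ in range(horizon):
        i = min(range(len(learners)),
                key=lambda j: bounds[j](plays[j] + 1))
        arm = learners[i].select_arm()
        reward = pull_arm(arm)
        learners[i].update(arm, reward)
        plays[i] += 1
    return plays

if __name__ == "__main__":
    # Hypothetical Bernoulli bandit; the means are unknown to the learners.
    true_means = [0.3, 0.5, 0.7]
    pull = lambda a: 1.0 if random.random() < true_means[a] else 0.0
    learners = [EpsGreedyLearner(0.05, 3), EpsGreedyLearner(0.5, 3)]
    # Candidate bounds of the form c * sqrt(n): the strategy requires
    # such an upper bound on the optimal base regret as input, and its
    # performance depends on how tight that bound is.
    bounds = [lambda n: 2.0 * math.sqrt(n), lambda n: 8.0 * math.sqrt(n)]
    print(regret_balancing(learners, bounds, pull, horizon=2000))
```

In this sketch the learner with the smaller candidate bound is played more often, mirroring the abstract's point that the overall regret tracks the regret of the best base algorithm when the supplied bound is tight.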
