Paper Title

Regret Analysis of a Markov Policy Gradient Algorithm for Multi-arm Bandits

Authors

Denis Denisov, Neil Walton

Abstract

We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm, rather than use a deterministic time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster-Lyapunov techniques to analyse the stability of this Markov chain. We prove that if learning rates are well chosen then the policy gradient algorithm is a transient Markov chain and the state of the chain converges on the optimal arm with logarithmic or poly-logarithmic regret.
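To make the setting concrete, here is a minimal sketch of a policy-gradient bandit whose state is a point on the probability simplex, in the spirit of the abstract. It uses a simple linear reward-inaction style update with Bernoulli rewards; the step size `eta` and the update rule are illustrative assumptions, not the paper's exact algorithm or learning-rate schedule.

```python
import random

def policy_gradient_bandit(means, steps, eta=0.1, seed=0):
    """Sketch of a policy-gradient bandit on the probability simplex.

    means : Bernoulli success probability of each arm.
    eta   : base step size (illustrative; the paper studies
            state-dependent learning rates).
    The state p is a probability vector over arms; it forms a
    Markov chain on the simplex, as described in the abstract.
    """
    rng = random.Random(seed)
    n = len(means)
    p = [1.0 / n] * n  # start at the uniform distribution
    for _ in range(steps):
        # Sample an arm from the current policy, draw a Bernoulli reward.
        arm = rng.choices(range(n), weights=p)[0]
        reward = 1.0 if rng.random() < means[arm] else 0.0
        # Gradient-style update; the increments sum to zero, so the
        # state stays on the simplex.
        step = eta * reward
        p = [pi + step * ((1.0 if i == arm else 0.0) - pi)
             for i, pi in enumerate(p)]
    return p
```

When the reward gap between arms is large, the state concentrates on the optimal arm as the abstract describes; with a degenerate instance such as `means=[1.0, 0.0]`, the first coordinate of the state approaches 1.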
