Paper Title

Regret Analysis of a Markov Policy Gradient Algorithm for Multi-arm Bandits

Authors

Denis Denisov, Neil Walton

Abstract

We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm, rather than use a deterministic time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster-Lyapunov techniques to analyse the stability of this Markov chain. We prove that if learning rates are well chosen then the policy gradient algorithm is a transient Markov chain and the state of the chain converges on the optimal arm with logarithmic or poly-logarithmic regret.
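To make the setting concrete, here is a minimal sketch of a policy-gradient bandit whose state is a point on the probability simplex, in the spirit of the abstract. It uses a simple linear reward-inaction style update with Bernoulli rewards; the step size `eta` and the update rule are illustrative assumptions, not the paper's exact algorithm or learning-rate schedule.

```python
import random

def policy_gradient_bandit(means, steps, eta=0.1, seed=0):
    """Sketch of a policy-gradient bandit on the probability simplex.

    means : Bernoulli success probability of each arm.
    eta   : base step size (illustrative; the paper studies
            state-dependent learning rates).
    The state p is a probability vector over arms; it forms a
    Markov chain on the simplex, as described in the abstract.
    """
    rng = random.Random(seed)
    n = len(means)
    p = [1.0 / n] * n  # start at the uniform distribution
    for _ in range(steps):
        # Sample an arm from the current policy, draw a Bernoulli reward.
        arm = rng.choices(range(n), weights=p)[0]
        reward = 1.0 if rng.random() < means[arm] else 0.0
        # Gradient-style update; the increments sum to zero, so the
        # state stays on the simplex.
        step = eta * reward
        p = [pi + step * ((1.0 if i == arm else 0.0) - pi)
             for i, pi in enumerate(p)]
    return p
```

When the reward gap between arms is large, the state concentrates on the optimal arm as the abstract describes; with a degenerate instance such as `means=[1.0, 0.0]`, the first coordinate of the state approaches 1.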
