Title
A Short Note on Soft-max and Policy Gradients in Bandits Problems
Author
Abstract
This is a short communication on a Lyapunov function argument for soft-max in bandit problems. A number of excellent papers have recently used differential equations to analyze policy gradient algorithms in reinforcement learning \cite{agarwal2019optimality,bhandari2019global,mei2020global}. We give a short argument that yields a regret bound for the soft-max ordinary differential equation in bandit problems. We derive a similar result for a different policy gradient algorithm, again for bandit problems. For this second algorithm, it is possible to prove regret bounds in the stochastic case \cite{DW20}. At the end, we summarize some ideas and open issues on deriving stochastic regret bounds for policy gradients.
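As a rough illustration of the setting the abstract describes (not code from the note itself), the soft-max policy-gradient ODE on a multi-armed bandit can be sketched by an Euler discretization. The reward vector `r`, step size `eta`, and iteration count below are illustrative assumptions; the dynamics `d(theta_i)/dt = pi_i (r_i - pi . r)` is the standard gradient of the expected reward `J(theta) = pi . r` under the soft-max parameterization.

```python
import numpy as np

# Assumed 3-armed bandit with known mean rewards (illustrative values,
# not taken from the paper).
r = np.array([0.2, 0.5, 0.9])
theta = np.zeros(3)   # soft-max parameters, uniform initial policy
eta = 0.1             # Euler step size for discretizing the ODE

def softmax(z):
    z = z - z.max()   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

for _ in range(5000):
    pi = softmax(theta)
    # gradient of J(theta) = pi . r with respect to theta:
    grad = pi * (r - pi @ r)
    theta += eta * grad

pi = softmax(theta)
# The policy concentrates on the best arm, so the instantaneous
# regret r.max() - pi @ r shrinks over time.
print(pi.argmax(), r.max() - pi @ r)
```

In the deterministic (ODE) setting the expected reward `pi @ r` is nondecreasing along the flow, which is the kind of Lyapunov-style behavior the regret argument exploits; the stochastic case, where `r` is only observed through noisy samples, is where the abstract notes the remaining difficulties.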