Title
A Short Note on Soft-max and Policy Gradients in Bandits Problems
Author
Abstract
This is a short communication on a Lyapunov function argument for soft-max in bandit problems. A number of excellent papers have recently used differential equations to analyze policy gradient algorithms in reinforcement learning \cite{agarwal2019optimality,bhandari2019global,mei2020global}. We give a short argument that yields a regret bound for the soft-max ordinary differential equation in bandit problems. We derive a similar result for a different policy gradient algorithm, again for bandit problems. For this second algorithm, it is possible to prove regret bounds in the stochastic case \cite{DW20}. At the end, we summarize some ideas and open issues on deriving stochastic regret bounds for policy gradients.
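As a rough illustration of the setting the abstract describes (not code from the note itself), the soft-max policy-gradient ODE on a multi-armed bandit can be sketched by an Euler discretization. The reward vector `r`, step size `eta`, and iteration count below are illustrative assumptions; the dynamics `d(theta_i)/dt = pi_i (r_i - pi . r)` is the standard gradient of the expected reward `J(theta) = pi . r` under the soft-max parameterization.

```python
import numpy as np

# Assumed 3-armed bandit with known mean rewards (illustrative values,
# not taken from the paper).
r = np.array([0.2, 0.5, 0.9])
theta = np.zeros(3)   # soft-max parameters, uniform initial policy
eta = 0.1             # Euler step size for discretizing the ODE

def softmax(z):
    z = z - z.max()   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

for _ in range(5000):
    pi = softmax(theta)
    # gradient of J(theta) = pi . r with respect to theta:
    grad = pi * (r - pi @ r)
    theta += eta * grad

pi = softmax(theta)
# The policy concentrates on the best arm, so the instantaneous
# regret r.max() - pi @ r shrinks over time.
print(pi.argmax(), r.max() - pi @ r)
```

In the deterministic (ODE) setting the expected reward `pi @ r` is nondecreasing along the flow, which is the kind of Lyapunov-style behavior the regret argument exploits; the stochastic case, where `r` is only observed through noisy samples, is where the abstract notes the remaining difficulties.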