Paper Title
Learning Relative Return Policies With Upside-Down Reinforcement Learning
Paper Authors
Paper Abstract
Lately, there has been a resurgence of interest in using supervised learning to solve reinforcement learning problems. Recent work in this area has largely focused on learning command-conditioned policies. We investigate the potential of one such method -- upside-down reinforcement learning -- to work with commands that specify a desired relationship between some scalar value and the observed return. We show that upside-down reinforcement learning can learn to carry out such commands online in a tabular bandit setting and in CartPole with non-linear function approximation. By doing so, we demonstrate the power of this family of methods and open the way for their practical use under more complicated command structures.
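To make the command-conditioned setup concrete, below is a minimal sketch of an upside-down-RL-style learner in a two-armed bandit, where each command asks for the observed return to be above or below a scalar threshold. This is not the paper's implementation; the arm probabilities, the command set, and the tabular counting update are all assumptions chosen only to illustrate the idea of learning to carry out relative-return commands via a supervised-style update.

```python
import numpy as np

# Illustrative sketch only: a command-conditioned learner in a 2-armed
# Bernoulli bandit. Commands have the form "make the return {relation}
# threshold". All constants and names here are assumptions, not the
# authors' setup.

rng = np.random.default_rng(0)
ARM_PROBS = [0.2, 0.8]          # hypothetical per-arm reward probabilities
RELATIONS = [">=", "<"]         # relation part of the command
THRESHOLDS = [0.5]              # scalar part of the command (returns are 0/1)

# Tabular "policy": how often each arm satisfied each command (Laplace prior).
success = np.ones((len(RELATIONS), len(THRESHOLDS), 2))
trials = np.full((len(RELATIONS), len(THRESHOLDS), 2), 2.0)

def satisfies(ret, rel_idx, thr):
    return ret >= thr if RELATIONS[rel_idx] == ">=" else ret < thr

for step in range(2000):
    rel = rng.integers(len(RELATIONS))        # sample a command to follow
    thr_idx = rng.integers(len(THRESHOLDS))
    thr = THRESHOLDS[thr_idx]

    # Act: mostly pick the arm most likely to satisfy the sampled command.
    probs = success[rel, thr_idx] / trials[rel, thr_idx]
    arm = int(np.argmax(probs)) if rng.random() > 0.1 else int(rng.integers(2))

    ret = float(rng.random() < ARM_PROBS[arm])  # observed return (0 or 1)

    # Supervised-style update with relabeling: one observed return tells us
    # which commands this arm would have satisfied, so update all of them.
    for r_i in range(len(RELATIONS)):
        for t_i, t in enumerate(THRESHOLDS):
            trials[r_i, t_i, arm] += 1
            success[r_i, t_i, arm] += satisfies(ret, r_i, t)

# In this toy setting the learner ends up choosing arm 1 for the command
# "return >= 0.5" and arm 0 for "return < 0.5".
print(success / trials)
```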