Paper Title

Align-RUDDER: Learning From Few Demonstrations by Reward Redistribution

Paper Authors

Patil, Vihang P., Hofmarcher, Markus, Dinu, Marius-Constantin, Dorfer, Matthias, Blies, Patrick M., Brandstetter, Johannes, Arjona-Medina, Jose A., Hochreiter, Sepp

Paper Abstract

Reinforcement learning algorithms require many samples when solving complex hierarchical tasks with sparse and delayed rewards. For such complex tasks, the recently proposed RUDDER uses reward redistribution to leverage steps in the Q-function that are associated with accomplishing sub-tasks. However, often only a few episodes with high rewards are available as demonstrations, since current exploration strategies cannot discover them in a reasonable time. In this work, we introduce Align-RUDDER, which utilizes a profile model for reward redistribution that is obtained from multiple sequence alignment of demonstrations. Consequently, Align-RUDDER employs reward redistribution effectively and thereby drastically improves learning from few demonstrations. Align-RUDDER outperforms competitors on complex artificial tasks with delayed rewards and few demonstrations. On the Minecraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently. Code is available at https://github.com/ml-jku/align-rudder. YouTube: https://youtu.be/HO-_8ZUl-UY
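To make the core idea concrete, the following is a minimal, hypothetical Python sketch (not the authors' implementation) of reward redistribution driven by a profile built from demonstrations. It assumes demonstrations have already been abstracted into discrete event sequences and pre-aligned to a common length; the actual Align-RUDDER pipeline derives the profile via multiple sequence alignment, as in bioinformatics. All function names and the scoring scheme here are illustrative.

    # Minimal sketch (assumptions labeled above): build a column-wise event
    # profile from pre-aligned demonstrations, then spread a delayed episodic
    # return over steps in proportion to each step's gain in profile alignment.
    from collections import Counter

    def build_profile(aligned_demos):
        """Column-wise event frequencies over pre-aligned demonstrations."""
        length = len(aligned_demos[0])
        return [Counter(demo[i] for demo in aligned_demos) for i in range(length)]

    def score_prefix(events, profile):
        """How well an event prefix matches the profile (higher = closer to demos)."""
        n_demos = sum(profile[0].values())
        return sum(profile[i][e] / n_demos
                   for i, e in enumerate(events) if i < len(profile))

    def redistribute_reward(episode_events, episode_return, profile):
        """Redistribute the episodic return over steps, proportional to the
        increase in profile-alignment score each step contributes."""
        scores = [score_prefix(episode_events[:t + 1], profile)
                  for t in range(len(episode_events))]
        deltas = [scores[0]] + [scores[t] - scores[t - 1]
                                for t in range(1, len(scores))]
        total = sum(deltas) or 1.0
        return [episode_return * d / total for d in deltas]

    # Example: two demonstrations of sub-task events (hypothetical event names).
    demos = [["wood", "table", "pickaxe", "diamond"],
             ["wood", "table", "pickaxe", "diamond"]]
    profile = build_profile(demos)
    print(redistribute_reward(["wood", "table", "pickaxe", "diamond"], 1.0, profile))

In this toy case the delayed return of 1.0 is split evenly across the four sub-task steps, because each step advances the alignment score equally; with noisier episodes, steps matching the demonstration profile would receive proportionally more of the redistributed reward.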
