Paper Title

Reinforcement learning with Demonstrations from Mismatched Task under Sparse Reward

Authors

Yanjiang Guo, Jingyue Gao, Zheng Wu, Chengming Shi, Jianyu Chen

Abstract

Reinforcement learning often suffers from the sparse reward issue in real-world robotics problems. Learning from demonstration (LfD) is an effective way to eliminate this problem: it leverages collected expert data to aid online learning. Prior works often assume that the learning agent and the expert aim to accomplish the same task, which requires collecting new data for every new task. In this paper, we consider the case where the target task is mismatched from, but similar to, that of the expert. Such a setting can be challenging, and we find that existing LfD methods cannot effectively guide learning in mismatched new tasks with sparse rewards. We propose conservative reward shaping from demonstration (CRSfD), which shapes the sparse rewards using the estimated expert value function. To accelerate the learning process, CRSfD guides the agent to conservatively explore around demonstrations. Experimental results on robot manipulation tasks show that our approach outperforms baseline LfD methods when transferring demonstrations collected in a single task to other different but similar tasks.
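To make the reward-shaping idea in the abstract concrete, below is a minimal Python sketch of potential-based reward shaping that uses an estimated expert value function as the shaping potential. This is only a reading aid under stated assumptions, not the paper's implementation: the names `shaped_reward`, `expert_value_fn`, and `gamma`, and the choice of a distance-based dummy value estimate, are hypothetical.

```python
import numpy as np

# Minimal sketch (assumption): potential-based reward shaping where the
# estimated expert value function serves as the potential. Names and the
# dummy value estimate are illustrative, not taken from the paper.

def shaped_reward(sparse_reward, state, next_state, expert_value_fn, gamma=0.99):
    """Augment a sparse environment reward with a potential-based term
    computed from the estimated expert value function."""
    potential = expert_value_fn(state)
    next_potential = expert_value_fn(next_state)
    return sparse_reward + gamma * next_potential - potential

# Example usage with a placeholder value estimate standing in for a critic
# fit on expert demonstrations (here: negative distance to the origin).
dummy_value_fn = lambda s: -float(np.linalg.norm(s))
r = shaped_reward(0.0, np.array([0.5, 0.5]), np.array([0.3, 0.3]), dummy_value_fn)
print(r)
```

A potential-based term of this form only adds a difference of potentials along transitions, so it densifies the sparse reward signal around states the expert values highly while leaving the underlying task objective intact.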
