Paper Title

Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping

Paper Authors

Yujing Hu, Weixun Wang, Hangtian Jia, Yixiang Wang, Yingfeng Chen, Jianye Hao, Feng Wu, Changjie Fan

Paper Abstract

Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL). Existing approaches such as potential-based reward shaping normally make full use of a given shaping reward function. However, since the transformation of human knowledge into numeric reward values is often imperfect due to reasons such as human cognitive bias, completely utilizing the shaping reward function may fail to improve the performance of RL algorithms. In this paper, we consider the problem of adaptively utilizing a given shaping reward function. We formulate the utilization of shaping rewards as a bi-level optimization problem, where the lower level is to optimize policy using the shaping rewards and the upper level is to optimize a parameterized shaping weight function for true reward maximization. We formally derive the gradient of the expected true reward with respect to the shaping weight function parameters and accordingly propose three learning algorithms based on different assumptions. Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards, and meanwhile ignore unbeneficial shaping rewards or even transform them into beneficial ones.
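
To make the bi-level structure described in the abstract concrete, the following is a minimal sketch of the formulation; the notation (true reward $r$, given shaping reward $f$, parameterized shaping weight function $z_{\phi}$, policy $\pi_{\theta}$) is chosen here for illustration and is not necessarily the paper's own.

$$
\begin{aligned}
&\text{Upper level:}\quad \max_{\phi}\; J\big(\theta^{*}(\phi)\big) \;=\; \mathbb{E}_{\tau \sim \pi_{\theta^{*}(\phi)}}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],\\[4pt]
&\text{Lower level:}\quad \theta^{*}(\phi) \;\in\; \arg\max_{\theta}\; \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\,\Big(r(s_t, a_t) + z_{\phi}(s_t, a_t)\, f(s_t, a_t)\Big)\right].
\end{aligned}
$$

In this sketch, the gradient derived in the paper corresponds to differentiating the upper-level objective with respect to $\phi$ through the dependence of the lower-level solution $\theta^{*}(\phi)$ on $\phi$. Setting $z_{\phi}\equiv 1$ recovers ordinary reward shaping with the full shaping reward, while $z_{\phi}\equiv 0$ ignores it entirely; learning $z_{\phi}$ lets the agent interpolate adaptively between (and beyond) these two extremes.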
