Paper Title
Redeeming Intrinsic Rewards via Constrained Optimization
Paper Authors
Paper Abstract
State-of-the-art reinforcement learning (RL) algorithms typically use random sampling (e.g., $ε$-greedy) for exploration, but this method fails on hard exploration tasks like Montezuma's Revenge. To address the challenge of exploration, prior works incentivize exploration by rewarding the agent when it visits novel states. Such intrinsic rewards (also called exploration bonuses or curiosity) often lead to excellent performance on hard exploration tasks. However, on easy exploration tasks, the agent gets distracted by intrinsic rewards and performs unnecessary exploration even when sufficient task (also called extrinsic) reward is available. Consequently, such an overly curious agent performs worse than an agent trained with only the task reward. Such inconsistency in performance across tasks prevents the widespread use of intrinsic rewards with RL algorithms. We propose a principled constrained optimization procedure called Extrinsic-Intrinsic Policy Optimization (EIPO) that automatically tunes the importance of the intrinsic reward: it suppresses the intrinsic reward when exploration is unnecessary and increases it when exploration is required. The result is superior exploration that does not require manual tuning of the balance between the intrinsic reward and the task reward. Consistent performance gains across sixty-one Atari games validate our claim. The code is available at https://github.com/Improbable-AI/eipo.
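To make the idea of automatically tuning the intrinsic-reward weight more concrete, below is a minimal Python sketch of a Lagrangian-style (dual) coefficient update, written under one plausible reading of the constrained objective sketched in the abstract: a curiosity-driven policy should not fall behind a task-reward-only policy in extrinsic return. This is not the authors' EIPO implementation (see the linked repository for that); the function names `mixed_reward` and `update_alpha`, the sign-based update rule, the step size, and the clipping range are all illustrative assumptions.

```python
# Hypothetical sketch: a dual-style update of the intrinsic-reward coefficient,
# in the spirit of the constrained objective described in the abstract.
# Not the authors' EIPO code; names, update rule, and hyperparameters are assumptions.

import numpy as np


def mixed_reward(r_ext: np.ndarray, r_int: np.ndarray, alpha: float) -> np.ndarray:
    """Combine task (extrinsic) and exploration (intrinsic) rewards.

    The coefficient `alpha` controls how strongly exploration is rewarded;
    the dual update below raises or lowers it automatically during training.
    """
    return r_ext + alpha * r_int


def update_alpha(alpha: float,
                 ext_return_mixed: float,
                 ext_return_task_only: float,
                 step_size: float = 0.01) -> float:
    """Dual-style update of the intrinsic-reward coefficient.

    If the mixed (curious) policy's extrinsic return falls below that of a
    task-reward-only reference policy, the constraint is treated as violated
    and alpha is decreased (less exploration); otherwise alpha is allowed to grow.
    """
    constraint_gap = ext_return_mixed - ext_return_task_only
    alpha = alpha + step_size * np.sign(constraint_gap)
    return float(np.clip(alpha, 0.0, 1.0))


if __name__ == "__main__":
    alpha = 0.5
    # Dummy per-step rewards and rollout statistics standing in for
    # policy-evaluation estimates from an actual RL training loop.
    r_ext = np.array([1.0, 0.0, 0.5])
    r_int = np.array([0.2, 0.8, 0.1])
    print("mixed rewards:", mixed_reward(r_ext, r_int, alpha))
    for ext_mixed, ext_task in [(8.0, 10.0), (9.5, 9.0), (12.0, 9.0)]:
        alpha = update_alpha(alpha, ext_mixed, ext_task)
        print(f"alpha = {alpha:.2f}")
```

In this sketch, alpha shrinks when exploration hurts extrinsic performance and grows when it does not, which mirrors the suppress-or-increase behavior the abstract attributes to EIPO, though the paper's actual procedure is a constrained policy-optimization scheme rather than this simple scalar update.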