Paper Title

LECO: Learnable Episodic Count for Task-Specific Intrinsic Reward

Authors

Daejin Jo, Sungwoong Kim, Daniel Wontae Nam, Taehwan Kwon, Seungeun Rho, Jongmin Kim, Donghoon Lee

Abstract

Episodic count has been widely used to design a simple yet effective intrinsic motivation for reinforcement learning with a sparse reward. However, the use of episodic count in a high-dimensional state space as well as over a long episode time requires thorough state compression and fast hashing, which hinders its rigorous exploitation in such hard and complex exploration environments. Moreover, interference from task-irrelevant observations in the episodic count may cause its intrinsic motivation to overlook important task-related changes of state, and novelty measured in an episodic manner can lead to repeatedly revisiting familiar states across episodes. In order to resolve these issues, in this paper, we propose a learnable hash-based episodic count, which we name LECO, that efficiently performs as a task-specific intrinsic reward in hard exploration problems. In particular, the proposed intrinsic reward consists of the episodic novelty and the task-specific modulation, where the former employs a vector quantized variational autoencoder to automatically obtain discrete state codes for fast counting, while the latter regulates the episodic novelty by learning a modulator to optimize the task-specific extrinsic reward. The proposed LECO specifically enables the automatic transition from exploration to exploitation during reinforcement learning. We experimentally show that, in contrast to previous exploration methods, LECO successfully solves hard exploration problems and also scales to large state spaces, as demonstrated on the most difficult tasks in MiniGrid and DMLab environments.
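To make the count-based episodic novelty term concrete, the following is a minimal sketch of the generic mechanism the abstract describes: map each observation to a discrete code, count code visits within the current episode, and emit a bonus that decays with the count. This is not the paper's implementation; the VQ-VAE encoder and the learned task-specific modulator are replaced here by a hypothetical user-supplied `encode` function, and the common `1/sqrt(n)` bonus is an assumption.

```python
from collections import defaultdict
import math


class EpisodicCountReward:
    """Count-based episodic intrinsic reward over discrete state codes.

    In LECO the discrete codes come from a VQ-VAE; here `encode` is a
    hypothetical stand-in mapping a state to any hashable code.
    """

    def __init__(self, encode):
        self.encode = encode            # state -> hashable discrete code
        self.counts = defaultdict(int)  # per-episode visit counts

    def reset(self):
        # Counts are episodic: clear them at the start of each episode,
        # so familiar states become "novel" again in the next episode.
        self.counts.clear()

    def intrinsic_reward(self, state):
        code = self.encode(state)
        self.counts[code] += 1
        # Assumed count-based bonus: novelty decays as 1/sqrt(n).
        return 1.0 / math.sqrt(self.counts[code])


# Toy usage: round continuous observations into coarse discrete codes.
bonus = EpisodicCountReward(lambda s: tuple(round(x) for x in s))
r1 = bonus.intrinsic_reward((0.1, 0.2))   # first visit to code (0, 0)
r2 = bonus.intrinsic_reward((0.05, 0.1))  # same code, second visit
```

The per-episode reset is what distinguishes an episodic count from a lifelong count; the paper's task-specific modulator would additionally rescale this bonus to keep it aligned with the extrinsic reward.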
