Paper Title
Exploration via Elliptical Episodic Bonuses
Paper Authors
Paper Abstract
In recent years, a number of reinforcement learning (RL) methods have been proposed to explore complex environments which differ across episodes. In this work, we show that the effectiveness of these methods critically relies on a count-based episodic term in their exploration bonus. As a result, despite their success in relatively simple, noise-free settings, these methods fall short in more realistic scenarios where the state space is vast and prone to noise. To address this limitation, we introduce Exploration via Elliptical Episodic Bonuses (E3B), a new method which extends count-based episodic bonuses to continuous state spaces and encourages an agent to explore states that are diverse under a learned embedding within each episode. The embedding is learned using an inverse dynamics model in order to capture controllable aspects of the environment. Our method sets a new state-of-the-art across 16 challenging tasks from the MiniHack suite, without requiring task-specific inductive biases. E3B also matches existing methods on sparse reward, pixel-based VizDoom environments, and outperforms existing methods in reward-free exploration on Habitat, demonstrating that it can scale to high-dimensional pixel-based observations and realistic environments.
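Below is a minimal, illustrative sketch in Python of the elliptical episodic bonus idea described in the abstract: the agent receives an intrinsic bonus that is large when the current state's embedding points in a direction not yet covered by the embeddings seen earlier in the same episode. The embedding function, embedding dimension, and ridge regularizer here are assumed placeholders (a random vector stands in for an inverse-dynamics embedding phi(s_t)); this is a sketch of the general technique, not the authors' reference implementation.

import numpy as np

def elliptical_bonus(phi_s, C_inv):
    # Intrinsic bonus for embedding phi_s, given the inverse of the episode's
    # regularized ellipse matrix C = lam * I + sum_i phi_i phi_i^T.
    # The bonus is large when phi_s lies in a direction not yet visited this episode.
    return float(phi_s @ C_inv @ phi_s)

def sherman_morrison_update(C_inv, phi_s):
    # Rank-one update of C_inv after adding phi_s phi_s^T to the ellipse matrix.
    u = C_inv @ phi_s
    return C_inv - np.outer(u, u) / (1.0 + phi_s @ u)

# Illustrative episode loop; dim and lam are assumed hyperparameters.
dim, lam = 8, 0.1
C_inv = np.eye(dim) / lam  # reset at the start of every episode
rng = np.random.default_rng(0)
for t in range(5):
    phi_s = rng.normal(size=dim)  # stand-in for an inverse-dynamics embedding phi(s_t)
    bonus = elliptical_bonus(phi_s, C_inv)
    C_inv = sherman_morrison_update(C_inv, phi_s)
    print(f"step {t}: intrinsic bonus = {bonus:.3f}")

Resetting the ellipse matrix at each episode boundary is what makes the bonus episodic, in contrast to lifelong novelty bonuses that accumulate over all of training.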