Paper Title
Hallucinating Value: A Pitfall of Dyna-style Planning with Imperfect Environment Models
Paper Authors
Paper Abstract
Dyna-style reinforcement learning (RL) agents improve sample efficiency over model-free RL agents by updating the value function with simulated experience generated by an environment model. However, it is often difficult to learn accurate models of environment dynamics, and even small model errors may cause Dyna agents to fail. In this paper, we investigate one type of model error: hallucinated states. These are states generated by the model that are not real states of the environment. We present the Hallucinated Value Hypothesis (HVH): updating the values of real states toward the values of hallucinated states produces misleading state-action values, which adversely affect the control policy. We discuss and evaluate four Dyna variants: three that update real states toward simulated -- and therefore potentially hallucinated -- states, and one that does not. The experimental results provide evidence for the HVH, suggesting a fruitful direction for developing Dyna algorithms that are robust to model error.