Paper Title
Model-based Reinforcement Learning with Multi-step Plan Value Estimation
Paper Authors
Paper Abstract
A promising way to improve the sample efficiency of reinforcement learning is model-based methods, in which much of the exploration and evaluation can happen in a learned model to save real-world samples. However, when the learned model has non-negligible model error, sequential steps in the model are hard to evaluate accurately, limiting the model's utilization. This paper proposes to alleviate this issue by introducing multi-step plans to replace multi-step actions in model-based RL. We employ multi-step plan value estimation, which evaluates the expected discounted return after executing a sequence of action plans at a given state, and update the policy by directly computing the multi-step policy gradient via plan value estimation. The new model-based reinforcement learning algorithm, MPPVE (Model-based Planning Policy Learning with Multi-step Plan Value Estimation), makes better use of the learned model and achieves better sample efficiency than state-of-the-art model-based RL approaches.
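A minimal sketch of the idea, assuming a plan length k, discount factor γ, and a bootstrapped value V^π (these symbols are our illustration, not notation taken from the paper): the plan value of executing the k-step plan a_{t:t+k-1} from state s_t, and the corresponding multi-step policy gradient, could be written as

\[
Q^{\pi}\bigl(s_t, a_{t:t+k-1}\bigr)
= \mathbb{E}\left[\sum_{i=0}^{k-1} \gamma^{i}\, r(s_{t+i}, a_{t+i})
+ \gamma^{k}\, V^{\pi}(s_{t+k})\right],
\qquad
\nabla_{\theta} J(\theta) \approx \mathbb{E}_{s}\left[\nabla_{\theta}\, Q^{\pi}\bigl(s, \pi_{\theta}(s)\bigr)\right],
\]

where π_θ(s) outputs the entire k-step plan, so the gradient flows through a single plan value rather than through k successive model-predicted transitions, which is how the approach reduces exposure to compounding model error.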