Paper Title
On the model-based stochastic value gradient for continuous reinforcement learning
Paper Authors
Paper Abstract
For over a decade, model-based reinforcement learning has been seen as a way to leverage control-based domain knowledge to improve the sample-efficiency of reinforcement learning agents. While model-based agents are conceptually appealing, their policies tend to lag behind those of model-free agents in terms of final reward, especially in non-trivial environments. In response, researchers have proposed model-based agents with increasingly complex components, from ensembles of probabilistic dynamics models, to heuristics for mitigating model error. In a reversal of this trend, we show that simple model-based agents can be derived from existing ideas that not only match, but outperform state-of-the-art model-free agents in terms of both sample-efficiency and final reward. We find that a model-free soft value estimate for policy evaluation and a model-based stochastic value gradient for policy improvement is an effective combination, achieving state-of-the-art results on a high-dimensional humanoid control task, which most model-based agents are unable to solve. Our findings suggest that model-based policy evaluation deserves closer attention.
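To make the described combination concrete, below is a minimal sketch (in PyTorch) of a policy-improvement step in the spirit of a model-based stochastic value gradient: a reparameterized policy is rolled through a learned dynamics model for a few steps, and the rollout is closed with a model-free soft value estimate, so the gradient of the short-horizon return flows back to the policy parameters through the model. All names and hyperparameters here (make_mlp, Policy, svg_policy_loss, horizon, alpha) are illustrative assumptions for this sketch, not the paper's implementation.

import torch
import torch.nn as nn


def make_mlp(in_dim, out_dim, hidden=64):
    # Small MLP used as a stand-in for the dynamics, reward, and soft-value networks.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))


class Policy(nn.Module):
    # Gaussian policy with reparameterized sampling, so gradients flow through actions.
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = make_mlp(obs_dim, 2 * act_dim)

    def sample(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
        action = dist.rsample()                      # pathwise (reparameterized) sample
        return action, dist.log_prob(action).sum(-1)


def svg_policy_loss(obs, policy, dynamics, reward, soft_value,
                    horizon=3, alpha=0.1, gamma=0.99):
    # Roll the policy through the learned dynamics model for `horizon` steps,
    # accumulating entropy-regularized model rewards, then close the rollout with
    # the model-free soft value estimate. The loss is the negated return, so
    # gradient descent performs policy improvement (ascent on the value).
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        action, log_prob = policy.sample(obs)
        sa = torch.cat([obs, action], dim=-1)
        ret = ret + discount * (reward(sa).squeeze(-1) - alpha * log_prob)
        obs = obs + dynamics(sa)                     # model predicts the state delta
        discount *= gamma
    action, log_prob = policy.sample(obs)
    sa = torch.cat([obs, action], dim=-1)
    ret = ret + discount * (soft_value(sa).squeeze(-1) - alpha * log_prob)
    return -ret.mean()


# Shape-only usage example with random tensors.
obs_dim, act_dim = 8, 2
policy = Policy(obs_dim, act_dim)
dynamics = make_mlp(obs_dim + act_dim, obs_dim)
reward = make_mlp(obs_dim + act_dim, 1)
soft_value = make_mlp(obs_dim + act_dim, 1)
loss = svg_policy_loss(torch.randn(32, obs_dim), policy, dynamics, reward, soft_value)
loss.backward()

The design point the sketch illustrates is the split named in the abstract: the value gradient is taken pathwise through the learned model via reparameterized actions (model-based policy improvement), while the terminal term is a model-free soft value estimate (model-free policy evaluation).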