Paper Title
Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization
Paper Authors
Paper Abstract
Offline reinforcement learning (RL) addresses the problem of learning a performant policy from a fixed batch of data collected by following some behavior policy. Model-based approaches are particularly appealing in the offline setting since they can extract more learning signals from the logged dataset by learning a model of the environment. However, the performance of existing model-based approaches falls short of that of model-free counterparts, due to the compounding of estimation errors in the learned model. Driven by this observation, we argue that it is critical for a model-based method to understand when to trust the model and when to rely on model-free estimates, and how to act conservatively w.r.t. both. To this end, we derive an elegant and simple methodology called conservative Bayesian model-based value expansion for offline policy optimization (CBOP), that trades off model-free and model-based estimates during the policy evaluation step according to their epistemic uncertainties, and facilitates conservatism by taking a lower bound on the Bayesian posterior value estimate. On the standard D4RL continuous control tasks, we find that our method significantly outperforms previous model-based approaches: e.g., MOPO by $116.4$%, MOReL by $23.2$% and COMBO by $23.7$%. Further, CBOP achieves state-of-the-art performance on $11$ out of $18$ benchmark datasets while performing on par on the remaining datasets.
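To make the abstract's central idea concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of how model-based h-step value-expansion targets and a model-free bootstrap estimate could be fused into one conservative learning target: each estimate is weighted by the inverse of its epistemic variance to form a Gaussian posterior over the value, and a lower bound on that posterior is returned. The function name, the `lcb_coef` parameter, and the specific lower-bound form are illustrative assumptions.

```python
import numpy as np

def conservative_bayesian_target(mb_returns, mf_value, mf_std, lcb_coef=1.0):
    """Illustrative conservative value target in the spirit of CBOP.

    mb_returns : array of shape (num_horizons, num_samples); sampled h-step
        model-based value-expansion targets for a single state-action pair.
    mf_value, mf_std : mean and std of the model-free (bootstrapped Q) estimate.
    lcb_coef : hypothetical coefficient controlling how conservative the
        lower bound is.
    """
    # Per-horizon means and epistemic variances of the model-based rollouts.
    means = mb_returns.mean(axis=1)
    variances = mb_returns.var(axis=1) + 1e-8

    # Treat the model-free estimate as one more "observation" of the value.
    means = np.append(means, mf_value)
    variances = np.append(variances, mf_std ** 2 + 1e-8)

    # Gaussian posterior: precision-weighted average of all estimates.
    precisions = 1.0 / variances
    post_var = 1.0 / precisions.sum()
    post_mean = post_var * (precisions * means).sum()

    # Conservative target: lower bound on the posterior value estimate.
    return post_mean - lcb_coef * np.sqrt(post_var)
```

Under this sketch, "trusting the model" simply means that low-variance model rollouts receive high weight; when the learned model is unreliable, its targets carry high epistemic variance and the fused target falls back toward the model-free estimate, while the lower bound keeps the overall target conservative.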