Paper Title
Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief
Paper Authors
Paper Abstract
Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model. While the dynamics model is learned by reusing the static dataset, its generalization ability can hopefully promote policy learning if properly exploited. To that end, several works propose to quantify the uncertainty of the predicted dynamics and explicitly apply it to penalize the reward. However, as the dynamics and the reward are intrinsically different factors in the context of an MDP, characterizing the impact of dynamics uncertainty through a reward penalty may incur an unexpected tradeoff between model utilization and risk avoidance. In this work, we instead maintain a belief distribution over the dynamics, and evaluate/optimize the policy through biased sampling from the belief. The sampling procedure, biased towards pessimism, is derived from an alternating Markov game formulation of offline RL. We formally show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief. To improve the policy, we devise an iterative regularized policy optimization algorithm for the game, with a guarantee of monotonic improvement under certain conditions. For practical implementation, we further devise an offline RL algorithm to approximately find the solution. Empirical results show that the proposed approach achieves state-of-the-art performance on a wide range of benchmark tasks.
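The following is a minimal sketch of one plausible reading of the pessimism-biased sampling idea described in the abstract, assuming an ensemble of learned dynamics models stands in for the belief distribution. The names `dynamics_samples`, `value_fn`, and the order-statistic parameter `k` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pessimistic_next_state(state, action, dynamics_samples, value_fn, k):
    """Illustrative biased sampling from a dynamics belief.

    dynamics_samples: list of callables, each a dynamics model drawn from
        the belief; model(state, action) -> predicted next state.
    value_fn: callable estimating the value of a next state under the
        current policy.
    k: order statistic controlling the degree of pessimism
        (k=1 is most pessimistic, k=len(dynamics_samples) is least).
    """
    # Draw candidate next states from the sampled dynamics models.
    candidates = [model(state, action) for model in dynamics_samples]
    values = np.array([value_fn(s_next) for s_next in candidates])
    # Keep the candidate with the k-th smallest estimated value, so the
    # transition used for evaluation is biased towards pessimism.
    idx = np.argsort(values)[k - 1]
    return candidates[idx]
```

Repeating such rollouts during policy evaluation effectively reweights the belief towards dynamics that are unfavorable for the current policy, which matches the intuition behind the policy-dependent reweighting factor mentioned in the abstract.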