Paper Title

EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL

Paper Authors

Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, Shixiang Shane Gu

Paper Abstract

Off-policy reinforcement learning holds the promise of sample-efficient learning of decision-making policies by leveraging past experience. However, in the offline RL setting -- where a fixed collection of interactions are provided and no further interactions are allowed -- it has been shown that standard off-policy RL methods can significantly underperform. Recently proposed methods often aim to address this shortcoming by constraining learned policies to remain close to the given dataset of interactions. In this work, we closely investigate an important simplification of BCQ -- a prior approach for offline RL -- which removes a heuristic design choice and naturally restricts extracted policies to remain exactly within the support of a given behavior policy. Importantly, in contrast to their original theoretical considerations, we derive this simplified algorithm through the introduction of a novel backup operator, Expected-Max Q-Learning (EMaQ), which is more closely related to the resulting practical algorithm. Specifically, in addition to the distribution support, EMaQ explicitly considers the number of samples and the proposal distribution, allowing us to derive new sub-optimality bounds which can serve as a novel measure of complexity for offline RL problems. In the offline RL setting -- the main focus of this work -- EMaQ matches and outperforms prior state-of-the-art in the D4RL benchmarks. In the online RL setting, we demonstrate that EMaQ is competitive with Soft Actor Critic. The key contributions of our empirical findings are demonstrating the importance of careful generative model design for estimating behavior policies, and an intuitive notion of complexity for offline RL problems. With its simple interpretation and fewer moving parts, such as no explicit function approximator representing the policy, EMaQ serves as a strong yet easy to implement baseline for future work.
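
For readers who want a concrete picture of the backup described above, the following is a minimal sketch of how an EMaQ-style target could be computed: instead of maximizing Q over all actions, N candidate actions are sampled from a fitted behavior model at the next state, and the maximum Q-value among those samples is used. The names and interfaces here (behavior_model.sample, q_net) are illustrative assumptions, not the authors' implementation, and terminal-state handling is omitted.

import torch

def emaq_backup_target(q_net, behavior_model, reward, next_obs, discount, n_samples):
    # Sample N candidate actions per next state from the fitted behavior model
    # (assumed interface: returns a tensor of shape [batch, N, act_dim]).
    candidate_actions = behavior_model.sample(next_obs, n_samples)
    # Repeat each next state N times so Q can score every (state, action) pair.
    next_obs_rep = next_obs.unsqueeze(1).expand(-1, n_samples, -1)   # [batch, N, obs_dim]
    # Assumed interface: q_net returns a [batch, N] tensor of Q-values.
    q_values = q_net(next_obs_rep, candidate_actions)
    # Max over the N sampled actions stands in for the full max over actions,
    # while keeping the target within the support of the behavior policy.
    max_q = q_values.max(dim=1).values
    return reward + discount * max_q

Increasing n_samples pushes this sampled max closer to a full maximization over the behavior policy's support; the dependence of the bounds on the number of samples and the proposal distribution is exactly what the abstract refers to as a measure of complexity for offline RL problems.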
