Paper Title
Hierarchical Reinforcement Learning under Mixed Observability
Paper Authors
Paper Abstract
The framework of mixed observable Markov decision processes (MOMDP) models many robotic domains in which some state variables are fully observable while others are not. In this work, we identify a significant subclass of MOMDPs defined by how actions influence the fully observable components of the state and how those, in turn, influence the partially observable components and the rewards. This unique property allows for a two-level hierarchical approach we call HIerarchical Reinforcement Learning under Mixed Observability (HILMO), which restricts partial observability to the top level while the bottom level remains fully observable, enabling higher learning efficiency. The top level produces desired goals to be reached by the bottom level until the task is solved. We further develop theoretical guarantees to show that our approach can achieve optimal and quasi-optimal behavior under mild assumptions. Empirical results on long-horizon continuous control tasks demonstrate the efficacy and efficiency of our approach in terms of improved success rate, sample efficiency, and wall-clock training time. We also deploy policies learned in simulation on a real robot.
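
For context, reading only from the abstract above: a MOMDP factors the state as s = (x, y), with x fully observable and y partially observable (the standard formulation of Ong et al., 2010). One plausible formalization of the subclass property described here, not necessarily the paper's exact definition, is the factored dynamics

\[
P(x', y' \mid x, y, a) \;=\; \underbrace{T_x(x' \mid x, a)}_{\text{actions drive } x}\;\underbrace{T_y(y' \mid y, x')}_{x \text{ drives } y},
\qquad r = R(x, y),
\]

that is, the action influences the partially observable component y and the reward only through the fully observable component x. Under this reading, the bottom level of the hierarchy can be trained on x alone as a fully observable subproblem, while only the top, goal-producing level must maintain a belief over y.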