Paper Title

Generalizing LTL Instructions via Future Dependent Options

Paper Authors

Duo Xu, Faramarz Fekri

Paper Abstract

In many real-world applications in control systems and robotics, linear temporal logic (LTL) is a widely used task specification language whose compositional grammar naturally induces temporally extended behaviours across tasks, including conditionals and alternative realizations. An important problem in reinforcement learning (RL) with LTL tasks is to learn task-conditioned policies that can zero-shot generalize to new LTL instructions not observed during training. However, because symbolic observations are often lossy and LTL tasks can have long time horizons, previous works can suffer from sample inefficiency during training and from infeasibility or sub-optimality of the found solutions. To tackle these issues, this paper proposes a novel multi-task RL algorithm with improved learning efficiency and optimality. To achieve global optimality of task completion, we propose to learn options dependent on future subgoals via a novel off-policy approach. To propagate the rewards of satisfying future subgoals back more efficiently, we propose to train a multi-step value function conditioned on the subgoal sequence, which is updated with Monte Carlo estimates of multi-step discounted returns. In experiments on three different domains, we evaluate the LTL generalization capability of agents trained by the proposed method, showing its advantage over previous representative methods.
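To make the value-function idea in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' code) of a value function conditioned on the sequence of remaining subgoals and updated toward Monte Carlo estimates of multi-step discounted returns. The tabular representation, the `GAMMA` and `ALPHA` constants, and the trajectory format are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical illustration of a subgoal-sequence-conditioned value function
# trained with Monte Carlo multi-step discounted returns. All names and
# hyperparameters below are assumptions made for this sketch.
from collections import defaultdict

GAMMA = 0.95   # discount factor (assumed)
ALPHA = 0.1    # learning rate for the tabular update (assumed)

# Tabular value function keyed by (state, remaining-subgoal-sequence).
V = defaultdict(float)

def discounted_returns(rewards, gamma=GAMMA):
    """Compute the Monte Carlo discounted return G_t for every step t."""
    returns, g = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def mc_update(trajectory, subgoal_seq):
    """Move V(s_t, remaining subgoals) toward the observed return G_t.

    `trajectory` is a list of (state, reward, num_subgoals_completed)
    tuples; the value is conditioned on the subgoals still to be achieved.
    """
    states, rewards, done_counts = zip(*trajectory)
    returns = discounted_returns(list(rewards))
    for s, g, k in zip(states, returns, done_counts):
        key = (s, tuple(subgoal_seq[k:]))   # condition on future subgoals
        V[key] += ALPHA * (g - V[key])      # Monte Carlo target

# Toy usage: a 4-step episode pursuing the subgoal sequence ("a", "b").
episode = [("s0", 0.0, 0), ("s1", 1.0, 1), ("s2", 0.0, 1), ("s3", 1.0, 2)]
mc_update(episode, ["a", "b"])
print(dict(V))
```

The key point this sketch tries to convey is that the value estimate depends on the entire sequence of subgoals still ahead, so reward for satisfying later subgoals is propagated back through a single Monte Carlo target rather than through many one-step bootstrapped updates.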
