Formal Controller Synthesis for Continuous-Space MDPs via Model-Free Reinforcement Learning

Paper Title

Formal Controller Synthesis for Continuous-Space MDPs via Model-Free Reinforcement Learning

Authors

Abolfazl Lavaei, Fabio Somenzi, Sadegh Soudjani, Ashutosh Trivedi, Majid Zamani

Abstract

A novel reinforcement learning scheme to synthesize policies for continuous-space Markov decision processes (MDPs) is proposed. This scheme enables one to apply model-free, off-the-shelf reinforcement learning algorithms for finite MDPs to compute optimal strategies for the corresponding continuous-space MDPs without explicitly constructing the finite-state abstraction. The proposed approach is based on abstracting the system with a finite MDP (without constructing it explicitly) with unknown transition probabilities, synthesizing strategies over the abstract MDP, and then mapping the results back over the concrete continuous-space MDP with approximate optimality guarantees. The properties of interest for the system belong to a fragment of linear temporal logic, known as syntactically co-safe linear temporal logic (scLTL), and the synthesis requirement is to maximize the probability of satisfaction within a given bounded time horizon. A key contribution of the paper is to leverage the classical convergence results for reinforcement learning on finite MDPs and provide control strategies maximizing the probability of satisfaction over unknown, continuous-space MDPs while providing probabilistic closeness guarantees. Automata-based reward functions are often sparse; we present a novel potential-based reward shaping technique to produce dense rewards to speed up learning. The effectiveness of the proposed approach is demonstrated by applying it to three physical benchmarks concerning the regulation of a room's temperature, control of a road traffic cell, and of a 7-dimensional nonlinear model of a BMW 320i car.
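The abstract mentions potential-based reward shaping as a way to densify sparse automata-based rewards. The following is a minimal sketch of the classical potential-based form, F(q, q') = gamma * Phi(q') - Phi(q), applied to a toy automaton; it is not the paper's specific technique. The three-state DFA, the labels "a" and "b", the distance-based potential, and all function names are hypothetical choices made only for illustration.

```python
# Hypothetical DFA for the co-safe property "eventually a, and afterwards eventually b".
# State 0: nothing seen yet, state 1: "a" seen, state 2: accepting.
ACCEPTING = 2

# Assumed potential: negative number of DFA transitions still needed to accept.
DISTANCE_TO_ACCEPT = {0: 2, 1: 1, 2: 0}


def dfa_step(q, label):
    """Advance the toy DFA by one observed label."""
    if q == 0 and label == "a":
        return 1
    if q == 1 and label == "b":
        return 2
    return q


def sparse_reward(q, q_next):
    """Reward 1 only on the transition that first reaches the accepting state."""
    return 1.0 if q != ACCEPTING and q_next == ACCEPTING else 0.0


def potential(q):
    """Potential Phi over automaton states (an illustrative choice)."""
    return -float(DISTANCE_TO_ACCEPT[q])


def shaped_reward(q, q_next, gamma=1.0):
    """Sparse reward plus the shaping term gamma * Phi(q') - Phi(q)."""
    return sparse_reward(q, q_next) + gamma * potential(q_next) - potential(q)


def run_trace(labels, gamma=1.0):
    """Return (total sparse reward, total shaped reward) along a label trace."""
    q = 0
    sparse_total, shaped_total = 0.0, 0.0
    for label in labels:
        q_next = dfa_step(q, label)
        sparse_total += sparse_reward(q, q_next)
        shaped_total += shaped_reward(q, q_next, gamma)
        q = q_next
    return sparse_total, shaped_total


if __name__ == "__main__":
    print(run_trace(["c", "a", "c", "b"]))
```

With gamma = 1 the shaping terms telescope, so the shaped return differs from the sparse return only by Phi(final) - Phi(initial). This is why this form of shaping gives the learner intermediate feedback for automaton progress without changing which policies are optimal.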
