Paper Title
Safe Policy Improvement Approaches on Discrete Markov Decision Processes
Paper Authors
Paper Abstract
Safe Policy Improvement (SPI) aims at provable guarantees that a learned policy is at least approximately as good as a given baseline policy. Building on SPI with Soft Baseline Bootstrapping (Soft-SPIBB) by Nadjahi et al., we identify theoretical issues in their approach, provide a corrected theory, and derive a new algorithm that is provably safe on finite Markov Decision Processes (MDPs). Additionally, we provide a heuristic algorithm that exhibits the best performance among many state-of-the-art SPI algorithms on two different benchmarks. Furthermore, we introduce a taxonomy of SPI algorithms and empirically show an interesting property of two classes of SPI algorithms: while the mean performance of algorithms that incorporate the uncertainty as a penalty on the action-value is higher, actively restricting the set of policies more consistently produces good policies and is, thus, safer.
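To make the distinction between the two classes in the taxonomy concrete, the following minimal Python sketch (not taken from the paper; all values, thresholds, and the penalty weight are illustrative assumptions) contrasts, for a single state of a tabular MDP, a penalty-based policy that subtracts an uncertainty term from the estimated action-values with a restriction-based policy that keeps the baseline's probabilities on uncertain actions in the style of SPIBB.

```python
# Illustrative sketch (not from the paper): contrasting the two SPI classes
# from the abstract on a single state of a tabular MDP.
import numpy as np

rng = np.random.default_rng(0)

n_actions = 4
q_hat = rng.normal(size=n_actions)          # estimated action-values (hypothetical)
err = rng.uniform(0.0, 1.0, n_actions)      # per-action uncertainty, e.g. from count-based bounds
pi_b = np.full(n_actions, 1.0 / n_actions)  # baseline policy (uniform, for illustration)

# Class 1: penalise the action-value by its uncertainty, then act greedily.
kappa = 0.5                                 # penalty weight (hypothetical hyperparameter)
pi_penalty = np.zeros(n_actions)
pi_penalty[np.argmax(q_hat - kappa * err)] = 1.0

# Class 2: restrict the set of policies -- keep the baseline's probability on
# uncertain actions (SPIBB-style) and only redistribute the remaining mass.
uncertain = err > 0.5                       # uncertainty threshold (hypothetical)
pi_restricted = np.where(uncertain, pi_b, 0.0)
free_mass = 1.0 - pi_restricted.sum()
safe_actions = np.flatnonzero(~uncertain)
if safe_actions.size > 0:
    best_safe = safe_actions[np.argmax(q_hat[safe_actions])]
    pi_restricted[best_safe] += free_mass
else:
    pi_restricted = pi_b.copy()             # nothing is safe: fall back to the baseline

print("penalty-based policy:    ", pi_penalty)
print("restriction-based policy:", pi_restricted)
```

In this toy contrast, the penalty-based policy may still commit fully to an action whose value estimate outweighs its penalty, whereas the restriction-based policy never deviates from the baseline on poorly estimated actions, which mirrors the abstract's observation that restricting the policy set trades some mean performance for more consistently safe behaviour.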