Paper Title
Safe Policy Improvement Approaches on Discrete Markov Decision Processes
Paper Authors
Paper Abstract
Safe Policy Improvement (SPI) aims at provable guarantees that a learned policy is at least approximately as good as a given baseline policy. Building on SPI with Soft Baseline Bootstrapping (Soft-SPIBB) by Nadjahi et al., we identify theoretical issues in their approach, provide a corrected theory, and derive a new algorithm that is provably safe on finite Markov Decision Processes (MDPs). Additionally, we provide a heuristic algorithm that exhibits the best performance among many state-of-the-art SPI algorithms on two different benchmarks. Furthermore, we introduce a taxonomy of SPI algorithms and empirically show an interesting property of two classes of SPI algorithms: while the mean performance of algorithms that incorporate the uncertainty as a penalty on the action-value is higher, actively restricting the set of policies more consistently produces good policies and is, thus, safer.
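To make the distinction between the two classes in the taxonomy concrete, the following minimal Python sketch (not taken from the paper; all values, thresholds, and the penalty weight are illustrative assumptions) contrasts, for a single state of a tabular MDP, a penalty-based policy that subtracts an uncertainty term from the estimated action-values with a restriction-based policy that keeps the baseline's probabilities on uncertain actions in the style of SPIBB.

```python
# Illustrative sketch (not from the paper): contrasting the two SPI classes
# from the abstract on a single state of a tabular MDP.
import numpy as np

rng = np.random.default_rng(0)

n_actions = 4
q_hat = rng.normal(size=n_actions)          # estimated action-values (hypothetical)
err = rng.uniform(0.0, 1.0, n_actions)      # per-action uncertainty, e.g. from count-based bounds
pi_b = np.full(n_actions, 1.0 / n_actions)  # baseline policy (uniform, for illustration)

# Class 1: penalise the action-value by its uncertainty, then act greedily.
kappa = 0.5                                 # penalty weight (hypothetical hyperparameter)
pi_penalty = np.zeros(n_actions)
pi_penalty[np.argmax(q_hat - kappa * err)] = 1.0

# Class 2: restrict the set of policies -- keep the baseline's probability on
# uncertain actions (SPIBB-style) and only redistribute the remaining mass.
uncertain = err > 0.5                       # uncertainty threshold (hypothetical)
pi_restricted = np.where(uncertain, pi_b, 0.0)
free_mass = 1.0 - pi_restricted.sum()
safe_actions = np.flatnonzero(~uncertain)
if safe_actions.size > 0:
    best_safe = safe_actions[np.argmax(q_hat[safe_actions])]
    pi_restricted[best_safe] += free_mass
else:
    pi_restricted = pi_b.copy()             # nothing is safe: fall back to the baseline

print("penalty-based policy:    ", pi_penalty)
print("restriction-based policy:", pi_restricted)
```

In this toy contrast, the penalty-based policy may still commit fully to an action whose value estimate outweighs its penalty, whereas the restriction-based policy never deviates from the baseline on poorly estimated actions, which mirrors the abstract's observation that restricting the policy set trades some mean performance for more consistently safe behaviour.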