Paper Title
CertRL: Formalizing Convergence Proofs for Value and Policy Iteration in Coq
Paper Authors
Paper Abstract
Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward. The desire to use reinforcement learning in safety-critical settings inspires a recent line of work on formally constrained reinforcement learning; however, these methods place the implementation of the learning algorithm in their Trusted Computing Base. The crucial correctness property of these implementations is a guarantee that the learning algorithm converges to an optimal policy. This paper begins the work of closing this gap by developing a Coq formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite state Markov decision processes. The central results are a formalization of Bellman's optimality principle and its proof, which uses a contraction property of the Bellman optimality operator to establish that a sequence converges in the infinite horizon limit. The CertRL development exemplifies how the Giry monad and mechanized metric coinduction streamline optimality proofs for reinforcement learning algorithms. The CertRL library provides a general framework for proving properties about Markov decision processes and reinforcement learning algorithms, paving the way for further work on the formalization of reinforcement learning algorithms.
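To make the objects named in the abstract concrete, the sketch below shows, in Coq, a finite MDP, one application of the Bellman optimality operator, and value iteration as repeated application of that operator. It is a minimal sketch only: the record `MDP` and the definitions `bellman`, `sum_list`, `max_list`, and `value_iter` are hypothetical names introduced here for illustration and are not the CertRL library's actual definitions, which are built on the Giry monad and a metric-space development.

```coq
(* A minimal illustrative sketch, not the CertRL formalization. *)
Require Import List Reals.
Open Scope R_scope.

(* Hypothetical finite MDP: states and actions enumerated as lists,
   transitions given as probabilities P(s' | s, a). *)
Record MDP := {
  state   : Type;
  action  : Type;
  states  : list state;
  actions : list action;
  trans   : state -> action -> state -> R;  (* P(s' | s, a) *)
  reward  : state -> action -> state -> R;  (* r(s, a, s') *)
  gamma   : R                               (* discount factor in [0, 1) *)
}.

(* Sum and maximum of a list of reals (0 as base case, for the sketch). *)
Definition sum_list (l : list R) : R := fold_right Rplus 0 l.
Definition max_list (l : list R) : R := fold_right Rmax 0 l.

(* One application of the Bellman optimality operator:
   (T V)(s) = max_a sum_{s'} P(s'|s,a) * (r(s,a,s') + gamma * V(s')). *)
Definition bellman (m : MDP) (V : state m -> R) (s : state m) : R :=
  max_list
    (map (fun a =>
            sum_list
              (map (fun s' =>
                      trans m s a s' * (reward m s a s' + gamma m * V s'))
                   (states m)))
         (actions m)).

(* Value iteration: iterate the Bellman operator n times from an
   initial value function V. *)
Fixpoint value_iter (m : MDP) (n : nat) (V : state m -> R) : state m -> R :=
  match n with
  | O => V
  | S k => bellman m (value_iter m k V)
  end.
```

The connection to the paper's central result is that the Bellman optimality operator is a γ-contraction in the sup norm, so the iterates produced by `value_iter` converge to the optimal value function in the infinite horizon limit; CertRL establishes this formally, using metric coinduction to obtain the convergence and optimality theorems. The sketch above only fixes notation for that discussion.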