Paper title
Online Shielding for Stochastic Systems
Paper authors
Paper abstract
In this paper, we propose a method for developing trustworthy reinforcement learning systems. To ensure safety, especially during exploration, we automatically synthesize a correct-by-construction runtime enforcer, called a shield, that blocks every agent action that is unsafe with respect to a temporal logic specification. Our main contribution is a new synthesis algorithm that computes the shield online. Existing offline shielding approaches exhaustively compute the safety of all state-action combinations ahead of time, which results in long offline computation times, large memory consumption, and significant run-time delays caused by look-ups in a huge database. The intuition behind online shielding is to compute, at run-time, the set of all states that could be reached in the near future. For each of these states, the safety of all available actions is analysed, and the result is used for shielding as soon as one of the considered states is reached. The proposed method is general and can be applied to a wide range of planning problems with stochastic behavior. For our evaluation, we selected a 2-player version of the classic computer game SNAKE. The game requires fast decisions, and the multiplayer setting induces a large state space that is computationally expensive to analyze exhaustively. The safety objective, collision avoidance, transfers readily to a variety of planning tasks.
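The online-shielding intuition described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's synthesis algorithm. The toy MDP, the state and action names, the bounded-horizon safety measure, and the fixed probability threshold are all assumptions made for illustration. The sketch enumerates the states reachable within a short lookahead and, for each of them, keeps only the actions whose probability of avoiding the unsafe state over a finite horizon meets the threshold.

```python
from collections import deque

# Hypothetical toy MDP (an assumption for illustration, not taken from the paper):
# TRANSITIONS[state][action] is a list of (probability, next_state) pairs.
TRANSITIONS = {
    "s0": {"left": [(1.0, "s1")], "right": [(0.5, "s2"), (0.5, "crash")]},
    "s1": {"stay": [(1.0, "s1")]},
    "s2": {"stay": [(1.0, "s2")], "risky": [(0.8, "s2"), (0.2, "crash")]},
    "crash": {"stay": [(1.0, "crash")]},
}
UNSAFE = {"crash"}


def reachable_within(start, depth):
    """Return all states reachable from `start` in at most `depth` steps (BFS)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        state, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand beyond the lookahead depth
        for outcomes in TRANSITIONS[state].values():
            for _, nxt in outcomes:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, dist + 1))
    return seen


def safety_value(state, horizon):
    """Maximal probability of avoiding UNSAFE for `horizon` further steps."""
    if state in UNSAFE:
        return 0.0
    if horizon == 0:
        return 1.0
    return max(action_safety(state, a, horizon) for a in TRANSITIONS[state])


def action_safety(state, action, horizon):
    """Probability of staying safe for `horizon` steps when taking `action` first
    and acting as safely as possible afterwards."""
    return sum(p * safety_value(nxt, horizon - 1)
               for p, nxt in TRANSITIONS[state][action])


def compute_shield(start, lookahead, horizon, threshold):
    """Map each safe state reachable within `lookahead` steps to its allowed actions.

    The absolute `threshold` is an assumed safety criterion for this sketch."""
    shield = {}
    for state in reachable_within(start, lookahead):
        if state in UNSAFE:
            continue
        shield[state] = [a for a in TRANSITIONS[state]
                         if action_safety(state, a, horizon) >= threshold]
    return shield


shield = compute_shield("s0", lookahead=1, horizon=3, threshold=0.9)
print(shield["s0"])  # actions still available to the agent in s0
```

In this sketch the shield for the states one step ahead is already available when the agent arrives there, which mirrors the idea of analysing nearby states before they are reached. The recursive safety computation is not memoized and would need a proper model checker or dynamic programming for any realistic state space.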