Model-Based Actor-Critic with Chance Constraint for Stochastic System

Paper title

Model-Based Actor-Critic with Chance Constraint for Stochastic System

Authors

Baiyu Peng, Yao Mu, Yang Guan, Shengbo Eben Li, Yuming Yin, Jianyu Chen

Abstract

Safety is essential for reinforcement learning (RL) applied in real-world situations. Chance constraints are well suited to representing safety requirements in stochastic systems. Previous chance-constrained RL methods usually converge slowly or learn only a conservative policy. In this paper, we propose a model-based chance-constrained actor-critic (CCAC) algorithm that can efficiently learn a safe and non-conservative policy. Unlike existing methods that optimize a conservative lower bound, CCAC directly solves the original chance-constrained problem, in which the objective function and the safe probability are optimized simultaneously with adaptive weights. To improve the convergence rate, CCAC uses the gradient of the dynamic model to accelerate policy optimization. The effectiveness of CCAC is demonstrated on a stochastic car-following task. Experiments indicate that, compared with previous RL methods, CCAC improves performance while guaranteeing safety and converges five times faster. Its online computation is also 100 times more efficient than that of traditional safety techniques such as stochastic model predictive control.
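To make the idea of a chance constraint concrete, the sketch below estimates a safe probability by Monte Carlo rollouts of a toy stochastic car-following model and then adjusts a penalty weight according to how far that estimate is from a required level. This is an illustration only, not the CCAC algorithm: the dynamics, the proportional-derivative controller, the 0.95 safety level, the function names, and the weight-update rule are all assumptions made for this example, and the sketch uses no model gradients.

```python
import random


def rollout_is_safe(gain, rng, steps=100, dt=0.1, noise_std=1.0, min_gap=2.0):
    """Simulate one episode of a toy car-following model (assumed dynamics).

    Returns True if the gap to the lead vehicle never drops below min_gap.
    """
    gap, rel_speed = 10.0, 0.0  # metres; lead speed minus ego speed in m/s
    for _ in range(steps):
        lead_accel = rng.gauss(0.0, noise_std)  # random lead-vehicle acceleration
        # Hypothetical controller: pull the gap back toward 10 m, damp relative speed.
        ego_accel = gain * (gap - 10.0) + 1.0 * rel_speed
        rel_speed += (lead_accel - ego_accel) * dt
        gap += rel_speed * dt
        if gap < min_gap:
            return False
    return True


def safe_probability(gain, n_rollouts=500, seed=0):
    """Monte Carlo estimate of Pr{gap stays >= min_gap over the whole episode}."""
    rng = random.Random(seed)
    safe = sum(rollout_is_safe(gain, rng) for _ in range(n_rollouts))
    return safe / n_rollouts


def update_weight(weight, p_safe, required=0.95, step=5.0):
    """Illustrative adaptive weight: grow it while the constraint is violated,
    shrink it (never below zero) once the safe probability exceeds the target."""
    return max(0.0, weight + step * (required - p_safe))


if __name__ == "__main__":
    p = safe_probability(gain=1.0)
    print("estimated safe probability:", p)
    print("updated safety weight:", update_weight(1.0, p))
```

The chance constraint here is the requirement that the estimated safe probability reach the `required` level; the weight plays the role of a penalty multiplier that would trade off the objective against safety in a combined loss.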
