Paper Title
Deep Constrained Q-learning
Paper Authors
Paper Abstract
In many real-world applications, reinforcement learning agents have to optimize multiple objectives while following certain rules or satisfying a list of constraints. Classical methods based on reward shaping, i.e., a weighted combination of the different objectives in the reward signal, or Lagrangian methods, which include the constraints in the loss function, do not guarantee that the agent satisfies the constraints at all points in time and can lead to undesired behavior. When a discrete policy is extracted from an action-value function, safe actions can be ensured by restricting the action space at maximization, but this can lead to sub-optimal solutions among the feasible alternatives. In this work, we propose Constrained Q-learning, a novel off-policy reinforcement learning framework that restricts the action space directly in the Q-update in order to learn the optimal Q-function for the induced constrained MDP and the corresponding safe policy. In addition to single-step constraints referring only to the next action, we introduce a formulation for approximate multi-step constraints under the current target policy based on truncated value functions. We analyze the advantages of Constrained Q-learning in the tabular case and compare Constrained DQN to reward shaping and Lagrangian methods in the application of high-level decision making in autonomous driving, considering constraints for safety, keeping right, and comfort. We train our agent in the open-source simulator SUMO and on the real HighD data set.
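The central idea stated in the abstract is to restrict the action space directly in the Q-update rather than only at policy extraction. Below is a minimal tabular sketch of such an update, assuming a discrete state/action space and a caller-supplied set of constraint-satisfying actions for the next state; the `safe_actions` argument and the fallback for an empty safe set are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def constrained_q_update(Q, s, a, r, s_next, safe_actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step with the target maximization restricted
    to safe actions, as described in the abstract.

    Q            : np.ndarray of shape (n_states, n_actions)
    safe_actions : iterable of action indices in s_next that satisfy the
                   constraints (hypothetical helper output; the paper also
                   derives multi-step constraints from truncated value functions)
    """
    safe_actions = list(safe_actions)
    if safe_actions:
        # Restrict the max in the bootstrapped target to the constrained set.
        target = r + gamma * max(Q[s_next, a_next] for a_next in safe_actions)
    else:
        # Assumption for illustration: if no action is feasible, bootstrap is skipped.
        target = r
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

A greedy policy extracted from the resulting Q-function would likewise argmax only over the safe actions of the current state, so that the learned values and the executed behavior refer to the same constrained action space.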