Paper Title
Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints
Paper Authors
Paper Abstract
In the optimization of dynamical systems, the variables typically have constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach to the problem, where the transition probabilities are not known. In the presence of long-term (or average) constraints, the agent has to choose a policy that maximizes the long-term average reward while satisfying the average constraints in each episode. The key challenge with long-term constraints is that the optimal policy is not deterministic in general, and thus standard Q-learning approaches cannot be used directly. This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDPs with long-term constraints. For any $\gamma \in (0, \frac{1}{2})$, the proposed algorithm is shown to achieve an $O(T^{1/2+\gamma})$ regret bound for the obtained reward and an $O(T^{1-\gamma/2})$ regret bound for the constraint violation, where $T$ is the total number of steps. We note that these are the first results on regret analysis for MDPs with long-term constraints where the transition probabilities are not known a priori.
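To make the combination of Q-learning and constrained optimization concrete, here is a minimal illustrative sketch of a primal-dual (Lagrangian) Q-learning loop on a toy tabular CMDP. This is not the paper's algorithm and carries no regret guarantee; it only shows the general idea of folding a cost constraint into the reward via a multiplier and updating that multiplier with dual ascent. All names (BUDGET, the toy dynamics, step sizes) are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's algorithm): primal-dual Q-learning on a
# toy CMDP. The agent maximizes reward while keeping the average cost below a
# budget by learning Q-values for the Lagrangian reward r - lam * c and
# updating the multiplier lam with a projected dual ascent step.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 4, 2
BUDGET = 0.3            # target upper bound on the long-term average cost
ALPHA, ETA = 0.1, 0.01  # Q-learning and dual step sizes
EPS, GAMMA = 0.1, 0.95  # exploration rate and discount factor

# Random toy dynamics: transition kernel P[s, a], reward R and cost C tables.
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))
R = rng.uniform(0, 1, size=(N_STATES, N_ACTIONS))
C = rng.uniform(0, 1, size=(N_STATES, N_ACTIONS))

Q = np.zeros((N_STATES, N_ACTIONS))
lam = 0.0               # Lagrange multiplier for the cost constraint
s = 0

for t in range(1, 50_001):
    # epsilon-greedy action selection on the Lagrangian Q-values
    a = rng.integers(N_ACTIONS) if rng.random() < EPS else int(np.argmax(Q[s]))
    s_next = rng.choice(N_STATES, p=P[s, a])
    r, c = R[s, a], C[s, a]

    # primal step: standard Q-learning update on the Lagrangian reward r - lam * c
    target = (r - lam * c) + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (target - Q[s, a])

    # dual step: increase lam when the observed cost exceeds the budget,
    # projected back onto lam >= 0
    lam = max(0.0, lam + ETA * (c - BUDGET))

    s = s_next

print(f"final multiplier lam = {lam:.3f}")
```

A randomized (non-deterministic) optimal policy, as required in general under long-term constraints, would additionally need a mechanism beyond greedy action selection, e.g. mixing over actions, which this sketch omits for brevity.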